[Zope] Re: Running more than one instance on windows often block each other

Sune B. Woeller sune at syntetisk.dk
Thu Jul 28 09:27:20 EDT 2005


Tim Peters wrote:
> It's starting to look a lot like the Windows bind() implementation is
> unreliable, sometimes (but rarely -- hard to provoke) allowing two
> sockets to bind to the same (address, port) pair simultaneously,
> instead of raising 'Address already in use' for one of them.  Disaster
> ensues.
> 
> WRT the last version of the code I posted, on another XP Pro SP2
> machine (again after playing registry games to boost the number of
> ephemeral ports) I eventually saw all of:  hangs during accept(); the
> assertion errors I mentioned last time; and mystery "Connection
> refused" errors during connect().
> 
> The variant of the code below _only_ tries to use port 19999.  If it
> can't bind to that on the first try, socktest111() raises an exception
> instead of trying again (or trying a different port number).  Ran two
> processes.  After about 15 minutes, both died with assert errors at
> about the same time (identical, so far as I could tell by eyeball):
> 
> Process A:
> 
> Traceback (most recent call last):
>   File "socktest.py", line 209, in ?
>     assert msg == msg2, (msg, msg2, r.getsockname(), w.getsockname())
> AssertionError: ('292739', '821744', ('127.0.0.1', 19999), ('127.0.0.1', 3845))
> 
> Process B:
> 
> Traceback (most recent call last):
>   File "socktest.py", line 209, in ?
>     assert msg == msg2, (msg, msg2, r.getsockname(), w.getsockname())
> AssertionError: ('821744', '292739', ('127.0.0.1', 19999), ('127.0.0.1', 3846))
> 
> So it's again the business where each process is recv'ing the random
> string intended to be recv'ed by a socket in the other process. 
> Hypothesized timeline:
> 
> process A's `a` binds to 19999
> process B's `a` binds to 19999 -- according to me, this should be impossible
>     in the absence of SO_REUSEADDR (which acts very differently on
>     Windows than it does on Linux, BTW -- on Linux this should be impossible
>     even in the presence of SO_REUSEADDR; regardless, we're not using
>     SO_REUSEADDR here, and the braindead hard-coded
> 
>         w.setsockopt(socket.IPPROTO_TCP, 1, 1)
> 
>     is actually using the right magic constant for TCP_NODELAY on
>     Windows, as it intends).
> A and B both listen()
> A connect()s, and accidentally gets on B.a's accept queue
> B connect()s, and accidentally gets on A.a's accept queue
> the rest follows inexorably
> 



This is what I'm experiencing as well.
I can narrow it down a bit: I *always* see one of two erroneous
behaviours, as described below.

I tried to make an even simpler test case, without connecting
sockets 'r' and 'w' to each other in the same process. I try to
reproduce the problem in a 'standard' socket use case, where a client
in one process connects to a server in another process.

The following two scripts act as a server and a client.

#***********************
# sock_server_reader.py
#***********************
import socket

a = socket.socket (socket.AF_INET, socket.SOCK_STREAM)

a.bind(("127.0.0.1", 19999))
print a.getsockname()  # assigned (host, port) pair

a.listen(1)

print "a accepting:"
r, addr = a.accept()  # r becomes asyncore's (self.)socket
print "a accepted: "
print ' ' + str(r.getsockname()) + ', peer=' + str(r.getpeername())

a.close()

msg = r.recv(100)
print 'msg recieved:', msg


#***********************
# sock_client_writer.py
#***********************
import socket, random

w = socket.socket (socket.AF_INET, socket.SOCK_STREAM)
w.setsockopt(socket.IPPROTO_TCP, 1, 1)

print 'w connecting:'
w.connect(('127.0.0.1', 19999))
print 'w connected:'
print w.getsockname()
print ' ' + str(w.getsockname()) + ', peer=' + str(w.getpeername())
msg = str(random.randrange(1000000))
print 'sending msg: ', msg
w.send(msg)




There are two possible outcomes [a) and b)] of running two instances
of this client/server pair (that is, 4 processes in total like the
following).
(Numbers 1 to 4 are steps executed in chronological order.)

1) python -i sock_server_reader.py
The server prints:
	('127.0.0.1', 19999)
	a accepting:
and waits for a connection

2) python -i sock_client_writer.py
The client prints:
	w connecting:
	w connected:
	('127.0.0.1', 3774)
	 ('127.0.0.1', 3774), peer=('127.0.0.1', 19999)
	sending msg:  903848
	>>>

and the server now accepts the connection and prints:
	a accepted:
	 ('127.0.0.1', 19999), peer=('127.0.0.1', 3774)
	msg recieved: 903848
	>>>

This is as it should be. Now let's try to set up a second
client/server pair on the same port (19999). The expected outcome is
that the bind() call in sock_server_reader.py fails with
socket.error: (10048, 'Address already in use').

3) python -i sock_server_reader.py
The server prints:
	('127.0.0.1', 19999)
	a accepting:

The problem already occurs here: bind() is allowed to bind to a port
that is in use, in this case by the connected socket 'r' that the
first server process accepted.
[also on other windows ? Mikkel: yes. Diku:???]

4) python -i sock_client_writer.py
Now one of two things happens:

a) The client prints:
	w connecting:
	Traceback (most recent call last):
	  File "c:\pyscripts\sock_client_writer.py", line 7, in ?
	    w.connect(('127.0.0.1', 19999))
	  File "<string>", line 1, in connect
	socket.error: (10061, 'Connection refused')
	>>>
    The server stays blocked in the call to accept(), still waiting
    for a connection. (This is the blocking behaviour I reported in my
    first mail, experienced when running two Zope instances. The socket
    error was swallowed by the unconditional except clause.)

b) The client connects to the server:
	w connecting:
	w connected:
	('127.0.0.1', 3865)
	 ('127.0.0.1', 3865), peer=('127.0.0.1', 19999)
	sending msg:  119105
	>>>

and the server now accepts the connection and prints:
	a accepted:
	 ('127.0.0.1', 19999), peer=('127.0.0.1', 3865)
	msg recieved: 119105
	>>>

The second set of client/server processes is now connected on the
same port as the first set of client/server processes. In a port
scanner the port now belongs to the second server process [3)].


I always get one of these two outcomes (a or b). I never see bind()
raising socket.error: (10048, 'Address already in use').

It is important to realize that both of these outcomes are errors.
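
Since outcome a) was hidden by an unconditional except clause, here is
a minimal sketch of a non-blocking connect that ignores only the
expected "in progress" condition and lets everything else (such as
'Connection refused') reach the caller. It uses Python 3 syntax, unlike
the Python 2 scripts above, and connect_nonblocking() and its timeout
parameter are names made up for this illustration, not Zope or ZODB
code.

```python
import os
import select
import socket


def connect_nonblocking(sock, address, timeout=5.0):
    """Connect a non-blocking socket without swallowing real failures.

    Hypothetical helper for illustration only.
    """
    sock.setblocking(False)
    try:
        sock.connect(address)
    except BlockingIOError:
        # Expected for a non-blocking connect: the attempt is in progress.
        pass
    # Any other OSError (for example 'Connection refused') propagates.

    # Wait until the attempt has finished. A failure shows up as
    # "writable" on Linux and in the exceptional set on Windows.
    _, writable, failed = select.select([], [sock], [sock], timeout)
    if not writable and not failed:
        raise TimeoutError("connect to %r timed out" % (address,))

    # SO_ERROR holds the result of the finished attempt (0 means success).
    err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
    if err:
        raise OSError(err, os.strerror(err))
    sock.setblocking(True)
```

With a helper like this, a refused connection in step 4) would surface
as an exception in the caller instead of leaving the server silently
blocked in accept().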

I tried the same procedure on a Linux system, and there step 3) always
fails in bind() with 'Address already in use' (EADDRINUSE, the
counterpart of Windows error 10048).


If case a) occurred, where w.connect() raises socket.error: (10061,
'Connection refused'), then trying to start a third client/server pair
makes bind() raise (10048, 'Address already in use'). In this case the
'a' socket of the second server process has not been closed; it is
still blocked in accept().

In my case bind() always raises (10048, 'Address already in use') when
there is an open server socket like 'a' bound to the same port.

To summarize:
Closing a server socket bound to a given port allows another server
socket to bind to the same port, even when there are open client
sockets bound to the port.
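
The summary can be checked from a single process. The following is
only a sketch under some assumptions: it uses Python 3 syntax (the
scripts above are Python 2), lets the OS pick a free port instead of
hard-coding 19999, and rebind_errno() is a name invented here. It
accepts one connection, optionally closes the listening socket, and
then tries to bind a fresh socket to the same address.

```python
import errno
import socket


def rebind_errno(close_listener, host="127.0.0.1"):
    """Accept one connection, then bind a second socket to the same port.

    Returns None if the second bind() succeeds, otherwise the errno it
    failed with. Hypothetical helper for illustration only.
    """
    a = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    w = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    r = None
    try:
        a.bind((host, 0))          # port 0: let the OS pick a free port
        address = a.getsockname()
        a.listen(1)
        w.connect(address)
        r, _ = a.accept()          # 'r' stays open and keeps using the port
        if close_listener:
            a.close()              # the step that precedes the second bind()
        try:
            second.bind(address)   # SO_REUSEADDR is not set anywhere
        except OSError as exc:
            return exc.errno
        return None
    finally:
        for s in (a, w, second, r):
            if s is not None:
                s.close()
```

Going by the observations in this mail, rebind_errno(True) would be
expected to return None on the affected Windows machines and
errno.EADDRINUSE on Linux, while rebind_errno(False) should report
errno.EADDRINUSE on both.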





> Note that because this never tries a port number other than 19999, it
> can't be a bulletproof workaround simply to hold on to the `a` socket.
>  If the hypothesized timeline above is right, bind() can't be trusted
> on Windows in any situation where two processes may try to bind to the
> same hostname:port pair at the same time.  Holding on to `a`, and
> cycling through port numbers when bind() failed, would still
> potentially leave two processes trying to bind to the same port number
> simultaneously (just a port other than 19999).
> 

It would not be enough to keep a reference to 'a'. It would have to be
kept open as well. And maybe that is not a problem, since we only
accept() once: only one 'w' client socket could ever be accepted.
Normally the reason for closing the server socket is to disallow more
connections than those already accepted.
(But I'm not so experienced with sockets, I might be wrong.)
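
As a rough sketch of what keeping 'a' open could look like (Python 3
syntax; connected_pair() is a made-up name, and an OS-assigned port is
used instead of 19999), the listening socket is returned together with
the connected pair so that the caller decides when to close it. This
only illustrates the idea above. It does not address the case where two
processes bind to the same port at the same time.

```python
import socket


def connected_pair(host="127.0.0.1"):
    """Return (listener, r, w) with r and w connected to each other.

    The listener is deliberately left open so that the port stays
    occupied for as long as the pair is in use. Hypothetical helper
    for illustration only.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    w = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        listener.bind((host, 0))   # port 0: let the OS pick a free port
        listener.listen(1)
        w.connect(listener.getsockname())
        r, _ = listener.accept()   # the only accept() ever made
    except OSError:
        listener.close()
        w.close()
        raise
    return listener, r, w
```

A caller would close 'r' and 'w' first and the listener last, for
example when the process shuts down.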


> Ick:  this happens under Pythons 2.3.5 (MSVC 6) and 2.4.1 (MSVC 7.1),
> so if it is -- as is looking more and more likely --an error in MS's
> socket implementation, it isn't avoided by switching to a newer MS C
> library.
> 
> Frankly, I don't see a sane way to worm around this -- it's difficult
> for application code to worm around what smells like a missing
> critical section in system code.
> 
> Using the simpler socket dance from the ZODB 3.4 code, I haven't yet
> seen an instance of the assert failure, or a hang.  However, let two
> processes run that long enough simultaneously, and it always (so far)
> eventually fails with
> 
>     socket.error: (10048, 'Address already in use')
> 
> in the w.connect() call, and despite that Windows picks the port numbers here!
> 
That is exactly what I feared could happen. As shown in my example
above, the other thing that might happen is that the port is 'taken
over' by the other process.


> While that also smells to heaven of a missing critical section in the
> Windows socket implementation, an exception is much easier to live
> with / worm around.  Alas, we don't have the MS source code, and I
> don't have time to try disassembling / reverse-engineering the opcodes
> (what EULA <wink>?), so best I can do is run this for many more hours
> to try to increase confidence that an exception is the worst that can
> occur under the ZODB 3.4 spelling.
> 
> Here's full code for the "only try port 19999" version:
> 
> import socket, errno
> import time, random
> def socktest111():
>     """Raise an exception if we can't get 19999.
>     """
> 
>     a = socket.socket (socket.AF_INET, socket.SOCK_STREAM)
>     w = socket.socket (socket.AF_INET, socket.SOCK_STREAM)
> 
>     # set TCP_NODELAY to true to avoid buffering
>     w.setsockopt(socket.IPPROTO_TCP, 1, 1)
> 
>     # tricky: get a pair of connected sockets
>     host = '127.0.0.1'
>     port = 19999
> 
>     try:
>         a.bind((host, port))
>     except:
>         raise RuntimeError
>     else:
>         print 'b',
> 
>     a.listen (1)
>     w.setblocking (0)
>     try:
>         w.connect ((host, port))
>     except:
>         pass
>     print 'c',
>     r, addr = a.accept()
>     print 'a',
>     a.close()
>     print 'c',
>     w.setblocking (1)
> 
>     return (r, w)
> 
> sofar = []
> try:
>    while 1:
>        try:
>            stuff = socktest111()
>        except RuntimeError:
>            print 'x',
>            time.sleep(random.random()/10)
>            continue
>        sofar.append(stuff)
>        time.sleep(random.random()/10)
>        if len(sofar) == 50:
>            tup = sofar.pop(0)
>            r, w = tup
>            msg = str(random.randrange(1000000))
>            w.send(msg)
>            msg2 = r.recv(100)
>            assert msg == msg2, (msg, msg2, r.getsockname(), w.getsockname())
>            for s in tup:
>                s.close()
> except KeyboardInterrupt:
>    for tup in sofar:
>        for s in tup:
>            s.close()