[Zope] Re: Running more than one instance on windows often block each other

Sune B. Woeller
Thu Jul 28 10:09:54 EDT 2005

btw, the code is slightly modified versions of the getting started with Winsock 

Sune B. Woeller wrote:
> I have made two similar testprograms in c++, and the problem also occurs 
> there. Exactly the same pattern as my python client/server scripts in 
> the mail I am replying to.
> But then I stumbled upon this flag in the WinSock documentation: SO_EXCLUSIVEADDRUSE.
> See the description here:
> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/winsock/winsock/using_so_exclusiveaddruse.asp 
> It is very interesting reading, especially:
> "An important caveat to using the SO_EXCLUSIVEADDRUSE option exists: If 
> one or more connections originating from (or accepted on) a port bound 
> with SO_EXCLUSIVEADDRUSE is active, all bind attempts to that port will 
> fail."
> This is just what we want (and I think that is standard behaviour on 
> Linux).
> I have tested it with my C++ programs, and when I set that option on the 
> server socket before the bind(), it works: bind() in the second server 
> process fails with WSAEADDRINUSE
> (bind() failed: 10048.)
> There is a python bugfix for this, but only for python 2.4:
> http://sourceforge.net/tracker/index.php?func=detail&aid=982665&group_id=5470&atid=305470 
> (It is added to version 1.294 of socketmodule.c)
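A minimal Python sketch of the same idea, assuming a Python whose socket module exposes the SO_EXCLUSIVEADDRUSE constant (2.4 or later on Windows, per the bugfix above). The helper name make_exclusive_listener and the loopback address in the usage note are my own assumptions, and the code uses Python 3 syntax. On platforms without the constant it simply skips the option.

```python
import socket

def make_exclusive_listener(host, port):
    """Return a listening socket, asking Windows for exclusive use of the port.

    make_exclusive_listener is a hypothetical helper name, not part of Zope.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # The constant only exists where the platform's socket module exposes it;
    # elsewhere we skip the option and rely on the default bind() behaviour.
    opt = getattr(socket, "SO_EXCLUSIVEADDRUSE", None)
    if opt is not None:
        s.setsockopt(socket.SOL_SOCKET, opt, 1)
    try:
        s.bind((host, port))  # the option must be set before bind()
    except socket.error:
        s.close()
        raise
    s.listen(1)
    return s
```

Calling make_exclusive_listener("127.0.0.1", 19990) in a second process should then fail in bind() with 10048, as the C++ server does.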
> I run the two test programs from two cmd terminals, like I described for 
> the python versions.
> // link with ws2_32.lib
> //sock_server.cpp
> #include <cstdlib>
> #include <stdio.h>
> #include <conio.h>
> #include "winsock2.h"
> void main() {
>     // Initialize Winsock.
>     WSADATA wsaData;
>     int iResult = WSAStartup( MAKEWORD(2,2), &wsaData );
>     if ( iResult != NO_ERROR )
>         printf("Error at WSAStartup()\n");
>     // Create a socket.
>     SOCKET m_socket;
>     m_socket = socket( AF_INET, SOCK_STREAM, IPPROTO_TCP );
>     if ( m_socket == INVALID_SOCKET ) {
>         printf( "Error at socket(): %ld\n", WSAGetLastError() );
>         WSACleanup();
>         return;
>     }
>     // try to use SO_EXCLUSIVEADDRUSE
>     BOOL bOptVal = TRUE;
>     int bOptLen = sizeof(BOOL);
>     if (setsockopt(m_socket, SOL_SOCKET, SO_EXCLUSIVEADDRUSE, 
> (char*)&bOptVal, bOptLen) != SOCKET_ERROR) {
>         printf("Set SO_EXCLUSIVEADDRUSE: ON\n");
>     }
>     // Bind the socket.
>     sockaddr_in service;
>     service.sin_family = AF_INET;
>     service.sin_addr.s_addr = inet_addr( "" );
>     service.sin_port = htons( 19990 );
>     if ( bind( m_socket, (SOCKADDR*) &service, sizeof(service) ) == 
> SOCKET_ERROR ) {
>         printf( "bind() failed: %i.\n", WSAGetLastError() );
>         closesocket(m_socket);
>         return;
>     }
>     // Listen on the socket.
>     if ( listen( m_socket, 1 ) == SOCKET_ERROR )
>         printf( "Error listening on socket.\n");
>     // Accept connections.
>     SOCKET AcceptSocket;
>     printf( "Waiting for a client to connect...\n" );
>     while (1) {
>         AcceptSocket = SOCKET_ERROR;
>         while ( AcceptSocket == SOCKET_ERROR ) {
>             AcceptSocket = accept( m_socket, NULL, NULL );
>         }
>         printf( "Client Connected.\n");
>         //m_socket = AcceptSocket;
>         break;
>     }
>     closesocket(m_socket);
>     // Send and receive data.
>     int bytesRecv = SOCKET_ERROR;
>     char recvbuf[32] = "";
>     bytesRecv = recv( AcceptSocket, recvbuf, 32, 0 );
>     printf( "Bytes Recv: %ld\n", bytesRecv );
>     printf("Received: %s\n", recvbuf);
>     printf("press a key to terminate\n");
>     getch();
>     return;
> }
> //sock_client.cpp
> #include <stdio.h>
> #include <conio.h>
> #include "winsock2.h"
> void main() {
>     // Initialize Winsock.
>     WSADATA wsaData;
>     int iResult = WSAStartup( MAKEWORD(2,2), &wsaData );
>     if ( iResult != NO_ERROR )
>         printf("Error at WSAStartup()\n");
>     // Create a socket.
>     SOCKET m_socket;
>     m_socket = socket( AF_INET, SOCK_STREAM, IPPROTO_TCP );
>     if ( m_socket == INVALID_SOCKET ) {
>         printf( "Error at socket(): %ld\n", WSAGetLastError() );
>         WSACleanup();
>         return;
>     }
>     // Connect to a server.
>     sockaddr_in clientService;
>     clientService.sin_family = AF_INET;
>     clientService.sin_addr.s_addr = inet_addr( "" );
>     clientService.sin_port = htons( 19990 );
>     if ( connect( m_socket, (SOCKADDR*) &clientService, 
> sizeof(clientService) ) == SOCKET_ERROR) {
>         printf( "Failed to connect.\n" );
>         WSACleanup();
>         return;
>     }
>     // Send and receive data.
>     int bytesSent;
>     char sendbuf[32] = "";
>     printf("Enter string to send (max 30 bytes):\n");
>     scanf("%s", sendbuf );
>     printf("Sending: %s\n", sendbuf);
>     bytesSent = send( m_socket, sendbuf, strlen(sendbuf), 0 );
>     printf( "Bytes Sent: %ld\n", bytesSent );
>     printf("press a key to terminate\n");
>     getch();
>     return;
> }
> Sune B. Woeller wrote:
>> Tim Peters wrote:
>>> It's starting to look a lot like the Windows bind() implementation is
>>> unreliable, sometimes (but rarely -- hard to provoke) allowing two
>>> sockets to bind to the same (address, port) pair simultaneously,
>>> instead of raising 'Address already in use' for one of them.  Disaster
>>> ensues.
>>> WRT the last version of the code I posted, on another XP Pro SP2
>>> machine (again after playing registry games to boost the number of
>>> ephemeral ports) I eventually saw all of:  hangs during accept(); the
>>> assertion errors I mentioned last time; and mystery "Connection
>>> refused" errors during connect().
>>> The variant of the code below _only_ tries to use port 19999.  If it
>>> can't bind to that on the first try, socktest111() raises an exception
>>> instead of trying again (or trying a different port number).  Ran two
>>> processes.  After about 15 minutes, both died with assert errors at
>>> about the same time (identical, so far as I could tell by eyeball):
>>> Process A:
>>> Traceback (most recent call last):
>>>   File "socktest.py", line 209, in ?
>>>     assert msg == msg2, (msg, msg2, r.getsockname(), w.getsockname())
>>> AssertionError: ('292739', '821744', ('', 19999), 
>>> ('', 3845))
>>> Process B:
>>> Traceback (most recent call last):
>>>   File "socktest.py", line 209, in ?
>>>     assert msg == msg2, (msg, msg2, r.getsockname(), w.getsockname())
>>> AssertionError: ('821744', '292739', ('', 19999), 
>>> ('', 3846))
>>> So it's again the business where each process is recv'ing the random
>>> string intended to be recv'ed by a socket in the other process. 
>>> Hypothesized timeline:
>>> process A's `a` binds to 19999
>>> process B's `a` binds to 19999 -- according to me, this should be 
>>> impossible
>>>     in the absence of SO_REUSEADDR (which acts very differently on
>>>     Windows than it does on Linux, BTW -- on Linux this should be 
>>> impossible
>>>     even in the presence of SO_REUSEADDR; regardless, we're not using
>>>     SO_REUSEADDR here, and the braindead hard-coded
>>>         w.setsockopt(socket.IPPROTO_TCP, 1, 1)
>>>     is actually using the right magic constant for TCP_NODELAY on
>>>     Windows, as it intends).
>>> A and B both listen()
>>> A connect()s, and accidentally gets on B.a's accept queue
>>> B connect()s, and accidentally gets on A.a's accept queue
>>> the rest follows inexorably
>> This is what I'm experiencing as well.
>> I can narrow it down a bit: I *always* experience one out of two
>> erroneous behaviours, as described below.
>> I tried to make an even simpler test situation, without binding
>> sockets 'r' and 'w' to each other in the same process. I try to
>> reproduce the problem in a 'standard' socket use case, where a client
>> in one process connects to a server in another process.
>> The following two scripts act as a server and a client.
>> #***********************
>> # sock_server_reader.py
>> #***********************
>> import socket
>> a = socket.socket (socket.AF_INET, socket.SOCK_STREAM)
>> a.bind(("", 19999))
>> print a.getsockname()  # assigned (host, port) pair
>> a.listen(1)
>> print "a accepting:"
>> r, addr = a.accept()  # r becomes asyncore's (self.)socket
>> print "a accepted: "
>> print ' ' + str(r.getsockname()) + ', peer=' + str(r.getpeername())
>> a.close()
>> msg = r.recv(100)
>> print 'msg recieved:', msg
>> #***********************
>> # sock_client_writer.py
>> #***********************
>> import socket, random
>> w = socket.socket (socket.AF_INET, socket.SOCK_STREAM)
>> w.setsockopt(socket.IPPROTO_TCP, 1, 1)
>> print 'w connecting:'
>> w.connect(('', 19999))
>> print 'w connected:'
>> print w.getsockname()
>> print ' ' + str(w.getsockname()) + ', peer=' + str(w.getpeername())
>> msg = str(random.randrange(1000000))
>> print 'sending msg: ', msg
>> w.send(msg)
>> There are two possible outcomes [a) and b)] of running two instances
>> of this client/server pair (that is, 4 processes in total like the
>> following).
>> (Numbers 1 to 4 are steps executed in chronological order.)
>> 1) python -i sock_server_reader.py
>> The server prints:
>>     ('', 19999)
>>     a accepting:
>> and waits for a connection
>> 2) python -i sock_client_writer.py
>> The client prints:
>>     w connecting:
>>     w connected:
>>     ('', 3774)
>>      ('', 3774), peer=('', 19999)
>>     sending msg:  903848
>>     >>>
>> and the server now accepts the connection and prints:
>>     a accepted:
>>      ('', 19999), peer=('', 3774)
>>     msg recieved: 903848
>>     >>>
>> This is as it should be. Now let's try to set up a second
>> client/server pair, on the same port (19999). The expected outcome of
>> this is that the bind() call in sock_server_reader.py should fail with
>> socket.error: (10048, 'Address already in use').
>> 3) python -i sock_server_reader.py
>> The server prints:
>>     ('', 19999)
>>     a accepting:
>> The problem already occurs here: bind() is allowed to bind to a port
>> that is in use, in this case by the accepted connection socket 'r'.
>> [also on other windows ? Mikkel: yes. Diku:???]
>> 4) python -i sock_client_writer.py
>> Now one of two things happens:
>> a) The client prints:
>>     w connecting:
>>     Traceback (most recent call last):
>>       File "c:\pyscripts\sock_client_writer.py", line 7, in ?
>>         w.connect(('', 19999))
>>       File "<string>", line 1, in connect
>>     socket.error: (10061, 'Connection refused')
>>     >>>
>>    The server waits on the call to accept(), still waiting for a
>> connection. (This is the blocking behaviour I reported in my first
>> mail, experienced when running two zope instances. The socket error
>> was swallowed by the unconditional except clause).
>> b) The client connects to the server:
>>     w connecting:
>>     w connected:
>>     ('', 3865)
>>      ('', 3865), peer=('', 19999)
>>     sending msg:  119105
>>     >>>
>> and the server now accepts the connection and prints:
>>     a accepted:
>>      ('', 19999), peer=('', 3865)
>>     msg recieved: 119105
>>     >>>
>> The second set of client/server processes are now connected on the
>> same port as the first set of client/server processes. In a port
>> scanner the port now belongs to the second server process [3)].
>> I always get one of these two outcomes (a or b); I never
>> see bind() raising socket.error: (10048, 'Address already in use').
>> It is important to realize that both these outcomes are an error.
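The unconditional except clause mentioned under a) is what hid the 10061 error. A hedged sketch of a narrower handler for a non-blocking connect(), which ignores only the "still in progress" codes and lets everything else propagate. start_connect and IN_PROGRESS are hypothetical names of mine, and the code uses Python 3 syntax rather than the 2.3/2.4 syntax of the scripts in this thread.

```python
import errno
import socket

# Error codes that only mean "non-blocking connect is still in progress".
# 10035 is WSAEWOULDBLOCK on Windows; errno defines that name only there.
IN_PROGRESS = (errno.EINPROGRESS, errno.EWOULDBLOCK,
               getattr(errno, "WSAEWOULDBLOCK", 10035))

def start_connect(sock, address):
    """Call connect(), ignoring only 'still in progress' errors.

    start_connect is a hypothetical helper. Any other error, such as
    10061 'Connection refused', is re-raised instead of being swallowed.
    """
    try:
        sock.connect(address)
    except socket.error as e:
        if e.errno not in IN_PROGRESS:
            raise
```

With a handler like this, outcome a) would show up as a traceback instead of a silent hang in accept().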
>> I tried the same procedure on a Linux system, and there 3) always
>> fails with 'Address already in use' (EADDRINUSE).
>> If case a) occurred (where w.connect raises socket.error: (10061,
>> 'Connection refused')) and I then try to run a third client/server pair,
>> the bind() call raises (10048, 'Address already in use'). The 'a'-socket
>> from the second pair of processes is not closed in this case, but
>> still trying to accept().
>> In my case bind() always raises (10048, 'Address already in use') when
>> there is an open server socket like 'a' bound to the same port.
>> To summarize:
>> Closing a server socket bound to a given port allows another server
>> socket to bind to the same port, even when there are open accepted
>> sockets still using the port.
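The sequence in that summary can be condensed into one small check. This is only a sketch under my own assumptions: it runs everything in a single process and lets the OS pick the port, whereas the scripts above use separate processes and port 19999, and I have not verified that a single process shows the same behaviour on Windows. rebind_after_listener_closed is a hypothetical name, and the code uses Python 3 syntax.

```python
import socket

def rebind_after_listener_closed(host="127.0.0.1"):
    """Return True if a second bind() succeeds while an accepted
    connection on the same port is still open (the unwanted case)."""
    a = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    a.bind((host, 0))              # port 0: let the OS pick a free port
    port = a.getsockname()[1]
    a.listen(1)
    w = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    w.connect((host, port))
    r, addr = a.accept()
    a.close()                      # listener gone, r and w still connected
    b = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        try:
            b.bind((host, port))   # same port, no SO_REUSEADDR set
            return True
        except socket.error:
            return False
    finally:
        for s in (b, r, w):
            s.close()
```

A return value of False corresponds to the behaviour reported for Linux above; True would match the Windows problem.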
>>> Note that because this never tries a port number other than 19999, it
>>> can't be a bulletproof workaround simply to hold on to the `a` socket.
>>>  If the hypothesized timeline above is right, bind() can't be trusted
>>> on Windows in any situation where two processes may try to bind to the
>>> same hostname:port pair at the same time.  Holding on to `a`, and
>>> cycling through port numbers when bind() failed, would still
>>> potentially leave two processes trying to bind to the same port number
>>> simultaneously (just a port other than 19999).
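For reference, "cycling through port numbers when bind() failed" could look roughly like the sketch below. bind_first_free and its attempts parameter are hypothetical, and the code uses Python 3 syntax. As the paragraph above says, this only helps if bind() fails reliably for a port that is taken, so it is not a workaround for the suspected race.

```python
import socket

def bind_first_free(host, first_port, attempts=10):
    """Try first_port, first_port + 1, ... and return (socket, port)
    for the first bind() that succeeds. Hypothetical helper."""
    last_error = None
    for port in range(first_port, first_port + attempts):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind((host, port))
        except socket.error as e:
            # Port taken (or otherwise unusable): remember why, try the next.
            s.close()
            last_error = e
            continue
        return s, port
    raise RuntimeError("no free port found: %s" % last_error)
```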
>> It would not be enough to keep a reference to 'a'. It would have to be
>> kept open as well. And maybe that is not a problem, since we only
>> accept() once - only one 'w' client socket would be able to be
>> accepted. Normally the use case for closing the server socket is to
>> disallow more connections than those already accepted.
>> (But I'm not so experienced with sockets, I might be wrong.)
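A sketch of what keeping 'a' open might look like, assuming the caller is willing to hold on to the listener for as long as the pair is in use. connected_pair is a hypothetical name, the default of port 0 (OS-chosen port) and the loopback host are my assumptions, and the code uses Python 3 syntax. It also spells the option as socket.TCP_NODELAY instead of the literal 1.

```python
import socket

def connected_pair(host="127.0.0.1", port=0):
    """Return (a, r, w): the still-open listener plus a connected pair.

    Hypothetical helper; the caller must close all three sockets.
    """
    a = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    w = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Disable Nagle's algorithm so small writes are sent immediately.
    w.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    a.bind((host, port))
    a.listen(1)
    w.connect(a.getsockname())   # connect to whatever address bind() gave us
    r, addr = a.accept()
    # Deliberately no a.close() here: the listener stays open so that the
    # port remains visibly in use while r and w are alive.
    return a, r, w
```

Since accept() is called only once, later connection attempts would at most sit in the listen queue and never be served.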
>>> Ick:  this happens under Pythons 2.3.5 (MSVC 6) and 2.4.1 (MSVC 7.1),
>>> so if it is -- as is looking more and more likely --an error in MS's
>>> socket implementation, it isn't avoided by switching to a newer MS C
>>> library.
>>> Frankly, I don't see a sane way to worm around this -- it's difficult
>>> for application code to worm around what smells like a missing
>>> critical section in system code.
>>> Using the simpler socket dance from the ZODB 3.4 code, I haven't yet
>>> seen an instance of the assert failure, or a hang.  However, let two
>>> processes run that long enough simultaneously, and it always (so far)
>>> eventually fails with
>>>     socket.error: (10048, 'Address already in use')
>>> in the w.connect() call, and despite that Windows picks the port 
>>> numbers here!
>> That is exactly what I feared could happen. As shown in my example
>> above, the other thing that might happen is that the port is 'taken over' by
>> the other process.
>>> While that also smells to heaven of a missing critical section in the
>>> Windows socket implementation, an exception is much easier to live
>>> with / worm around.  Alas, we don't have the MS source code, and I
>>> don't have time to try disassembling / reverse-engineering the opcodes
>>> (what EULA <wink>?), so best I can do is run this for many more hours
>>> to try to increase confidence that an exception is the worst that can
>>> occur under the ZODB 3.4 spelling.
>>> Here's full code for the "only try port 19999" version:
>>> import socket, errno
>>> import time, random
>>> def socktest111():
>>>     """Raise an exception if we can't get 19999.
>>>     """
>>>     a = socket.socket (socket.AF_INET, socket.SOCK_STREAM)
>>>     w = socket.socket (socket.AF_INET, socket.SOCK_STREAM)
>>>     # set TCP_NODELAY to true to avoid buffering
>>>     w.setsockopt(socket.IPPROTO_TCP, 1, 1)
>>>     # tricky: get a pair of connected sockets
>>>     host = ''
>>>     port = 19999
>>>     try:
>>>         a.bind((host, port))
>>>     except:
>>>         raise RuntimeError
>>>     else:
>>>         print 'b',
>>>     a.listen (1)
>>>     w.setblocking (0)
>>>     try:
>>>         w.connect ((host, port))
>>>     except:
>>>         pass
>>>     print 'c',
>>>     r, addr = a.accept()
>>>     print 'a',
>>>     a.close()
>>>     print 'c',
>>>     w.setblocking (1)
>>>     return (r, w)
>>> sofar = []
>>> try:
>>>    while 1:
>>>        try:
>>>            stuff = socktest111()
>>>        except RuntimeError:
>>>            print 'x',
>>>            time.sleep(random.random()/10)
>>>            continue
>>>        sofar.append(stuff)
>>>        time.sleep(random.random()/10)
>>>        if len(sofar) == 50:
>>>            tup = sofar.pop(0)
>>>            r, w = tup
>>>            msg = str(random.randrange(1000000))
>>>            w.send(msg)
>>>            msg2 = r.recv(100)
>>>            assert msg == msg2, (msg, msg2, r.getsockname(), 
>>> w.getsockname())
>>>            for s in tup:
>>>                s.close()
>>> except KeyboardInterrupt:
>>>    for tup in sofar:
>>>        for s in tup:
>>>            s.close()
