[ZODB-Dev] ZODB 3.1: Race condition in ZEO reconnect protocol

Fri Feb 4 15:28:35 EST 2005

[Dieter Maurer, from 2003-10-02
 (I may not have time to respond or act, but I don't forget <wink>)
]
> On our fastest servers (and only there) (Dual 2 GHz AMD, 1 GB RAM, Linux,
> Python 2.1.3, ZODB 3.1 CVS from "Zope_2.6branch") the reconnect tests in
> "ZEO/tests/testConnection.py" fail non-deterministically but with high
> probability (in more than 95 % one of the 6 reconnect tests fails; which
> one is chosen randomly).
>
> Apparently, there is a race condition.
...

Turns out there was more than one race, and these were exacerbated in the
test suite because the scaffolding's pollUp() method was sometimes an
unintentional pure busy-loop (and so could starve the thread trying to make
a connection).
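
To illustrate the starvation problem (this is not the actual pollUp() code,
just a minimal sketch in modern Python), here is a polling helper that sleeps
between checks so that other threads get a chance to run.  The names
poll_until, connect_later and connected are hypothetical, made up for this
example; a pure busy-loop would be the same loop without the time.sleep()
call.

```python
import threading
import time


def poll_until(predicate, timeout=5.0, interval=0.01):
    """Return True once predicate() is true, False if the timeout elapses.

    Hypothetical helper for illustration only.  Sleeping between checks
    gives up the CPU, so a thread that is trying to make the condition
    true is not starved by the poller.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)  # without this line the loop is a pure busy-loop
    return predicate()  # one last check at the deadline


# Usage: a worker thread "connects" after a short delay while the main
# thread polls for the result.
connected = threading.Event()


def connect_later():
    time.sleep(0.05)  # stand-in for the work of making a connection
    connected.set()


worker = threading.Thread(target=connect_later)
worker.start()
result = poll_until(connected.is_set, timeout=2.0)
worker.join()
```

When the condition is already tracked by a threading.Event, calling
connected.wait(timeout) directly avoids polling altogether; the loop above is
only meant to show the difference a sleep makes.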

I just checked in a pile of fixes, on the Zope-2_7-branch branch of the
ZODB3 module in CVS.  That will eventually become ZODB 3.2.6.  I have not
ported these to the 3.3 line yet (neither trunk nor 3.3.1 maintenance
branch), but will soon (I hope today).  I don't intend ever to backport them
to the 3.1 line.

There are 8 ZEO reconnection tests (2 don't run unless you pass --all to
test.py).  I have one Windows box and one Linux box where it was extremely
rare for all 8 to pass.  It seems they never fail on the Windows box (3.4
GHz and hyper-threaded) after the patches, or at least not after hours of
running in a loop.  Too soon to say wrt the Linux box, but they all passed
10 times in a row there so far (but almost never passed even "once in a row"
before), and run 3-4x faster than they did before.  The Linux speedup is
mostly due to repairing pollUp(); the Windows box runtime didn't change at
all (it's hard to provoke Windows into starving threads short of changing
their priorities).