[Zope] ZEO troubles on RedHat EL4 Linux

Tim Peters tim.peters at gmail.com
Fri Aug 19 10:58:35 EDT 2005


[Willi Langenberger]
> Ok, here some data points...
>
>  bender:~/Zope-2.7.7-final$ cat /proc/version
>  Linux version 2.6.9-11.ELsmp (bhcompile at decompose.build.redhat.com) (gcc version 3.4.3 20050227 (Red Hat 3.4.3-22)) #1 SMP Fri May 20 18:26:27 EDT 2005
>
>  bender:~/Zope-2.7.7-final$ python2.3
>  Python 2.3.5 (#1, Apr 19 2005, 14:53:39)
>  [GCC 3.4.3 20041212 (Red Hat 3.4.3-9.EL4)] on linux2
>  ...
> 
> Running one single test:
> 
>  bender:~/Zope-2.7.7-final$ python2.3 test.py testConnection checkNoVerificationOnServerRestart\$
>  Running unit tests from /home/wlang/Zope-2.7.7-final/lib/python
>  ======================================================================
>  ERROR: checkNoVerificationOnServerRestart (ZEO.tests.testConnection.FileStorageReconnectionTests)
>  ----------------------------------------------------------------------
>  Traceback (most recent call last):
>    File "/home/wlang/Zope-2.7.7-final/lib/python/ZEO/tests/ConnectionTests.py", line 121, in tearDown
>      os.waitpid(pid, 0)
>  OSError: [Errno 10] No child processes
> 
>  ----------------------------------------------------------------------
>  Ran 1 test in 0.689s
> 
>  FAILED (errors=1)
> 
> After some retries, the same test passes:
> 
>  bender:~/Zope-2.7.7-final$ python2.3 test.py testConnection checkNoVerificationOnServerRestart\$
>  Running unit tests from /home/wlang/Zope-2.7.7-final/lib/python
>  ----------------------------------------------------------------------
>  Ran 1 test in 0.691s
> 
>  OK
>
> Interestingly, if I run the test with strace, I never see the test
> fail (I tried at least 30 times):
>
>  bender:~/Zope-2.7.7-final$ strace -e trace=signal -o /var/tmp/zeotest.trc python2.3 test.py testConnection checkNoVerificationOnServerRestart\$
>  Running unit tests from /home/wlang/Zope-2.7.7-final/lib/python
>  ----------------------------------------------------------------------
>  Ran 1 test in 0.710s
> 
>  OK
> 
> (Obviously a Heisenberg effect -- the observation influences the
> behaviour ;-)

Not unusual, alas.  What's more peculiar is that nobody else on this
list reports the same problem.  Then again, we have no way to know
whether anyone other than Jens and I _tried_ to ;-)

> If anyone is interested in the trace file -- it can be found at:
>
>  http://slime.wu-wien.ac.at/misc/zeotest.trc
>
> (However, it would be way more interesting to see the syscalls while
> the test is failing...)

Someone more up-to-date than I on the vagaries of Linux signals and
strace might be able to deduce something from that about how SIGCHLD
is treated by this OS.  AFAICT, the SIGCHLD handler was set to SIG_DFL
shortly after Python started, and wasn't fiddled with again.  I don't
know the exact intended meaning of every character in the:

--- SIGCHLD (Child exited) @ 0 (0) ---

lines near the end of the trace either.
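
One thing that can be checked from inside Python is the disposition the
interpreter itself sees for SIGCHLD.  A rough sketch, assuming a current
Python 3 (describe_sigchld is a made-up helper name, not anything in
ZEO).  It only reports what signal.getsignal() returns, which is worth
knowing because on Linux a SIGCHLD disposition of SIG_IGN lets the
kernel reap children automatically, and waitpid() then fails with
ECHILD:

```python
import signal


def describe_sigchld():
    """Return a short description of the current SIGCHLD disposition."""
    handler = signal.getsignal(signal.SIGCHLD)
    if handler == signal.SIG_DFL:
        return "SIG_DFL"
    if handler == signal.SIG_IGN:
        return "SIG_IGN"
    if handler is None:
        # A handler exists, but it was not installed from Python code.
        return "not installed from Python"
    return "Python handler: %r" % (handler,)


if __name__ == "__main__":
    print(describe_sigchld())
```

Calling this in the test's setUp and again just before the waitpid() in
tearDown would show whether anything changed the handler in between.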

> Also, I debugged the whole test with the Python debugger. Unfortunately
> (as with strace), I was not able to reproduce the failure of the test
> in the debugger.

[Tim]
>> the ZEO tests spawn processes directly via Python's
>> os.spawnve(), and later wait for them to end, via the waitpid() code
>> shown earlier.  It doesn't muck around with signals, forks, or
>> anything else that should be platform-dependent (the same ZEO-test
>> process code is used on both Linux and Windows, BTW -- for this
>> reason, it can't rely on any fancy signal or process gimmicks;
>> spawnve+waitpid is the entire story here).

> Yes, it's as simple as that: zeo is started, zeo is stopped, and when
> the parent calls waitpid, we get the "No child processes" error most of
> the time :-(
> 
> Any ideas what we can try to narrow this down?

Whittle it down.  If I had a box on which I saw the problem, the next
thing I'd try is writing a tiny Python program that did nothing other
than spawn a simple process and then wait for it to finish.  So far,
there's no particular reason to believe that the mountain of
Zope/ZODB/ZEO code really has anything to do with this, right?  The
outcome of trying to remove all that from the equation would suggest a
next step.
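
A minimal sketch of what such a whittled-down program might look like,
assuming a current Python 3 (the "except ... as" spelling doesn't exist
in 2.3, where it would be "except OSError, exc").  The helper name
spawn_and_wait and the trial count are illustrative only; the child
does nothing but start an interpreter and exit:

```python
import os
import sys


def spawn_and_wait(trials=50):
    """Spawn a trivial child `trials` times; return how many waitpid() calls failed."""
    failures = 0
    for _ in range(trials):
        # P_NOWAIT returns the child's pid immediately, without waiting for it.
        pid = os.spawnve(os.P_NOWAIT, sys.executable,
                         [sys.executable, "-c", "pass"], os.environ)
        try:
            os.waitpid(pid, 0)
        except OSError as exc:
            failures += 1
            print("waitpid(%d) failed: %s" % (pid, exc))
    return failures


if __name__ == "__main__":
    print("%d failure(s)" % spawn_and_wait())
```

If this never fails on the box in question, the next suspects are
whatever the real tests do in addition (threads, for example); if it
does fail, Zope/ZODB/ZEO are off the hook.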

...

> Sure -- we could just make this change:
> 
>  bender:.../ZEO/tests$ diff ConnectionTests.py.ori ConnectionTests.py
>  121c121,124
>  <                 os.waitpid(pid, 0)
>  ---
>  >                 try:
>  >                     os.waitpid(pid, 0)
>  >                 except OSError:
>  >                     pass
> 
> then all tests will pass.

That should be verified (by actually trying it).  For one thing, I
count 8 instances of "os.waitpid(pid, 0)" in the Zope-2_7-branch
branch, and it would be surprising if the other 7 always worked on
your box, right? ;-)

> But then we will not know why the zeo zombie vanishes before
> the waitpid can reap the exit code ;-)

Right, it would be papering over a symptom, leaving the cause unknown.
If you find that expedient in your installation, that's fine, it's a
key advantage of open source that you can worm around problems on your
own.  Of course I don't want to do that in the distributed code
without understanding the problem first (for example, catching OSError
here could _also_ end up hiding genuine bugs later -- there's no
reason we know of to expect that waitpid() can fail here).
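
If a local workaround is wanted anyway, a narrower variant would
swallow only the one error actually seen (errno 10 is ECHILD on Linux)
and let every other OSError propagate.  This is a sketch for a current
Python 3, not a proposed change to ConnectionTests.py, and reap is a
hypothetical helper name:

```python
import errno
import os


def reap(pid):
    """Wait for pid; return its exit status, or None if no such child exists."""
    try:
        _, status = os.waitpid(pid, 0)
        return status
    except OSError as exc:
        if exc.errno != errno.ECHILD:
            # Anything other than "No child processes" is unexpected: re-raise.
            raise
        return None
```

That still papers over the symptom, but at least it can't hide an
unrelated failure.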

> ...
> PS: I'm afraid it turns out to be a Python thread / signals / race
> problem -- yuck!

If you can whittle it down, possible causes will become clearer.

If you want to try some random thrashing, try Python 2.4.1.  Dealing
with signals is a cross-Unix mess, and LinuxThreads fails to conform to
the POSIX standard in some obscure ways related to signals.  2.4.1
tried to worm around that.

