[Zope] ZEO troubles on RedHat EL4 Linux

Tim Peters tim.peters at gmail.com
Fri Aug 19 14:41:14 EDT 2005


[Dieter Maurer]
> There is one essential thing you stress over and over again -- but
> which I am not sure:
> 
>      You say, the exception in "tearDown" means that
>      the test completed successfully -- without any error.

Oh no, that's not what I'm saying.  As you say next,

>      However, I am convinced that "tearDown" is called, too,
>      when the test fails.

That's right, it does.  What I said is that _if_ the only listing of
errors/failures we've seen here was in fact an exhaustive list of all
the errors/failures that were seen in that run, _then_ we can deduce
that the tests passed.  That's simply because if a test body failed,
that would have produced an _additional_ error/failure report.  But
every one of the error/failure output blocks in the message showing
them was the same waitpid() complaint, reached from tearDown(). 
Assuming this was an exhaustive listing, there were no error/failure
reports of any kind stemming from test setup or test body code, only
from test teardown code.

So it's not that I saw an error in tearDown() that causes me to
believe "the tests passed", it's that we haven't seen any
errors/failures _other_ than tearDown() errors.

Willi was later kind enough to include what looked like a screen
scrape of an entire test run, and I think we can be sure the listing
there is exhaustive:

"""
Running one single test:

 bender:~/Zope-2.7.7-final$ python2.3 test.py testConnection checkNoVerificationOnServerRestart\$
 Running unit tests from /home/wlang/Zope-2.7.7-final/lib/python
 ======================================================================
 ERROR: checkNoVerificationOnServerRestart (ZEO.tests.testConnection.FileStorageReconnectionTests)
 ----------------------------------------------------------------------
 Traceback (most recent call last):
   File "/home/wlang/Zope-2.7.7-final/lib/python/ZEO/tests/ConnectionTests.py", line 121, in tearDown
     os.waitpid(pid, 0)
 OSError: [Errno 10] No child processes

 ----------------------------------------------------------------------
 Ran 1 test in 0.689s

 FAILED (errors=1)
"""

If the setup code or body of checkNoVerificationOnServerRestart had
something to complain about too, I would expect to see an additional
blob of ERROR or FAILURE output.  The ConnectionTests.py code that
starts a ZEO server process doesn't swallow exceptions, and simply
cannot add the pid returned from spawnve() to its list of pids to wait
for later unless ZEO/tests/forker.py's start_zeo_server() returns
normally.
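
The shape of that argument, as a rough sketch: this is not the actual
ConnectionTests.py or forker.py code, and every name below
(PidRegistry, good_spawn, bad_spawn, ServerStartFailed) is
hypothetical.  The point is only that when the start function raises,
the line that records the pid never runs.

```python
class ServerStartFailed(Exception):
    """Hypothetical error raised when a server process can't be started."""


class PidRegistry:
    """Hypothetical stand-in for a test case's list of pids to wait for."""

    def __init__(self):
        self.pids = []

    def start(self, spawn):
        pid = spawn()          # an exception here propagates to the caller...
        self.pids.append(pid)  # ...so this line is never reached
        return pid


def good_spawn():
    return 12345  # made-up pid, for illustration only


def bad_spawn():
    raise ServerStartFailed("could not start the server")


registry = PidRegistry()
registry.start(good_spawn)
try:
    registry.start(bad_spawn)
except ServerStartFailed:
    pass  # nothing was registered for the failed start
```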

> I did not point this out earlier, because you are probably right.
> 
>   If the test itself had failed, we should probably have seen
>   a previous exception and a "pid" cannot be registered for
>   later clean up before it was created.

As above, yes.

> Looks as if there were something that eats the dead child before
> the "waitpid" could take care of it.

Yup.

>  I know that a SIGCHLD/SIG_IGN can do that
>  or a "waitpid(pid)" with "pid <= 0".
> 
> If for some reason, a value "<= 0" happened to arrive in the
> list of processes to be cleaned up, then this could explain
> the strange non-deterministic behaviour.

Perhaps they can add some print statements or asserts then, to test
that possibility.  From the Python docs:

    If pid is 0, the request is for the status of any child in the process group
    of the current process.
    If pid is -1, the request pertains to any child of the current process.
    If pid is less than -1, status is requested for any process in the process
    group -pid (the absolute value of pid).
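
Here's a sketch of both halves of that, Unix only.  It assumes the
process running it has no other children, and checked_waitpid() is a
hypothetical helper, not something in ZEO.  First a waitpid(-1) eats
the dead child, and then a later waitpid() on the specific pid gets
the same ECHILD complaint as in the traceback; the assert is the kind
of check that could be dropped into the test code.

```python
import errno
import os


def spawn_child():
    """Fork a child that exits immediately; return its pid."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit without running any cleanup handlers
    return pid


def checked_waitpid(pid):
    """Hypothetical guard: reject pids that waitpid() reads as 'any child'."""
    assert pid > 0, "suspicious pid: %r" % (pid,)
    return os.waitpid(pid, 0)


def eaten_child():
    """Return (whether waitpid(-1) reaped our child, errno of the later wait)."""
    pid = spawn_child()
    reaped, _status = os.waitpid(-1, 0)  # pid -1: wait for any child
    try:
        checked_waitpid(pid)             # the child is already gone
    except OSError as exc:
        return reaped == pid, exc.errno
    return reaped == pid, None


same_child, err = eaten_child()
```

Note that asserts vanish under python -O, so a print statement may be
the safer probe if the tests are ever run that way.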

If the OS happens to return a pid "with the sign bit set", I'm not
sure whether the Python implementation of all this stuff would manage
to do "the right thing".  Python's waitpid() wrapper definitely treats
the pid as a native signed C int, not as being of type pid_t.  OTOH,
pid_t isn't part of standard C, it's a Unix thing, and I believe pid_t
_is_ C int in glibc.  If so, then a pid "with the sign bit set" is
simply impossible to use in a call to waitpid(), so it would be an OS
bug if it ever returned a pid with the sign bit set.
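
A small illustration of why, assuming a 32-bit C int (that width is
my assumption here, and as_signed_32() is a made-up helper): a value
with the sign bit set comes out negative when read as a signed int,
and waitpid() would then take it to mean "any process in process
group -pid" rather than one specific child.

```python
import struct


def as_signed_32(raw):
    """Reinterpret a 32-bit unsigned value as a signed 32-bit int."""
    return struct.unpack("<i", struct.pack("<I", raw))[0]


ordinary = as_signed_32(12345)           # unchanged: sign bit is clear
sign_bit_set = as_signed_32(0x80000001)  # negative: sign bit is set
```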

