[ZODB-Dev] ClientDisconnected error on windows

Jeremy Hylton jeremy@zope.com
Wed, 8 Aug 2001 17:19:05 -0400 (EDT)


>>>>> "JDH" == John D Heintz <jheintz@isogen.com> writes:

  JDH> Didn't work for us.  Also, on linux we get really long pauses
  JDH> where there is no CPU being used - I can only presume python is
  JDH> blocking on the socket.

  JDH> What can I do to capture more detail?  We should be able to
  JDH> package up a reproducible test case - is that the best thing?

A test case that I can run on my machine is probably best.  If you can
give some idea of what the StorageServer is doing, that would help, too.

  JDH> What I can package is our use of ZODB for omniORB.  It would
  JDH> require you to have omniORBpy installed but then it should be
  JDH> trivial to run our code.  Do you want this?

That sounds good.

  JDH> The only other thing I can mention right now is that we are
  JDH> getting these errors during stress testing - multiple threads
  JDH> hitting our server full out.

That may be helpful.

I'm still having trouble understanding what the problem is.  It's hard
for me to grasp the larger system environment and what you think the
correct behavior would be.  The tracebacks alone show a tiny snapshot
of the system state, but not necessarily an illegal one.  So I've got
a bunch of questions below; feel free to respond to only the relevant
ones :-).

The tracebacks you show seem like part of the documented behavior of
ZEO.  You can get a ClientDisconnected error when you commit a
transaction.  You can also get a POSException, a thread.error, or a
socket.error.  [I'll treat your POSException suggestion separately.]

It may take a moment for a client to reconnect, so under heavy load I
can imagine getting this error several times before the client
reconnects.  When you get this error, do you retry the transaction?
If so, does it succeed (eventually)?  Or is this client disconnected
permanently? 
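
To make the retry question concrete, here is a minimal sketch of the
kind of loop I mean.  It is only an illustration: the
ClientDisconnected class below is a stand-in defined for the example
(not the real ZEO exception), and commit_with_retry, its parameters,
and the backoff policy are all hypothetical.

```python
import time


class ClientDisconnected(Exception):
    """Stand-in for the ZEO exception; defined here only for the sketch."""


def commit_with_retry(commit, attempts=5, delay=0.1, sleep=time.sleep):
    """Call commit() until it succeeds or the attempts run out.

    `commit` is any zero-argument callable that performs the whole
    transaction.  If the last attempt still raises ClientDisconnected,
    the exception propagates so the caller can treat the client as
    permanently disconnected.
    """
    for attempt in range(1, attempts + 1):
        try:
            return commit()
        except ClientDisconnected:
            if attempt == attempts:
                raise
            # Give the client a moment to reconnect; wait a little
            # longer after each failure.
            sleep(delay * attempt)
```

If a loop like this eventually succeeds under load, the errors are
probably just the reconnect window; if it never does, the client is
staying disconnected and that is a different problem.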

Or is the problem that you don't think there's any reason for the
socket to be failing in the first place?  The error that you're seeing
on Linux occurs on a non-blocking socket when there's no room left in
the pipe or socket to write the data.  Perhaps this is an asyncore bug
-- that asyncore should be catching this and trying again.  I don't
know what that would mean for Windows, since it's a completely
different error.
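
The Linux case is easy to reproduce outside ZEO.  The sketch below
assumes a POSIX system and a current Python 3 (where the failure
surfaces as BlockingIOError); it does not use asyncore, and the helper
names are made up for the example.  It fills a non-blocking socket
until the write fails, then shows that the same socket accepts data
again once the peer has read what was queued, which is the "catch it
and try again" behavior described above.

```python
import errno
import socket


def fill_until_would_block(sock, chunk=b"x" * 65536, limit=10000):
    """Send on a non-blocking socket until the kernel buffer is full.

    Returns (bytes_sent, errno_of_failed_send); the errno is None if
    `limit` sends completed without an error.
    """
    sent = 0
    for _ in range(limit):
        try:
            sent += sock.send(chunk)
        except BlockingIOError as exc:
            return sent, exc.errno
    return sent, None


def drain(sock, bufsize=65536):
    """Read everything currently queued on a non-blocking socket."""
    total = 0
    while True:
        try:
            data = sock.recv(bufsize)
        except BlockingIOError:
            return total  # nothing left to read right now
        if not data:
            return total  # peer closed the connection
        total += len(data)


def demo():
    a, b = socket.socketpair()
    try:
        a.setblocking(False)
        b.setblocking(False)
        sent, err = fill_until_would_block(a)
        drained = drain(b)
        # With room in the buffer again, the retried send goes through.
        resent = a.send(b"retry")
        return sent, err, drained, resent
    finally:
        a.close()
        b.close()
```

The errno returned by the failed send should be errno.EAGAIN (the same
value as errno.EWOULDBLOCK on Linux).  The error is transient, not a
sign that the connection is broken.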

Jeremy