[ZODB-Dev] [Danger] "ZEOClientStorage" does not detect lost ZEO connection

Dieter Maurer dieter at handshake.de
Sat Jun 25 01:35:04 EDT 2005


During high-availability cluster checks we observed that
"ZEOClientStorage"'s approach to detecting the loss of the
connection to its server is unreliable.

It relies on the operating system reporting a lost connection.
However, by default, TCP does not guarantee any notification
of broken connections. While the OS can usually inform the
communication endpoints, there are important cases where it
cannot: network and processor outages.

In our specific case, one of the two ZEO cluster nodes
was switched off for testing purposes. As expected, the
other cluster node took over the ZEO service.
However, one of our ZEO clients did not notice that it had
lost the server connection and happily worked with stale
ZODB data (from its caches) for days. Of course, it did
not try to write ZODB data (otherwise, it would have noticed
the lost connection).


Probably, "ZEOClientStorage" (and the ZEO server) should
use "SO_KEEPALIVE" to enable TCP keepalive messages.
However, the default TCP keepalive timeout (typically 2 hours)
is probably too long for many ZODB applications (like ours).
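
To illustrate the "SO_KEEPALIVE" idea, here is a minimal sketch of
how keepalive could be enabled on a plain Python socket. It is not
ZEO code: the helper name "enable_keepalive" and its default values
are hypothetical. The per-socket timer options ("TCP_KEEPIDLE",
"TCP_KEEPINTVL", "TCP_KEEPCNT") are platform specific, so the sketch
only sets those the local "socket" module exposes; elsewhere the
system-wide defaults remain in force.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Turn on TCP keepalive for `sock` (hypothetical helper, not part of ZEO).

    idle     -- seconds of silence before the first probe is sent
    interval -- seconds between unanswered probes
    count    -- unanswered probes before the connection is reported dead
    """
    # SO_KEEPALIVE itself is widely available.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # The timer options are not available everywhere; set only the ones
    # this platform's socket module defines and silently skip the rest.
    for name, value in (
        ("TCP_KEEPIDLE", idle),
        ("TCP_KEEPINTVL", interval),
        ("TCP_KEEPCNT", count),
    ):
        option = getattr(socket, name, None)
        if option is None:
            continue
        try:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
        except OSError:
            # The constant exists but the OS rejects it: keep the default.
            pass
    return sock
```

With the illustrative values above, a dead peer would be reported
after roughly idle + interval * count seconds (about two minutes)
on platforms that honour all three options.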

I will implement an application-specific keepalive mechanism.
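
For illustration only, one possible shape of such a mechanism is
sketched below. It is not the implementation announced above, and
every name in it ("KeepAliveMonitor", "heard_from_peer", "poll") is
hypothetical. The idea: remember when the peer was last heard from,
request a ping after a period of silence, and treat the connection
as lost if the ping is not answered in time. Actually sending the
ping and reconnecting are left to the caller.

```python
import time


class KeepAliveMonitor:
    """Track peer liveness for one connection (hypothetical sketch)."""

    def __init__(self, ping_interval=30.0, timeout=10.0, clock=time.monotonic):
        self.ping_interval = ping_interval  # silence allowed before pinging
        self.timeout = timeout              # time allowed for the ping reply
        self.clock = clock                  # injectable for testing
        self.last_heard = clock()
        self.ping_sent_at = None

    def heard_from_peer(self):
        """Call whenever any message (including a ping reply) arrives."""
        self.last_heard = self.clock()
        self.ping_sent_at = None

    def poll(self):
        """Return "ok", "send_ping" or "lost"; call periodically."""
        now = self.clock()
        if self.ping_sent_at is not None:
            # A ping is outstanding: has the peer had long enough to answer?
            if now - self.ping_sent_at >= self.timeout:
                return "lost"
            return "ok"
        if now - self.last_heard >= self.ping_interval:
            # Too quiet: ask the caller to send a ping and start the timer.
            self.ping_sent_at = now
            return "send_ping"
        return "ok"
```

A client loop would call poll() from its timer, send a cheap
request to the server on "send_ping", and drop and re-establish
the connection on "lost". Unlike TCP keepalive, the intervals are
then fully under the application's control.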



-- 
Dieter

