[ZODB-Dev] Re: RESOLUTION: Re: more lockup information / zope2.9.6+zodb3.6.2

Tres Seaver tseaver at palladion.com
Wed Apr 18 11:37:15 EDT 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Paul Williams wrote:

> It sounds like we have the same problem.  We had contracted Tres Seaver
> to write us a keepalive tool to ping the server periodically.  This has
> fixed our problem and we haven't had a problem in 8 days.  We used to
> have this problem at least once a day.

I've released the product here:

 http://agendaless.com/Members/tseaver/software/keepalive

Basically, the product provides a "tool" which clients ping with their
status, at configurable intervals;  you can also monitor that status
within the ZMI.  The README explains more:

 http://agendaless.com/Members/tseaver/software/keepalive/keepalive-0.1/README.txt

> Overview
> ========
> 
> This product supplies a tool and some helpers for creating a
> "keepalive" configuration between a ZEO storage server and one or more
> clients (typically Zope application server processes).
> 
> This configuration is intended to keep "enough" traffic moving across
> each client-server connection to defeat TCP-violating "middlemen"
> (firewalls, routers, etc.), which have been observed to abort idle
> connections without doing proper TCP teardown on them.
> 
> The symptom in such a case is that one of the pair (the client)
> believes, even at the kernel level, that its connection to the other
> remains open; the other endpoint (the server) typically sees the
> connection close, and logs that. The hapless client usually ends up
> blocked on a read from the server which can never be satisfied, and must
> be manually restarted or "hupped" to recover.
> 
> The irony here is that ZEO's caching actually contributes to the
> problem: if the "working set" in an application server's cache is
> coherent with its usage patterns (reads), it doesn't need to send any
> packets to the storage server, and thus falls prey to the "idle timeout".
> 
> Theory of Operation
> ===================
> 
> Defeating such hostile behavior at the application level is a bit of
> a kludge: essentially, we must create enough non-cacheable activity on
> each client to force periodic writes / reads to the storage server, with
> a frequency high enough to avoid having the connection appear idle.
> 
> The 'keepalive' product helps the site manager construct a
> configuration which generates some traffic, using the following
> components:
> 
> - A ZODB-based tool, which stores state for each client (minimally, a
>   timestamp), based on a user-defined key. The ZEO traffic in the
>   configuration will be primarily writes and reads to this per-client
>   state.
> 
> - A backported version of the ZServer.Clockserver shipped with Zope
>   2.10.x. The clock server allows the site manager to configure the
>   traffic across the ZEO connection without requiring an external
>   trigger such as cron.
> 
> These two components together can be used to configure a "chatty"
> protocol between each ZEO client and the storage server.
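
A rough sketch of the per-client state described above, in plain Python
rather than the product's actual code.  The class and method names are
hypothetical, and an ordinary dict stands in for the ZODB-backed storage;
in a real deployment the mapping would be persistent, so each ping would
be a write that has to cross the ZEO connection.

```python
import time


class KeepaliveRegistry:
    """Hypothetical stand-in for the ZODB-based tool: state per client key."""

    def __init__(self):
        # Assumption: a plain dict replaces the persistent mapping.  With a
        # persistent object, each update would be committed to the storage
        # server, which is what produces the keepalive traffic.
        self._last_seen = {}

    def ping(self, client_key, now=None):
        """Record a fresh timestamp for client_key and return it."""
        if now is None:
            now = time.time()
        self._last_seen[client_key] = now
        return now

    def stale_clients(self, max_age, now=None):
        """Return the keys whose last ping is older than max_age seconds."""
        if now is None:
            now = time.time()
        return sorted(
            key for key, seen in self._last_seen.items()
            if now - seen > max_age
        )


registry = KeepaliveRegistry()
registry.ping("app1", now=1000.0)   # last heard from 300 seconds ago
registry.ping("app2", now=1290.0)   # last heard from 10 seconds ago
stale = registry.stale_clients(max_age=120, now=1300.0)
```

Each client would call something like ping() under its own key from the
clock server at a period shorter than the middleman's idle timeout; the
stale list is the sort of status one could then surface in the ZMI.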

Paul continued:

> The biggest thing is that it is seen by some as a bug in Zope or Python
> since we fixed it with a keepalive.  How do we definitively clear the ZEO
> infrastructure?  Is it somehow linked to Python code not recognizing the
> connection loss, or is this strictly an iptables issue?  Is it a bug in
> iptables or just a mis-configuration?

First, for clarity, the case we are discussing here is one in which
'netstat' on the client shows that the connection to the server is open,
while 'netstat' on the server shows it as closed (the server's logs also
record the disconnect).  In such a case, Python has had no chance to
detect the closure:  even the *kernel* on the client machine doesn't
know that the connection has gone away.
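
To illustrate why the application cannot see this, here is a small
sketch using only the standard 'socket' module.  It does not reproduce
the firewall behavior; a peer which simply stays silent stands in for
the dropped connection, and the helper name is hypothetical.  Without
a timeout, the recv() below would block indefinitely, which is the
state the stuck client is in.

```python
import socket


def read_or_timeout(sock, seconds):
    """Try to read one byte; return None if nothing arrives in time.

    Hypothetical helper for illustration only.  With no timeout set,
    recv() on a connection the kernel still believes is open simply
    blocks, and no error is ever raised to the application.
    """
    sock.settimeout(seconds)
    try:
        return sock.recv(1)
    except socket.timeout:
        return None


# A connected pair of sockets; 'server' never sends anything, standing
# in for a peer whose packets can no longer reach the client.
client, server = socket.socketpair()
result = read_or_timeout(client, 0.2)   # nothing arrives, so None
client.close()
server.close()
```

A timeout only lets the application notice the silence; it does not
tell it whether the peer is idle or gone, which is why the keepalive
approach generates real traffic instead.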

Paul has heard me on this, but just for the record:  sysadmins who
deploy firewalls which violate TCP in this way in the name of "security"
are DOS-ing themselves.  While it might be tolerable to break the
protocol to end abusive connections across public-facing interfaces,
blindly applying such a rule as a blanket policy on internal networks is
not competent.


Tres.
- --
===================================================================
Tres Seaver          +1 540-429-0999          tseaver at palladion.com
Palladion Software   "Excellence by Design"    http://palladion.com
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGJjsr+gerLs4ltQ4RAp8YAKC5ZKxRVaTkUv6r8biVDzX+mNos2ACgx56v
Cu1+hZt0jGfmuHZOep8E+0I=
=IA+R
-----END PGP SIGNATURE-----


