[Zope] Tickling ZEO?

Paul Winkler pw_lists@slinkp.com
Mon, 28 Oct 2002 14:56:26 -0800


I've been looking for a "standard" way of saying
"hey ZEO, are you up?" that I can work into our failover system.
(btw, our production zeo server is nearly read-only; all changes
are made on dev and then we sync to production using ZSyncer.
So I'm not worried about losing state when we fail over.)
 

The simplest thing would be this, swiped from zctl.py:

import socket

def _check_for_service(host, port):
    """Return 1 if server is found at (host, port), 0 otherwise."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        try:
            s.connect((host, int(port)))
            return 1
        except socket.error:
            return 0
    finally:
        # Always release the socket, whether or not the connect worked.
        s.close()

So we just make a raw socket connection to the ZEO port and
say we succeeded if there's anything there.

However, that doesn't seem very robust. Is it possible for ZEO to get
into a state where it accepts socket connections but won't deliver
data? I don't know, and I feel safer assuming "yes".
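
To make that worry concrete, here is a rough sketch of a slightly stricter socket check, written for a current Python rather than the one this post was written against. The names check_service, timeout and expect_data are made up for illustration. Whether a ZEO server sends any bytes unprompted after accepting a connection is an assumption I have not verified, so expect_data is off by default.

```python
import socket

def check_service(host, port, timeout=2.0, expect_data=False):
    """Return True if (host, port) accepts a TCP connection within `timeout`.

    If expect_data is true, additionally require the peer to send at
    least one byte within the same timeout (an assumption about the
    server's behaviour, not something ZEO is known to guarantee).
    """
    try:
        # create_connection applies the timeout to connect() and to
        # later recv() calls; the with-block closes the socket for us.
        with socket.create_connection((host, int(port)), timeout=timeout) as s:
            if not expect_data:
                return True
            # recv() returns b"" on an orderly close and raises a
            # timeout error (an OSError subclass) if nothing arrives.
            return len(s.recv(1)) > 0
    except OSError:
        return False
```

With expect_data=True, a server that accepts the connection but then sits silent is reported as down instead of up, which is exactly the case the bare connect() cannot tell apart.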

As pointed out here in some thread or other months ago,
we could always set up a script that makes a write transaction
every so often, but this is Bad. The ZODB would grow and grow
and grow with pointless undo data ...


So here's what we came up with, hacked out of stuff found in the ZEO
test scripts: we make a dummy ZEO client, connect it,
and try to pull the root object.

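import ZEO.ClientStorage

# host, port, DummyDB and dummy are defined elsewhere in our script
# (not shown here).
def testzeo():
    storage = ZEO.ClientStorage.ClientStorage(
        (host, port),
        name='ZEO Heartbeat Test at %s:%s ' % (host, port),
        wait_for_server_on_startup=0,
        client=None,
        debug=1,
        cache_size=0)

    storage.registerDB(DummyDB(), None)
    # Load the root object (oid 0); this raises if ZEO can't deliver it.
    storage.load('\0\0\0\0\0\0\0\0', '')
    storage._call.__haveMainLoop = 0
    storage.notifyDisconnected = dummy
    storage.close()
    return 1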


If that fails for any reason, ZEO is down.
As far as we can tell, this *works*: if ZEO is shut down we
get an error, if it's not, we get 1. Great.
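
For the failover side, the "any failure means down" rule could be wrapped up roughly like this. is_up is a hypothetical helper name, not anything from ZEO or zctl.py, and the sketch is in current Python.

```python
def is_up(check):
    """Run check(); return 1 if it returns without raising, 0 otherwise."""
    try:
        check()
    except Exception:
        # Any error at all counts as "down" for failover purposes.
        return 0
    return 1
```

It would be called as is_up(testzeo). Note that this only covers checks that raise; a check that hangs forever would need a timeout on top of it.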

But this seems very non-kosher. I've never heard of anybody opening
and closing ZEO connections every 5 seconds on a production site.
And we've seen a couple of very weird errors just after ZEO starts
- a page loads in Zope, following a link gives a weird error
(like no docstring on a DTML method), then all links after that
are fine.

I can't help but think that our test script is contributing to this
sporadic, only-on-startup flakiness, but I have no real evidence
for that.

So back to the original question... Is there a Right Way to
check if ZEO is really up and running?

-- 

Paul Winkler
http://www.slinkp.com
"Welcome to Muppet Labs, where the future is made - today!"