[ZODB-Dev] ZEO and time.sleep

Benji York benji at zope.com
Wed Mar 28 18:28:08 EDT 2007


Last week I spent a very enjoyable day (no kidding) debugging a very, 
very slow cold-start situation (more than 15 minutes to return from the 
first request).  When making the first request to the app (Zope 3 
based), the app server and storage server would show virtually no CPU 
utilization, and there would be about a megabit of network traffic (on a 
gigabit link).  There was no obvious bottleneck.

After liberal application of strace, tcpdump, wireshark (aka ethereal), 
and the Python profiler we discovered that while waiting for an 
outstanding request for an object to load, ZEO calls a 
threading.Connection instance's wait method with a timeout.  When given 
a timeout that method enters a wait loop with a time.sleep to sleep for 
a while and then see if the condition has been met.

We found that time.sleep on that box had a minimum granularity of 10ms 
(when passed a non-zero value), thus causing each object load to take 
approximately that long.  As you can imagine, that somewhat slowed down 
the retrieval of the several thousand objects required to satisfy the 
initial request(s) (until the ZEO cache was sufficiently warm).

The fix?  Short-term: bump the operating system's timer interrupt on 
that box to 1000Hz from 100Hz, increasing time.sleep's granularity to 
1ms (this was on Linux, Window's time.sleep appears have a much higher 
resolution).

Long-term: Jim has found that the timeout call in the wait-for-result 
code can be avoided, side-stepping the call to time.sleep altogether.
-- 
Benji York
Senior Software Engineer
Zope Corporation


More information about the ZODB-Dev mailing list