[Zope] Urgent help needed: Zope falls over under moderate load

Chris McDonough chrism@zope.com
Tue, 20 Nov 2001 16:38:30 -0500


> I guess I'm confused. Everything that *could* be cached *was* cached.
> And no, I don't run a caching server or a proxy server or anything
> else in front of Zope. I'm a writer, not a programmer.

OK, fair enough.

But your profession still doesn't absolve you from needing to cache
more in order to survive a Slashdotting.  ;-)  Either that or you'll
need to start developing your site with static pages only.  That'd
work too.

> The /. piece hit about 1:00 AM. By 1:01 AM Zope had folded like a
> cheap suit. It's still going down about every 40 minutes or so.
>
> Now remember, my outbound bandwidth is limited to 512Kb.

If a 512Kb/s pipe is saturated with 300-byte requests, that works out
(ignoring latency and the bandwidth the responses themselves consume)
to a potential inbound rate of about 213 requests per second.  That's
still a lot of requests.  For comparison, at normal peak load Slashdot
itself gets about 180 requests/sec.  So 512Kb/s isn't much of a
throttle.
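
In case you want to check my math, the back-of-the-envelope version
(assuming all 512 kilobits/s get spent on 300-byte requests) is:

    512,000 bits/s / (300 bytes * 8 bits/byte) = 512,000 / 2,400
                                              ~= 213 requests/s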

And that assumes your inbound bandwidth is also limited to 512Kb/s;
you only mentioned your outbound in this mail.  If inbound is higher,
it's even more of a problem.

> Am I correct in my understanding that Zope can't handle even 512Kb of
> demand without some technical doohickey in front of it so it doesn't
> fall down?

Your pipe is fat enough to allow lots of requests in, and what you're
serving is probably sufficiently complex to be very slow.  Squishdot
is really not known for its speed.

"Raw" Zope itself could almost certainly handle it, however, if what
you were returning is a DTML method that said "<html>this is a simple
page</html>".  But this isn't what you're returning; Squishdot has a
big say in what shows up.

> No offense intended, but I think two internal Squishdot pages meet
> the definition of pretty dang simple.

Maybe conceptually it's simple, but apps like Squishdot do lots of
stuff in order to generate these pages.  For fun, you should set up a
"barebones" Squishdot with the default homepage and hit it repeatedly
with a load generator like Apache's "ab" (see the example below).
Then try the same thing with a Zope page that is just
"<html>Hello!</html>".  You will see a big difference.  On an 850MHz
box at ZC, I can get Zope to serve about 152 requests/s with the
simple page.
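
If you want to try it, the ab invocations would look something like
this (the hostname and page names here are made up; substitute your
own):

    # 1000 requests, 10 concurrent, against the Squishdot homepage
    ab -n 1000 -c 10 http://yoursite.example.com/

    # same load against a trivial "<html>Hello!</html>" DTML page
    ab -n 1000 -c 10 http://yoursite.example.com/hello

Compare the "Requests per second" lines in the two reports.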

Anybody want to try this with an out-of-the-box Squishdot homepage?
Or a Squishdot story page?  The guy from the KDE dot
(http://dot.kde.org) claimed he could only get about 2 requests/second
out of a Squishdot home page.  After setting up caching properly, he
was able to get about 2000.

> And why does it fall over anyway? This just doesn't make any sense
> to me. I can see it getting slow and timing out, but giving up
> completely and just bailing? What's that about? Explain it to me
> like I'm an intelligent, non-technical friend. Thanks.

The big "bang for buck" solution provider is caching.  Assuming that
you had no problems *before* the slashdotting, that will solve your
problem because it will cause Zope to need to serve far fewer
requests, closer to the number of requests you normally get.  And this
is (I assume) the outcome that you actually want.  I highly recommend
setting up a caching proxy in front of Zope if this sort of load will
be recurring.  It's way faster and cheaper than trying to understand
the problem deeply.  ;-)  Most commercial sites are developed using
this principle, AFAICT.
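
As a sketch of what I mean, here's roughly what a Squid 2.x config
acting as an HTTP accelerator on port 80 in front of Zope on port
8080 might contain (directive names from memory, so double-check them
against the Squid docs):

    http_port 80
    httpd_accel_host 127.0.0.1
    httpd_accel_port 8080
    httpd_accel_single_host on
    httpd_accel_with_proxy off

You'd also need Zope to send cache-friendly headers (e.g. via an
Accelerated HTTP Cache Manager) or the accelerator won't actually
cache anything.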

But if you're as interested in understanding the phenomenon as you
are in solving the problem, and you'd like to help the current
Squishdot maintainer and ZC improve their products' behavior under
load, we'd need to know more about how the site fails under load and
what happens during the failures.  I would be interested in these
results.  It could be a memory leak, a Zope bug, a Squishdot bug,
just about anything.  You need forensic information, and you need to
let the site fail under load in order to get it.

Usually you can get this info by turning on "big M" logging (by
passing "-M detailed.log" at the end of your start.bat script,
maybe).  On Linux, I'd also recommend the ForensicLogger product (see
http://www.zope.org/Members/mcdonc) to gather more details such as
memory and CPU utilization; it doesn't work on Windows, however.  If
you're willing to do this, let the site fail under load, then send me
the log with the failure in it and I'll try to analyze it.
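
For example, on Linux that would look something like the following,
assuming a stock "start" script that passes extra arguments through
to z2.py:

    ./start -M detailed.log

On Windows you'd tack the same two arguments onto the z2.py line in
start.bat.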

Note that you *might* be able to use the AutoLance product at
http://www.zope.org/Members/mcdonc to automatically restart Zope for
you if you've got a memory leak.

HTH,

- C