[Zope-dev] Re: Zope hanging (poss. threads-related)

Marcus Collins mcollins@sunesi.com
Mon, 17 Apr 2000 18:50:41 +0200


> -----Original Message-----
> From: Tres Seaver [mailto:tseaver@palladion.com]
> Sent: 17 April 2000 01:20
> To: zope-dev@zope.org
> Cc: Michel Pelletier; Marcus Collins; Tony Rossignol
> Subject: Re: [Zope-dev] Re: Zope hanging (poss. threads-related)

<snip>

> Here are a couple of commonalities we need to look at:
> 
>   * IE 5 seems to be involved.

From combining the logs of Apache and DebugLogger, it doesn't seem
restricted to IE5. The following are user agents that, it seems, initiated
hung threads over the weekend:

  * Mozilla/4.0 (compatible; MSIE 5.0; AOL 5.0; Windows 98; DigExt)
  * Mozilla/4.0 (compatible; MSIE 4.01; Windows NT)
  * Mozilla/4.7 [en] (Win98; I)

>   * The management interface seems to trigger the problem (no
>     application-specific code involved, just manipulation of "stock"
>     Zope objects).

Although I've experienced hanging after using the management interface, this
doesn't seem to be the sole cause. There was no manage activity between
Zope's last restart and our weekend's hangings.
 
>   * Framesets may be involved.

Definitely. Although other threads, representing non-frameset documents,
have also failed to complete, it seems the the initial hanging is *always*
brought on by frameset documents.

>   * Perhaps premature request cancellation (likely to happen often
>     in the management interface) is involved.

Quite possibly, although it would be incredibly hard to determine a
correlation from log files...
 
> I would like to see a description of the interaction between the 
> ZServer-async-select() thread and the Zope threads to which 
> it dispatches requests, particularly what happens when a Zope thread 
> tries to write to a connection which has been closed by the client 
> without a "clean" shutdown (e.g., attempting to write to the socket 
> with fail with errno == EPIPE).

Do you suspect that IE 5 is not closing connections properly, or that
ZServer is not handling broken connections properly?

> I have a couple of testcases for this problem.  
> (Un)fortunately, I don't use IE5 (and don't go with girls who do! 
> :), and therefore they all work without a hitch.

<snip>

I've been exercising your test script with IE and Netscape, and have several
unfinished requests, although I think at least some of these are the result
of premature request cancellations. However, I did come across several
instances of the following, which possibly warrant attention/explanation:

  request 143894896 did not end
  B 143894896 2000-04-17T14:58:48 GET
/mctest/breakZope/test_stall?seconds:int=30
  I 143894896 2000-04-17T14:58:48 0
  A 143894896 2000-04-17T15:01:05 200 0
  B 143894896 2000-04-17T15:01:45 GET
/mctest/breakZope/test_stall?seconds:int=30
  I 143894896 2000-04-17T15:01:45 0
  A 143894896 2000-04-17T15:04:23 200 0 

It seems strange that *the same thread*, having failed to complete the
request, should begin another. This was evident from the latest hanging log
data from our live site, too, but in more detail and with a twist:

  request 145377824 did not end
  I 145377824 2000-04-15T07:21:53 found_terminator: using StringIO;
size=2057
  B 145377824 2000-04-15T07:21:53 POST
/ContentInput/Work/Generate/ConfirmSubmitContent
  I 145377824 2000-04-15T07:21:53 found_terminator: using StringIO;
size=7300
  I 145377824 2000-04-15T07:21:53 0
  X 145377824 2000-04-15T07:21:53 send_response: before PCGIResponse
  X 145377824 2000-04-15T07:21:53 send_response: before HTTPRequest
  X 145377824 2000-04-15T07:21:53 send_response: before handle 

These must surely be two requests (see the different sizes), although I am
really puzzled at how two different requests to this page (incidentally, not
a frameset) could be made inside of a second, by the same person. It's also
peculiar that the log begins with "I", not "B".

The plot thickens...

In the latest spate of hangings, a thread hung, but *some* requests
continued to be served, while failed. This behaviour continued until it
appeared that all requests had failed, at which point Zope was restarted
(meatware process). During this time period, frameset pages were requested,
and at least some failed.

The above information is expanded on at
http://www.zope.org/Members/tseaver/Projects/HighlyAvailableZope/MarcusColli
ns/

Thanks again,

-- Marcus