[ZCM] [ZC] 1350/26 Assign "Zope HEAD (2.8) chokes under high load"

Collector: Zope Bugs, Features, and Patches ... zope-coders-admin at zope.org
Thu Oct 7 00:38:34 EDT 2004


Issue #1350 Update (Assign) "Zope HEAD (2.8) chokes under high load"
 Status Accepted, Zope/bug critical
To followup, visit:
  http://zope.org/Collectors/Zope/1350

==============================================================
= Assign - Entry #26 by tim_one on Oct 7, 2004 12:38 am

 Status: Pending => Accepted

 Supporters added: tim_one

Assigned this to me, since there's almost certainly one bug in ghostification.
________________________________________
= Comment - Entry #25 by tim_one on Oct 7, 2004 12:28 am

I provoked 3 more failures with that print in place, and in all cases the print triggered, and soon after the test assert-failed or segfaulted, with an obvious connection to the address that got printed.  Sometimes the memory at that address was filled with 0xdb bytes; sometimes it was the gc "refcount too small" assert failure that fired.

So I'm sure all the errors I'm seeing are due to a bug in the basic persistence machinery for ghostifying/deactivation.

Too fried to try to fix it tonight, though; sent email to Jeremy and Jim asking whether they remember why this code does what it does.  I suspect it will remain impossible to understand, yet easy to fix <wink>.
________________________________________
= Comment - Entry #24 by tim_one on Oct 6, 2004 11:46 pm

Ya, that may be on to something.  I added a printf in ghostify(), to print the address of self if self's ring pointer was NULL.  That eventually did print an address, and almost immediately thereafter the same address showed up as op in a visit_decref() assert failure.  The code in ZODB 3.2's ghostify() did not decref self if the ring pointer was NULL.
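
For the record, the diagnostic amounts to something like this standalone sketch (the structs here are stand-ins, not the real cPersistentObject layout; only the ring-pointer idea comes from cPersistence.c):

    #include <stdio.h>

    /* Stand-ins for ZODB's cache-ring structures (sketch only; the
       field names follow cPersistence.h, the rest is fake). */
    typedef struct CPersistentRing {
        struct CPersistentRing *r_prev;
        struct CPersistentRing *r_next;
    } CPersistentRing;

    typedef struct {
        CPersistentRing ring;
        int state;
    } FakePersistent;

    /* The diagnostic: when an object reaches ghostification with a
       NULL ring pointer, print its address so a later crash at the
       same address can be tied back to this object. */
    static void report_null_ring(FakePersistent *self)
    {
        if (self->ring.r_next == NULL)
            printf("ghostify: NULL ring pointer for object at %p\n",
                   (void *)self);
    }

    int main(void)
    {
        FakePersistent p = {{NULL, NULL}, 0};
        report_null_ring(&p);
        return 0;
    }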

Unfortunately, I have no idea how the ring pointer *can* be NULL for a non-ghost object passed to ghostify().  That sounds like a basic question, but this code follows an XXX comment with an even more basic question:

    /* XXX is it ever possible to not have a cache? */

Beats me.
________________________________________
= Comment - Entry #23 by tim_one on Oct 6, 2004 11:19 pm

The `gc->gc.gc_refs != 0' failure is the most useful, because op hasn't been destroyed at that point.  I stared at 7 such assert failures this evening.  In some, op was an OOBTree; in others it was an IOBTree; and once it was a Length2 object.  In all these cases, the object was a ghost at the time.  That's checked by doing:

    ((cPersistentObject*)op)->state
or
    *((char*)op+44) [on a 32-bit box]

If that evaluates to -1, op is a ghost, and so far it always has been for me.
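
Here's a tiny standalone illustration of why those two spellings agree (the struct is a stand-in for cPersistentObject; all that matters is that state lands 44 bytes in, as on the 32-bit box above):

    #include <stdio.h>

    /* Stand-in for cPersistentObject: 44 bytes standing in for
       everything that precedes `state' in the real 32-bit layout,
       then the state byte itself.  state == -1 means ghost. */
    typedef struct {
        char header[44];
        signed char state;
    } FakePersistent;

    int main(void)
    {
        FakePersistent op;
        op.state = -1;  /* make it a ghost */

        /* the readable spelling */
        printf("via member: %d\n", ((FakePersistent *)&op)->state);

        /* the raw spelling, as typed into gdb (assumes plain char is
           signed, as on x86) */
        printf("via offset: %d\n", *((char *)&op + 44));
        return 0;
    }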

That must be a clue, but I don't know what it's pointing at; there's no obvious (to my eyes) refcount error in the C code that ghosts or unghosts persistent objects.

Since BTrees and Buckets aren't ExtensionClass instances on the Zope trunk (though they are persistent objects), this makes me back off from believing "the bug" is in ExtensionClass.  Then again, there may be more than one bug here.

Hmm.  It *is* fishy that ghostify() decrefs self even if self->ring.r_next is NULL.  But now that I think about it 3 times, I have no idea what that code thinks it's doing ...
________________________________________
= Comment - Entry #22 by dunny on Oct 6, 2004 9:06 pm

I can now get a failure when using FileStorage. But the trick is to use a *second* FileStorage mount for the session TOC. So far I've seen:

(a) pass with just one FileStorage for both root and TOC.

(b) failure with FileStorage for root and TemporaryStorage for TOC.

(c) failure with one FileStorage for root and a second FileStorage for TOC.

Where "pass" means no failure after running the test for significantly longer than the time it takes to see a failure in the other cases. :-)

I have not really gotten anywhere with gdb'ing, even after reducing the gc thresholds to (1, 1, 1). It feels like the various errors/asserts that drop me into gdb are quite a way downstream from the source of the error. Though gdb does feel a little less foreign now.
________________________________________
= Comment - Entry #21 by tim_one on Oct 5, 2004 9:35 pm

Chris, I'm going to be in FB tomorrow (Wednesday).  If you are too, maybe you could drop by and we can stare at a stack together.
________________________________________
= Comment - Entry #20 by tim_one on Oct 5, 2004 9:07 pm

Excellent!  It doesn't help <wink>, but it makes sense.
0xdbdbdbdb is the DeadByte pattern.  In a debug build, whenever an object is free()'ed at the C level, Python arranges to fill it entirely with 0xdb bytes.  So whatever res is pointing at here, it's already been free()'ed.  That's again most likely a symptom of a refcount that's too small.  The only other thing I can guess is from the name of the routine -- it's trying to access an attribute of an ExtensionClass.  Crawling up the call stack may (or may not) reveal the name of the attribute.
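
In case it helps anyone else recognize the pattern, here's a toy model of the dead-byte behavior (this is not the real obmalloc code, just the idea):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Toy model of what CPython's debug allocator does when a block is
       freed: fill it with 0xDB, so a stale pointer fetched from the
       block reads back as 0xdbdbdbdb (on a 32-bit box). */
    static void debug_free_fill(void *p, size_t n)
    {
        memset(p, 0xDB, n);
        /* the real allocator would release the block here; we keep it
           around so we can peek at the fill pattern */
    }

    int main(void)
    {
        void **block = malloc(sizeof(void *));
        *block = (void *)block;              /* looks like a valid pointer */
        debug_free_fill(block, sizeof(void *));
        printf("stale read: %p\n", *block);  /* 0xdbdbdbdb... */
        free(block);
        return 0;
    }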

BTW, which Python are you using?  It must be >= 2.3.4 for Zope 2.8.
________________________________________
= Comment - Entry #19 by mcdonc on Oct 5, 2004 8:59 pm

(gdb) p res
$1 = (PyObject *) 0xf5f50994
(gdb) p res->ob_type
$2 = (struct _typeobject *) 0xdbdbdbdb
(gdb) p res->ob_type->ob_type
Cannot access memory at address 0xdbdbdbe7
(gdb)

Fleeg.
________________________________________
= Comment - Entry #18 by tim_one on Oct 5, 2004 8:58 pm

Heh -- it's been 10 years since I used gdb.  Does res make sense?  Does res->ob_type?  Like that.  It's certainly interesting if either of those is, e.g., NULL.
________________________________________
= Comment - Entry #17 by mcdonc on Oct 5, 2004 8:51 pm

FWIW, I am currently sitting at a GDB prompt which was the result of a segfault:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -190473296 (LWP 14867)]
0x001b09a7 in Base_getattro (obj=0xf35d781c, name=0xf6a6c9c8)
    at ExtensionClass/_ExtensionClass.c:145
145                               if (res->ob_type->ob_type == &ExtensionClassType
(gdb)

Is there anything special I should try to preserve about the state of this process before quitting the debugger?  (or would explaining take too long to make it worthwhile?)
________________________________________
= Comment - Entry #16 by tim_one on Oct 5, 2004 8:25 pm

These two:

python: Modules/gcmodule.c:231: visit_decref: Assertion `gc->gc.gc_refs != 0' failed.

Fatal Python error: Objects/frameobject.c:394 object at 0xf35fa634 has negative ref count -2

are both saying that a refcount is too small (there are more pointers to an object than that object's refcount believes there are).  The cause for that must be in C code, since refcounts are maintained entirely in C.

The second one is very odd:  *usually* when this triggers, the ref count it prints is -1.  It's not possible to get to -2 purely by doing Py_DECREF()s.  In a debug build, the expansion of Py_DECREF contains the code that checks for a negative refcount (that's the code that produces the message you're seeing in the second case).  So to get to -2, it would first have needed to get to -1, and Py_DECREF would have bitched *then*.  So that almost suggests a wild store: someone overwriting the ob_refcnt member by mistake.
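
To make the "can't get to -2 purely via Py_DECREF" reasoning concrete, here's a toy model of the debug-build check (not the real Py_DECREF expansion):

    #include <stdio.h>
    #include <stdlib.h>

    /* Toy model: every decrement is immediately followed by the
       negativity check, so a count that only ever moves through
       decrefs gets caught at -1 and can never be observed at -2. */
    static long ob_refcnt = 0;

    static void toy_decref(void)
    {
        if (--ob_refcnt < 0) {
            fprintf(stderr, "Fatal: negative ref count %ld\n", ob_refcnt);
            abort();
        }
    }

    int main(void)
    {
        toy_decref();   /* aborts right here, at -1 ... */
        toy_decref();   /* ... never reached; to see -2, something must
                           have scribbled on ob_refcnt directly */
        return 0;
    }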

Hate to say it, but suspicion in my head is starting to fall on the ExtensionClass-as-newstyle-Python-class code new in 2.8.  Alas, only Jim knows what that code does.  I'm suspicious of that because there are no known bugs of this sort in the Python core; that ExtensionClass code is new in 2.8; the new ExtensionClass code is entirely unused by ZODB on its own, by Zope 3, and by Zope X3; and your other traceback was pointing directly into the new ExtensionClass code.

It might help to reduce the gc thresholds to (1, 1, 1), so that refcounts get checked on every object allocation.  That's a lot slower, but has a much better chance of catching the problem soon after it pops into existence.  The gc.gc_refs != 0 assertion catches a too-small refcount no matter how large the refcount is at the time.  The "negative ref count" assertion can't find a problem until after the object has already been deallocated.
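
If anyone wants to set the thresholds from C rather than from a Python startup file, the equivalent of "import gc; gc.set_threshold(1, 1, 1)" looks roughly like this (a sketch; in practice the one-line Python version is all you need):

    #include <Python.h>

    int main(void)
    {
        PyObject *gc, *r;

        Py_Initialize();
        gc = PyImport_ImportModule("gc");
        if (gc != NULL) {
            /* same effect as gc.set_threshold(1, 1, 1) */
            r = PyObject_CallMethod(gc, "set_threshold", "iii", 1, 1, 1);
            Py_XDECREF(r);
            Py_DECREF(gc);
        }
        Py_Finalize();
        return 0;
    }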
________________________________________
= Comment - Entry #15 by mcdonc on Oct 5, 2004 8:01 pm

Another assertion error:

python: Modules/gcmodule.c:231: visit_decref: Assertion `gc->gc.gc_refs != 0' failed.

I guess I should run this under gdb and investigate.  It's like putting an infant at the wheel of a front end loader, but I'll give it a go.
________________________________________
= Comment - Entry #14 by mcdonc on Oct 5, 2004 7:55 pm

tim sez:

> 1. The first thing to do with a segfault is to build everything in debug
>   mode, both Python and Zope.  Python in particular has lots of C assert()s
>   that vanish in a release build.

I think I did this properly.  I compiled Python with the "--with-pydebug" flag and installed it.  I then compiled the Zope extensions using the debug-compiled Python via "setup.py -i -g".  When compiling the Zope extensions, I saw that the -DNDEBUG flag was missing from all gcc invocations, so I think it's right.

In any case, under this setup, I did a few runs:

Run 1:

After about 20 minutes of continuous running, I saw no segfaults, but neither did I see any assertion errors.  I did see application errors pop up not unlike the ones we bashed into submission on the 2.7 branch (seemingly random KeyErrors resulting from desynchronization of session index and data).

Run 2:

After only about 10 seconds of running, I saw the following assertion error on the Zope console and Python subsequently quit:

Fatal Python error: Objects/frameobject.c:394 object at 0xf35fa634 has negative ref count -2
Aborted

Run 3:

After half an hour, same results as Run 1.

I need my computer back for now, but I will let one run overnight for fun.

dunny sez:

> I also get the segfaults when using TemporaryStorage for the transient
>   object container, but not when using FileStorage.

FWIW I haven't tried FileStorage yet.  All of my tests have been run using TemporaryStorage as the place where the session data gets stored.  Dunny still can't reproduce the error when using FileStorage instead of TemporaryStorage, so that's a clue.  I will try to back this assertion up later.

It's probably also worthwhile noting that I'm using Linux and dunny is using Mac OS X, so the symptom doesn't appear to be platform-specific.

Dunny captured a few segfaults via gdb today (I'm not sure whether his Python was configured --with-pydebug, however):

Program received signal EXC_BAD_ACCESS, Could not access memory.
[Switching to process 14845 thread 0x2903]
0x000e36f4 in visit_decref (op=0x2d29b18, data=0x0) at Modules/gcmodule.c:244
244             if (PyObject_IS_GC(op)) {

.. and ...

0x0168b218 in Base_getattro (obj=0x2d24140, name=0x1663878) at ExtensionClass/_ExtensionClass.c:145
145                               if (res->ob_type->ob_type == &ExtensionClassType

There was one other one that I failed to capture, but it had nothing in common with either of the above.  In any case, there appeared to be no commonality in where the segfaults were encountered.

> 2. When I was trying to stitch a current ZODB into the Zope trunk, I got
>   many segfaults at first.  I don't know what caused any of them, because
>   they all went away by themselves, as I added code to explicitly abort
>   transactions that had suffered a failing commit, and fixed other
>   incompatibilities.  Maybe something is left in an insane state if you
>   don't do that.  Since the segfaults went away without trying to track
>   down any of 'em, I never found out!

The Zope HEAD has the same publisher guts as the 2.7 branch.  The publisher typically manages transactions within Zope exclusively.  None of the session rig code (or any of the code it calls) does its own transaction management (well, aside from registering an object with a transaction; it doesn't commit or abort anything outside of the publisher), so we'd probably need to blame the publisher as the thing that didn't abort a failed transaction properly.  It's hard to do that, though, because it's based on the 2.7 code and appears to do the right thing under 2.7.
________________________________________
= Comment - Entry #13 by dunny on Oct 5, 2004 4:55 am

I also get the segfaults when using TemporaryStorage for the transient object container, but not when using FileStorage.

Seems to break in a few different places. (I have yet to collect any details.) 

I also get segfaults when using a debug build of Python (in this case version 2.4a3) and Zope.

Furthermore, it looks like the session machinery is somehow dropping session data well before the segfault. TestSessionRig reports missing session data for bids that should exist according to its own log.
________________________________________
= Comment - Entry #12 by tim_one on Oct 4, 2004 4:29 pm

1. The first thing to do with a segfault is to build everything in debug mode, both Python and Zope.  Python in particular has lots of C assert()s that vanish in a release build.

2. When I was trying to stitch a current ZODB into the Zope trunk, I got many segfaults at first.  I don't know what caused any of them, because they all went away by themselves, as I added code to explicitly abort transactions that had suffered a failing commit, and fixed other incompatibilities.  Maybe something is left in an insane state if you don't do that.  Since the segfaults went away without trying to track down any of 'em, I never found out!
________________________________________
= Comment - Entry #11 by mcdonc on Oct 4, 2004 4:11 pm

It looks like the segfault is caused by something that depends largely on some sequence of events happening in a particular order (which happens to be evoked by the session rig tests), as opposed to some overflow: I just ran the rig test again in the same manner and didn't get a segfault for almost 15 minutes, as opposed to about 5 before.
________________________________________
= Edit - Entry #8 by mcdonc on Oct 4, 2004 3:49 pm

 Changes: submitter email, edited transcript
________________________________________
= Resubmit - Entry #7 by mcdonc on Oct 4, 2004 3:48 pm

 Status: Resolved => Pending

Spoke a bit too soon.

Running the session rig against the current trunk in the manner indicated in an earlier message to this issue indeed no longer deadlocks, but now provokes a segfault.  For me, the segfault happens after ~5 minutes of continuous pounding, not immediately.  (Note that it is no longer required to set up a "/toc" object in the Zope root, because the temp folder appears to work now.)

Updated "rig files" are in CVS at cvs.zope.org:/cvs-repository within Packages/SessionRig.  If you try to reproduce, please use the files in CVS rather than the files uploaded to this issue.
________________________________________
= Resolve - Entry #6 by mcdonc on Oct 4, 2004 3:44 pm

 Status: Pending => Resolved

The deadlock problem appears to have been solved at some point since this bug report (maybe since Tim's merge of the 3.3 HEAD?).  Running the test rig reports ancillary data consistency errors ("wrong session data"), but these are unrelated to the deadlock and may be a problem of the test rig itself.
________________________________________
= Comment - Entry #5 by mcdonc on May 23, 2004 2:40 am


Uploaded:  "SessionRigExtensions.py"
 - http://zope.org/Collectors/Zope/1350/SessionRigExtensions.py/view

________________________________________
= Comment - Entry #4 by mcdonc on May 23, 2004 2:39 am


Uploaded:  "TestSessionRig.py"
 - http://zope.org/Collectors/Zope/1350/TestSessionRig.py/view

________________________________________
= Comment - Entry #3 by mcdonc on May 23, 2004 2:39 am


Uploaded:  "sessionrig.zexp"
 - http://zope.org/Collectors/Zope/1350/sessionrig.zexp/view

________________________________________
= Comment - Entry #2 by mcdonc on May 23, 2004 2:37 am

Note also that the same stress test completes OK against the 2.7 branch code.
________________________________________
= Request - Entry #1 by mcdonc on May 23, 2004 2:37 am

It seems there is a deadlock problem somewhere in the code on the Zope HEAD.  I noticed this when running a high-stress session test against it.  The symptom is that it "freezes" shortly after maybe 40 or 50 near-simultaneous HTTP requests.  This happens with and without the "error handling in main transaction" changes I just made to ZPublisher/Publish.py and Zope/App/startup.py.  The code was also freshly compiled when I ran the test.

To replicate:

- Download the attached files:

  - TestSessionRig.py -- a runnable script which
    hammers Zope with simultaneous read/write HTTP
    requests in order to test sessioning stability.
    This can be placed anywhere.

  - sessionrig.zexp -- an importable 'zexp' file, place
    this in Zope's import directory.

  - SessionRigExtensions.py -- an 'external method'
    module, which is used by sessionrig.zexp.  It should
    be placed in Zope's 'Extensions' folder.

Set up:

  - import sessionrig.zexp into the root folder using
    the ZMI.

  - Create a "transient object container" named "toc"
    within the root folder using the ZMI "Add.." dropdown.
    Set its "data object timeout value" to 1 minute.

  - Visit the session_data_manager object present in
    Zope's root folder and change its 'transient object
    container path' to '/toc'.

Run the test:

  - Start Zope.

  - Edit the TestSessionRig.py file and change the 'path'
    global from the default 'http://localhost:8085/sessionrig/customerror'
    to something that matches your Zope's HTTP port
    number.

  - run "python TestSessionRig.py"

The test will "crap out" after some short period of time; for me it is about 20 seconds.  The CPU usage is basically nil on the process when it does so, but it seems as if all database connections are consumed, because the ZMI is inaccessible.
==============================================================


