[ZODB-Dev] B-Tree Concurrency Issue (_OOBTree.pyd segfaults)

Tim Peters tim at zope.com
Sat Apr 16 11:10:32 EDT 2005


I believe I identified one cause for catastrophic thread-race problems here
(sharing a Connection across multiple threads, or for any other reason
sharing an in-memory persistent object across non-serialized threads), even
when objects are never modified.  It doesn't look shallow to me.
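
For concreteness, here's a minimal sketch of the unsafe pattern (the file
name and the "tree" key are made up, and this won't reliably reproduce a
crash; it's just the shape of code that can hit the race:  one Connection's
in-memory objects touched by two non-serialized threads, with no writes
anywhere):

    import threading
    from ZODB import DB, FileStorage

    db = DB(FileStorage.FileStorage('data.fs'))
    conn = db.open()            # one Connection ...
    root = conn.root()

    def reader():
        # ... shared by two threads.  If root['tree'] (or one of its
        # buckets) is still a ghost, both threads can end up inside
        # unghostify() for the same object at the same time.
        for key in root['tree'].keys():
            pass

    threads = [threading.Thread(target=reader) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()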

The C code for materializing an object's state (unghostify() in
cPersistence.c) contains this pair of lines in ZODB 3.2:

        self->state = cPersistent_CHANGED_STATE;
        r = callmethod1(self->jar, py_setstate, (PyObject*)self);

That's pretty mysterious as-is.  A comment before that in ZODB 3.3 tries to
clarify:

	/* set state to CHANGED while setstate() call is in progress
	   to prevent a recursive call to _PyPersist_Load().
	*/

Alas, that's also a massive (yet tiny <wink>) timing hole when sharing an
in-memory persistent object across threads:  setstate can call back into
Python (if for no other reason, possibly to look up the "__setstate__"
attribute on self), and the GIL can be released then, so other threads can
run while setstate is in progress.
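
To make "calls back into Python" concrete:  Connection.setstate() is plain
Python code, and it ends up invoking the object's __setstate__, which is
Python too.  A custom __setstate__ isn't needed to hit the window (the
default machinery already runs Python code); the made-up subclass below
just makes the Python-level step visible:

    from persistent import Persistent   # ZODB 3.3 spelling

    class Thing(Persistent):
        def __setstate__(self, state):
            # By the time Connection.setstate() gets here, unghostify()
            # has already set self->state to CHANGED, but none of the
            # attributes have been filled in yet.  The GIL can be
            # released at any bytecode boundary in this method, letting
            # another thread see the half-built object.
            Persistent.__setstate__(self, state)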

At this point, self's memory state is still missing, but self->state claims
it's legitimate.  So if another thread tries to access the same in-memory
object while the thread running setstate is in any piece of Python code, it
will be fooled into thinking the state has already been materialized, and
read up nonsense bytes.

In particular, the `len` and `size` members of a ghost Bucket are both 0,
and that obviously explains the sporadic

    AssertionError: Bucket length < 1

_check() failures.  It's also the cause -- although it's subtler -- of the

    RuntimeError: the bucket being iterated changed size

failures:  when BTreeItems_seek() sees a bucket for which the current
iteration cursor position within the bucket is >= the current len of the
bucket, the only conclusion it can draw is that someone removed items from
the bucket since the *last* time BTreeItems_seek() was called; or, IOW,
someone changed the bucket's size during iteration.

The cursor is initialized to 0, and that's >= self->len (which is also 0)
for a ghost bucket, so cursor >= self->len the first time BTreeItems_seek()
is called, and we get a misleading message as a result.

It's never a misleading message given a non-insane BTree, because
it's impossible for a (non-ghost) bucket that's part of a BTree to have
self->len == 0 (an empty bucket is always physically removed from a BTree
internally, as soon as it becomes empty), and BTreeItems_seek() is correctly
doing the "if this bucket is a ghost, materialize its state" dance before
accessing self->len.  self->state is lying about whether it's a ghost here,
and garbage-in garbage-out follows.

Of course worse things than exceptions can happen:  reading up a pile of NUL
bytes when real data is expected can lead to anything (e.g., segfaults
aren't surprising given this).

Note that this also explains why the problems went away when I introduced a
global mutex to strictly serialize persistent object access across the
threads.  It really is necessary to serialize "even just reads" if you're
sharing in-memory objects among threads, if there's any chance that one of
them is a ghost.
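
IOW, something as simple as this (one process-wide lock, held around every
access to the shared objects, reads included) is enough to close the window
described above; the names here are made up:

    import threading

    zodb_lock = threading.Lock()   # one lock shared by every thread

    def read_tree(root):
        zodb_lock.acquire()
        try:
            # Any unghostification triggered here happens while no other
            # thread can touch the same in-memory objects.
            return list(root['tree'].items())
        finally:
            zodb_lock.release()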

There may be other reasons too; I can't promise that strictly serializing
all thread access reliably prevents all possible problems here (although it
seems likely to me, it's not an intended use, and the analysis in this msg
is all that's been done wrt it).  Note that the old email message Martin
referenced:

http://mail.python.org/pipermail/python-list/2001-February/030675.html

said that that's what its author did:

    In our application, we serialize our access to the set of
    Persistent objects amongst all threads (effectively using a
    global lock on the root object that is obtained whenever a
    thread is going to be playing with state data), and any work a
    thread does on those objects is committed before the lock is
    released.
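
In rough outline, that approach looks like this (transaction.commit() is
the newer spelling; ZODB 3.2 spelled it get_transaction().commit(); the
lock and key names are illustrative):

    import threading
    import transaction

    state_lock = threading.Lock()

    def bump_counter(root):
        state_lock.acquire()
        try:
            root['counter'] = root.get('counter', 0) + 1
            # Commit before the lock is released, so no other thread
            # ever sees these objects mid-change.
            transaction.commit()
        finally:
            state_lock.release()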



