[ZODB-Dev] what appears inside zodb storage?

Barry A. Warsaw barry@zope.com
Thu, 19 Dec 2002 00:34:20 -0500


>>>>> "DM" == David McCusker <david@osafoundation.org> writes:

    DM> I wonder whether someone can verify my current understanding
    DM> of what appears inside a zodb storage like file storage.  (If
    DM> other storages have contents that differ substantially, this
    DM> would be good to know in some detail.)

Since you asked... :)

I'll briefly <wink> discuss the data that Berkeley full storage
maintains.  The minimal storage (no-versions no-undo) is much simpler
because it does not maintain multiple object revisions, so it keeps
much less metadata.  Gory details are available in the comments in the
code.  Apologies in advance for any fatigue-induced errors. ;)

Essentially we're keeping a table mapping "object revisions" to
pickles.  My definition of an object revision is an oid, serial number
pair (revid = oid+serial).  Serial numbers share lots of properties
with transaction ids <wink>.  Beyond that, we keep a zillion other
metadata tables that make certain operations easier to calculate.  For
example, we keep a mapping of oids to current serial number, so that
we can quickly find the current object rev's pickle data.
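The two core tables above can be sketched as plain dicts (a hypothetical model, not the real BerkeleyDB schema; names like `make_revid` are illustrative):

```python
def make_revid(oid: bytes, serial: bytes) -> bytes:
    """An object revision id is just the oid + serial pair."""
    return oid + serial

pickles = {}          # revid -> pickle bytes
current_serial = {}   # oid -> current serial number

def store(oid, serial, data):
    """Write a new revision and update the oid's current serial."""
    pickles[make_revid(oid, serial)] = data
    current_serial[oid] = serial

def load_current(oid):
    """The oid -> serial map lets us find the current pickle quickly."""
    return pickles[make_revid(oid, current_serial[oid])]

store(b'\x00' * 8, b'\x00' * 7 + b'\x01', b'pickle-v1')
store(b'\x00' * 8, b'\x00' * 7 + b'\x02', b'pickle-v2')
```

The point of the second table is exactly the lookup shown in `load_current`: no scan over revisions is ever needed to find an object's current state.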

Every time an object gets modified, a new pickle is written, along
with an entry in a metadata table.  The metadata table maps revids to
4 other ids: a version id, a "non-version serial", a "live serial",
and a previous serial.  Here's what these mean:

- version id.  Objects can live in a named version (think roughly:
  locked, long-lived transactions) or the "non-version".  When storing
  a new object revision on a version, the version id will be a key
  into a table mapping version ids to version strings.  The API
  specifies only version strings so we need to keep this mapping.
  There's a table that keeps a reverse mapping of version strings to
  version ids.  If this revision isn't in a version, the version id
  will be zero.

- non-version serial.  Let's say you store an object, then you make a
  change to that object in a version.  The metadata for the second
  change will contain a pointer to the most current revision of the
  object in the non-version.  If you abort the version, you'll need
  this pointer to find the state of the object as it existed before
  the version.

- live serial.  The pickle values are not kept in the metadata table
  because multiple revisions may share the pickle data (think
  transactional undo and abort version).  The live serial is really a
  pointer, combined with the oid, to give the entry in the pickles
  table that contains the data for this object revision.

- previous serial.  The serial number of the previous revision of this
  object.  If this is the first object revision, the previous serial
  will be zero.
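Putting the four ids together, one metadata record might be modeled like this (a hedged sketch; the field names are illustrative, not the storage's actual column names):

```python
from dataclasses import dataclass

ZERO = b'\x00' * 8  # stands in for "no version" / "no previous revision"

@dataclass
class RevisionMetadata:
    vid: bytes        # version id; ZERO if not stored in a version
    nvrevid: bytes    # serial of the latest non-version revision
    lrevid: bytes     # "live" serial: oid + lrevid keys the pickles table
    prevrevid: bytes  # serial of the previous revision; ZERO if first

metadata = {}  # (oid, serial) -> RevisionMetadata

# First store of an object, outside any version:
metadata[(b'obj1', b'\x01')] = RevisionMetadata(
    vid=ZERO, nvrevid=ZERO, lrevid=b'\x01', prevrevid=ZERO)

# A later change to the same object made inside a version: it remembers
# both the prior revision and the last non-version revision, so an
# abortVersion can restore the pre-version state.
metadata[(b'obj1', b'\x02')] = RevisionMetadata(
    vid=b'v1', nvrevid=b'\x01', lrevid=b'\x02', prevrevid=b'\x01')
```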

The metadata table is the key table for tying everything together, and
it's largely this table that's consulted during a pack.  A pack gets
rid of metadata entries, with appropriate reference decrementing and
garbage collection along the way.

There's a table mapping oids to refcounts, which works much like
Python's reference counting scheme.  Say A->B and C->B.  B's refcount
is 2.

There's a table mapping transaction ids to transaction metadata
(i.e. the user, description, and ext stuff in a tpc_begin() call).

There's a mapping between transaction ids and the oids of the objects
modified in that transaction.  This table is useful bookkeeping for
several operations such as transactional undo and packing.
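That bookkeeping table makes undo cheap: to revisit a transaction you only need the oids it touched, not a scan of everything. A minimal sketch (function names assumed for illustration):

```python
from collections import defaultdict

txn_oids = defaultdict(list)   # transaction id -> oids modified in it

def note_store(tid, oid):
    """Record, at store time, which transaction touched which object."""
    txn_oids[tid].append(oid)

def oids_to_undo(tid):
    """Undoing a transaction only needs to revisit the oids it modified."""
    return txn_oids[tid]

note_store(b'tid1', b'objA')
note_store(b'tid1', b'objB')
note_store(b'tid2', b'objA')
```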

There's a table mapping oids to pickle reference counts.  A pickle's
reference count includes a sum of all the object revisions that point
to this pickle.  Say we create a tree such that A->B, then we break
the link between A and B, then we undo the transaction that broke the
link.  We now have two revisions of object A that point to B, so B's
pickle refcount is 2.  So you can see that the only time we decrement
a pickle's refcount is when we pack away object revisions.
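The undo scenario above can be walked through in a few lines (a hedged sketch of the counting, not the storage's actual decref machinery):

```python
pickle_refcount = {}   # oid -> number of object revisions pointing at it

def add_revision_pointing_to(oid):
    """Each new revision that references this object's pickle bumps it."""
    pickle_refcount[oid] = pickle_refcount.get(oid, 0) + 1

add_revision_pointing_to(b'B')   # revision 1 of A: A -> B
# ... the link A -> B is broken: that revision adds no reference to B ...
add_revision_pointing_to(b'B')   # undo restores the link: a second
                                 # revision of A now points at B
```

Note that nothing here ever decrements: consistent with the text, refcounts only drop when a pack removes object revisions.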

There's a table that maps version ids to the object revids that got
stored in that version (this table is not updated for stores to the
non-version).  Again, useful bookkeeping for abortVersion and
commitVersion.

These are the main operational tables.  After this, there are a bunch
of tables to support "auto-recovery".  Let's say we had a system crash
in the middle of tpc_finish().  2PC doesn't allow us to abort the
unfinished transaction, but fortunately, Berkeley's transactions let
us roll back the partially committed (underlying BDB) transaction.
When the storage is opened again, it will perform autorecovery, by
re-committing the last transaction which we promised to commit but
didn't finish.  We do this before accepting any other transaction
starts.  If the failure occurred any time before tpc_finish(), we'll
abort the unfinished transaction.

I'll skip the gory details about the oids, pvids, prevrevids, and
pending tables.  These support autorecovery and the file comments
should be enough to understand their purpose.

We keep a table that has one entry in it (we care only about the
key).  It contains a timestamp (a.k.a. transaction id <wink>) of the
last pack time.  Used in autopack.

We keep a table that contains information about all the packable
object revisions.  Not every revision is packable; the current
revision is not, but when we write a new object revision, we add an
entry into this table pointing to the previous object revision.  There
are exceptions concerning stores to a version (see the comment for
details).  This table is actually a major advantage of the Berkeley
storage because to pack, we only need to start at the front of the
table and iterate until we find a serial number greater than the pack
time.  Each object revision we find is packed away.
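The pack scan it describes is just a prefix walk over serial-sorted entries (an illustrative sketch, assuming the entries are kept sorted by serial as the text implies):

```python
# (serial, oid) pairs for packable revisions, kept sorted by serial.
packable = [
    (1, b'objA'), (2, b'objB'), (5, b'objA'), (9, b'objC'),
]

def pack(pack_time):
    """Iterate from the front; stop at the first serial newer than
    the pack time.  Everything before that point gets packed away."""
    packed = []
    for serial, oid in packable:
        if serial > pack_time:
            break
        packed.append((serial, oid))
    return packed
```

Because the scan never visits entries newer than the pack time, the cost of a pack is proportional to what gets removed, not to the whole table.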

Finally, there are three tables that support mark-and-sweep "full
pack" operations.  Note that autopack only packs away old object
revisions, and performs any object removals as indicated by decref'ing
to zero.  But just like with Python, you need a separate way to catch
cyclic garbage that may keep an object cycle alive too long.  The
pack() method, and optionally the autopack thread, can do a full
mark-and-sweep of the object tree, to break and collect cycles.  These
last three tables make sure this is done in a (BerkeleyDB)
transactionally safe way, without bloating memory.

That's probably more information than anybody else cares about, but
I'm tired and couldn't stop myself from spewing this into posterity.

nightie-night-ly y'rs,
-Barry