[ZODB-Dev] Historical Persistent References -- Feedback Wanted!

Matt Hahnfeld matth at everysoft.com
Mon Mar 13 10:23:00 EST 2006


A few months back, the default content types for Plone (ATContentTypes)
switched from AttributeStorage to AnnotationStorage for the storage of
some attributes.  Formerly, properties on a persistent archetype object
were stored as normal object attributes.  Now they are stored in an
OOBTree referenced by an attribute named '__annotations__'.
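For anyone unfamiliar with the change, the difference can be sketched
in plain Python.  This is an illustration only, not the real
Archetypes code: a dict stands in for the persistent OOBTree, and the
annotation key is simplified (real keys are namespaced):

```python
class AttributeStored:
    # Old scheme (AttributeStorage): each property is an ordinary
    # instance attribute, pickled inside the object's own record.
    def __init__(self, title):
        self.title = title


class AnnotationStored:
    # New scheme (AnnotationStorage): properties live in a mapping
    # hung off a single attribute.  In ZODB that mapping is a
    # separate persistent object, so the parent's record holds only
    # a persistent *reference* to it -- which is what makes a naive
    # "load the old record" approach insufficient.
    def __init__(self, title):
        self.__annotations__ = {}        # an OOBTree in the real thing
        self.__annotations__['title'] = title    # key name simplified

    def get_title(self):
        return self.__annotations__['title']
```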

To make a long story short, the current implementation of
historicalRevision in Zope's OFS/History.py calls oldstate() in
ZODB/Connection.py.  The serializer then calls getState() in
ZODB/serialize.py (class ObjectReader), which sets up an unpickler
that handles persistent references by overriding _persistent_load().
Unfortunately, when the _persistent_load() method comes across a
persistent reference, it loads the CURRENT referenced object, either
from the ZODB (using the oid and ZODB/Connection.py's get()) or from
the cache.  It does not take the 'tid' into account when it loads
persistent references.
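The underlying mechanism is easy to demonstrate with the stdlib
pickle module alone.  This is a toy sketch, not ZODB code -- the DB
dict and the Ref class are invented for illustration -- but it shows
the structural problem: the persistent_load hook receives only the
persistent id (the oid), so there is nowhere to thread a tid through.

```python
import io
import pickle

# Stand-in "database": maps oid -> current state only.
DB = {'oid-1': 'CURRENT state of referenced object'}


class Ref:
    # Stand-in for a persistent cross-object reference.
    def __init__(self, oid):
        self.oid = oid


class RefPickler(pickle.Pickler):
    def persistent_id(self, obj):
        # Pickle references out-of-band by oid, as ZODB does for
        # persistent references; return None for ordinary objects.
        if isinstance(obj, Ref):
            return obj.oid
        return None


class RefUnpickler(pickle.Unpickler):
    def persistent_load(self, pid):
        # pid is just the oid -- no transaction id is available
        # here, so only the current state can be returned.
        return DB[pid]


buf = io.BytesIO()
RefPickler(buf).dump({'child': Ref('oid-1')})
buf.seek(0)
state = RefUnpickler(buf).load()
```

Even if the outer record were a historical pickle, resolving its
references this way always yields current data.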

In order for Zope's "history" tab to work for anything other than a very
simple object (with no persistent references), it needs to "deeply" copy
objects out of the ZODB.  In other words, the persistent references we
pull back for a historical revision of an object should be from the same
'tid' as the original object.

My initial thinking was to use _txn_time in Connection.py to set
an upper bound for which revisions could be pulled back, then simply use
the connection and deserializer as we normally would to pull back the
appropriate revisions of everything.  Upon further inspection, though,
it looks to be pretty complex. _setstate_noncurrent is only called by
_load_before_or_conflict, which is called by _setstate based on
_invalidated, which I wouldn't want to touch.  There would also
potentially be issues with caching (although we could probably use a fresh
cache for the "history" connection so it wouldn't conflict with the normal
cache).

My second attempt to resolve this problem is aimed at the serializer
instead.  I wrote a working proof of concept that basically overrides the
normal (de)serializer to:

1. use a "tid" when pulling back historical persistent references
2. get rid of the cache -- we don't care about caching when pulling back
historical revisions of objects

My proof of concept is implemented as TimeTraveler.py, attached below. 
Although it is written as an independent class, it could presumably
subclass ObjectReader in ZODB/serialize.py -- since that's where it
gets most of its code.

Anyway, I have a few questions:

1. Does it make more sense to patch the serializer (as I've done) or try
to override the max tid on a connection object in order to pull back
"deep" historical revisions? Or are there better ways?

2. What issues might we face using the proof of concept attached below?

3. Would this be better implemented as a patch to Zope, or as a separate
standalone class for pulling back historical revisions of persistent
objects?

4. Does the Zope community see this as a critical issue that needs
to be resolved?

5. Looking forward, what is the best way to "deep copy" a historical
revision to the current revision (or, better yet, create a new
transaction that changes the current revision to contain data from the
historical revision)?  Could I somehow use _p_changed=1 on an old
object to make it so it's automagically copied to the current
revision?

The code attached below was tested against a clean Zope 2.8.5 w/ Plone
2.1.2 (debian 'unstable' packages).  I tested things using "zopectl
debug".

Thanks in advance for any feedback you can provide...

Matt Hahnfeld
matth at everysoft.com

---

import OFS.History
import ZODB.broken
import cPickle
import cStringIO
from persistent.wref import WeakRef

# TimeTraveler
# ZODB (deep) historical objects -- proof of concept
# Matt Hahnfeld 3/10/06
#
# Most of this code was stolen from ZODB/Connection.py
# or ZODB/serialize.py.  This should probably subclass
# ZODB.serialize.ObjectReader.
#
# Usage:
#
# get tid from p._p_jar._storage.history(p._p_oid, size=10)
#
# from TimeTraveler import TimeTraveler
# p = app.my_plone_site.my_page
# tt = TimeTraveler(p,'\x03c\xff\xcd\x15X\x80\xcc')
# p_old = tt.get()
#
# p_old will be a deep copy of p for the tid specified.


class TimeTraveler:

    def __init__(self, obj, tid):
        self._obj = obj
        self._tid = tid
        self._conn = self._obj._p_jar
        self._storage = self._conn._storage
        self._factory = self._conn._db.classFactory

    def get(self):
        obj = self._get_object(self._obj._p_oid)
        return obj.__of__(self._obj.aq_parent)
        
    def _get_object(self, oid):
        pickle = self._storage.loadSerial(oid, self._tid)
        unpickler = self._get_unpickler(pickle)
        klass = unpickler.load()
        if isinstance(klass, tuple):
            # Here we have a separate class and args.
            # This could be an old record, so the class may be given
            # as a named reference.
            klass, args = klass
            if isinstance(klass, tuple):
                # Old module_name, class_name tuple
                klass = self._get_class(*klass)

            if args is None:
                args = ()
        else:
            # Definitely new style direct class reference
            args = ()

        if issubclass(klass, ZODB.broken.Broken):
            # We got a broken class. We might need to make it
            # PersistentBroken
            if not issubclass(klass, ZODB.broken.PersistentBroken):
                klass = ZODB.broken.persistentBroken(klass)

        obj = klass.__new__(klass, *args)
        state = unpickler.load()
        obj.__setstate__(state)

        # set the persistence attributes (HystoryJar is the
        # history-aware jar from OFS/History.py)
        obj._p_jar = OFS.History.HystoryJar(self._conn)
        obj._p_oid = oid
        obj._p_serial = self._tid
        obj._p_changed = 0
        
        return obj

    def _get_unpickler(self, pickle):
        file = cStringIO.StringIO(pickle)
        unpickler = cPickle.Unpickler(file)
        unpickler.persistent_load = self._persistent_load
        factory = self._factory
        conn = self._conn

        def find_global(modulename, name):
            return factory(conn, modulename, name)

        unpickler.find_global = find_global

        return unpickler
    
    def _get_class(self, module, name):
        return self._factory(self._conn, module, name)

    def _persistent_load(self, oid):
        if isinstance(oid, list):
            # weakref
            [oid] = oid
            obj = WeakRef.__new__(WeakRef)
            obj.oid = oid
            obj.dm = self._conn
            return obj
        elif isinstance(oid, tuple):
            oid = oid[0]
        return self._get_object(oid)



