[ZODB-Dev] Support for graceful ZODB Class renaming

Thu, 16 Jan 2003 15:14:25 -0500

Problem

   A long-standing problem in ZODB is that renaming/moving classes
   or modules is painful, because module and class names are scattered
   throughout databases.

   For example, consider a class named C, stored in module x.y.z.
   The database records for persistent instances of the class contain
   a pickle of a tuple containing the module and class names.  Similarly,
   pickles of containing objects have pickles of the same tuples.

   If the instances of the class are non-persistent, then the database
   contains "global" pickles for the classes wherever there is an instance
   pickle.

   If we wish to either rename the class or the module, or move the
   class to a different module, then we have a problem, because we
   have lots of pickles with the old name that will be unloadable if
   the old names become invalid.
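
   To make the failure concrete, here is a minimal sketch using the
   plain pickle module rather than ZODB itself.  The module names
   zodb_demo_old and zodb_demo_new are made up for illustration, and
   the rename is only simulated by editing sys.modules.

```python
import pickle
import sys
import types

# Hypothetical module standing in for x.y.z; not part of any real package.
module = types.ModuleType("zodb_demo_old")
C = type("C", (object,), {"__module__": "zodb_demo_old"})
module.C = C
sys.modules["zodb_demo_old"] = module

obj = C()
obj.value = 42
data = pickle.dumps(obj)

# The pickle refers to the class by module name and class name.
names_in_pickle = b"zodb_demo_old" in data and b"C" in data

# Simulate renaming the module: the old name disappears.
del sys.modules["zodb_demo_old"]
module.__name__ = "zodb_demo_new"
C.__module__ = "zodb_demo_new"
sys.modules["zodb_demo_new"] = module

try:
    pickle.loads(data)
    load_failed = False
except ImportError:
    # The old dotted name can no longer be resolved.
    load_failed = True
```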

   Define the "dotted name" of a class to be some combination of the
   module and class name. We have a problem when the dotted name of a
   class changes.  (This problem extends to other global objects, but
   classes provide the most common and compelling source of the problem.)

   Because ZODB 4 is still in an early stage of development, this seems
   like an opportune time to consider solutions to this problem.

Possible solutions

   1. The classic solution to this problem was to create aliases for the
      old names.

      For example, suppose we renamed x.y to x.q. We'd also modify x.q's
      __init__.py to create an alias in sys.modules::

        sys.modules['x.y'] = sys.modules['x.q']

      We'd create a similar alias in x.q.z::

        sys.modules['x.y.z'] = sys.modules['x.q.z']

      This is a bit of a bother.

      This could be cleaned up a bit if there was an alias table that
      one could create (probably with an include mechanism) to collect these
      operations together.

      A bother with this approach is that the aliases need to be maintained
      as long as the old pickles exist in the database, which could be
      indefinitely.

      A real problem with this approach is that we could end up
      unpickling objects with the wrong class if the old names get
      reused by new classes.  For example, suppose that, after renaming
      x.y, we create a new x.y with a z containing a C. This new C
      class would be instantiated for pickles that should really get
      the x.q.z.C class. This requires enough bad luck, however, that
      we haven't been bitten by it yet AFAIK.
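
      The alias table idea could look roughly like the sketch below.
      It is only an illustration: the demo_x package, the ALIASES
      dictionary and install_aliases() are hypothetical names, and the
      modules are built in memory so the example is self-contained.
      The old pickle is a hand-written protocol 0 pickle that names
      the class by its old dotted name.

```python
import pickle
import sys
import types


def make_module(name):
    """Create an empty in-memory module and register it in sys.modules."""
    module = types.ModuleType(name)
    sys.modules[name] = module
    return module


# Hypothetical layout after renaming demo_x.y to demo_x.q.
make_module("demo_x")
make_module("demo_x.q")
new_z = make_module("demo_x.q.z")
new_z.C = type("C", (object,), {"__module__": "demo_x.q.z"})

# Alias table: old module name -> new module name, kept in one place.
ALIASES = {
    "demo_x.y": "demo_x.q",
    "demo_x.y.z": "demo_x.q.z",
}


def install_aliases(aliases):
    """Make each old module name resolve to the already-imported new module."""
    for old_name, new_name in aliases.items():
        sys.modules[old_name] = sys.modules[new_name]


install_aliases(ALIASES)

# A pickle of the class reference "demo_x.y.z.C", written under the old name.
old_pickle = b"cdemo_x.y.z\nC\n."
loaded_class = pickle.loads(old_pickle)
```

      With the aliases installed, the old name resolves to the class
      in its new location.  The name-reuse hazard described above
      still applies: a new module registered later under an old name
      would silently replace the alias.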

   2. Another approach would be to write a data conversion utility for the
      database. This would require a conversion file much like the alias file
      described above.

      You might have to shut down the database while you do the
      conversion, resulting in down time. However, if you combined the
      aliasing approach with conversion, you could avoid the down time.

      Suppose, for example, that you had an alias table mapping old to
      new dotted object names.  We can use the database without
      modifications if we provide a "global" loader that uses this alias
      file (or if we have a utility that manipulates sys.modules on
      start up).

      We can write a utility for file storage, similar to a
      pack, that makes a live copy of the storage file, containing
      converted records and that switches to the new file when the
      conversion is complete.  For many other storages, we could perform
      the fix ups in-place, which is even more attractive.
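
      As a rough sketch of how the alias-aware loader and the
      conversion step might fit together, the example below overrides
      find_class() on the standard pickle.Unpickler.  It works on
      plain pickles, not on real ZODB records (which also carry
      persistent references), and ALIASES, AliasUnpickler,
      load_with_aliases() and convert_record() are all hypothetical
      names.

```python
import io
import pickle
import sys
import types

# Alias table: (old module, old class name) -> (new module, new class name).
ALIASES = {("conv_demo_old", "C"): ("conv_demo_new", "C")}


class AliasUnpickler(pickle.Unpickler):
    """Unpickler whose global lookup consults the alias table first."""

    def find_class(self, module, name):
        module, name = ALIASES.get((module, name), (module, name))
        return super().find_class(module, name)


def load_with_aliases(data):
    """Load a pickle that may still use old dotted names."""
    return AliasUnpickler(io.BytesIO(data)).load()


def convert_record(data):
    """Rewrite one pickle so that it uses the current dotted names."""
    return pickle.dumps(load_with_aliases(data))


# Build a record under the old name, using an in-memory demo module.
module = types.ModuleType("conv_demo_old")
C = type("C", (object,), {"__module__": "conv_demo_old"})
module.C = C
sys.modules["conv_demo_old"] = module
obj = C()
obj.value = 42
old_record = pickle.dumps(obj)

# Simulate the rename: only the new name is importable from here on.
del sys.modules["conv_demo_old"]
module.__name__ = "conv_demo_new"
C.__module__ = "conv_demo_new"
sys.modules["conv_demo_new"] = module

live_value = load_with_aliases(old_record).value
new_record = convert_record(old_record)
```

      The database stays usable through load_with_aliases() while
      records are converted; once every record has been rewritten,
      the alias entry is no longer needed.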

   3. A more sophisticated approach is to build a table, stored in the
      database, providing a two-way mapping between a unique id and a
      class module and name.  The ids could be assigned automatically.
      When pickling a class, we'd pickle the id, rather than the module
      and class name. When unpickling a class, we'd look up the module
      and class name in the table.

      As with option 2, an explicit operation is needed to change dotted
      class names. As with option 2, aliases could be used to minimize
      down time.  Unlike option 2, the update operation could be really
      fast, because we only need to update a single table.

      A secondary benefit of this approach is that pickle sizes can be
      reduced substantially, because class ids, rather than dotted names,
      are stored.

      A downside of this approach is that mishaps in managing the id
      table would be quite serious. For example, if a database record
      containing the class ids is lost due to database corruption, large
      portions of the database would become unusable.   There are
      various ways that this risk could be mitigated. For example, we
      could keep assigned ids in a redundant file, possibly using a
      simple log file.

      Another disadvantage of this approach is that the ZODB software,
      including storage implementations, has to be more sophisticated
      to deal with the id to global mapping.
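
      One way to picture the id table is the sketch below.  It is an
      assumption-laden illustration, not a design for ZODB: the table
      lives in memory instead of in the database, ClassIdTable,
      IdPickler and IdUnpickler are hypothetical names, and it
      borrows pickle's persistent_id/persistent_load hooks to swap
      class references for ids, which real ZODB already uses for
      object references.

```python
import importlib
import io
import pickle
import sys
import types


class ClassIdTable:
    """Two-way mapping between small integer ids and (module, class name)."""

    def __init__(self):
        self._name_by_id = {}
        self._id_by_name = {}

    def id_for(self, module, name):
        key = (module, name)
        if key not in self._id_by_name:
            new_id = len(self._name_by_id) + 1  # ids assigned automatically
            self._id_by_name[key] = new_id
            self._name_by_id[new_id] = key
        return self._id_by_name[key]

    def name_for(self, class_id):
        return self._name_by_id[class_id]

    def rename(self, old, new):
        """Point an existing id at a new dotted name; no records change."""
        class_id = self._id_by_name.pop(old)
        self._id_by_name[new] = class_id
        self._name_by_id[class_id] = new


class IdPickler(pickle.Pickler):
    def __init__(self, file, table):
        super().__init__(file)
        self.table = table

    def persistent_id(self, obj):
        # Store an id in place of any class reference.
        if isinstance(obj, type):
            return self.table.id_for(obj.__module__, obj.__qualname__)
        return None


class IdUnpickler(pickle.Unpickler):
    def __init__(self, file, table):
        super().__init__(file)
        self.table = table

    def persistent_load(self, class_id):
        module, name = self.table.name_for(class_id)
        return getattr(importlib.import_module(module), name)


table = ClassIdTable()
module = types.ModuleType("ids_demo_old")
C = type("C", (object,), {"__module__": "ids_demo_old"})
module.C = C
sys.modules["ids_demo_old"] = module

obj = C()
obj.value = 42
buffer = io.BytesIO()
IdPickler(buffer, table).dump(obj)
record = buffer.getvalue()

# Simulate the rename, then update the single table entry.
del sys.modules["ids_demo_old"]
C.__module__ = "ids_demo_new"
sys.modules["ids_demo_new"] = module
table.rename(("ids_demo_old", "C"), ("ids_demo_new", "C"))

restored = IdUnpickler(io.BytesIO(record), table).load()
```

      The stored record never mentions the dotted name, so the rename
      touches only the table.  The same property is what makes losing
      the table so damaging.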

   4. A variation on approach 3 is to have class authors explicitly
      assign globally unique IDs (GUIDs) to classes.  These GUIDs would
      be used rather than randomly selected ids.  This is a fairly
      significant burden to place on class developers. GUIDs also
      require more space than locally assigned ids.

      An advantage of GUIDs is that GUIDs can be recovered from class
      source files, so that there is a built-in redundancy in the
      management of ids.

      It's possible that GUIDs could be an optional feature of approach
      3.
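
      The redundancy argument for GUIDs can be sketched as follows.
      The class_guid attribute, the example classes and
      rebuild_guid_table() are hypothetical, and the GUID values are
      arbitrary; the point is only that the table can be regenerated
      from the class definitions.

```python
import uuid


class Account:
    # Fixed by the class author and never changed, even if the class moves.
    class_guid = "6f1c2a1e-3b7d-4c58-9a51-0d2e8f4b7c11"


class Invoice:
    class_guid = "b2d4e6f8-1a3c-4e5f-8a7b-9c0d1e2f3a4b"


def rebuild_guid_table(classes):
    """Recover the GUID -> (module, class name) table from the classes."""
    table = {}
    for cls in classes:
        guid = uuid.UUID(cls.class_guid)
        if guid in table:
            raise ValueError("duplicate GUID: %s" % guid)
        table[guid] = (cls.__module__, cls.__name__)
    return table


guid_table = rebuild_guid_table([Account, Invoice])

# Each GUID costs 16 bytes in binary form, against a few bytes for a
# small locally assigned integer id.
guid_size = len(uuid.UUID(Account.class_guid).bytes)
```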

I'm inclined to go with option 2 because:

- Overall, it is simpler, although the conversion aspect is more
   complicated.

- It has no risk of lost id information.

Thoughts?

Jim

-- 
Jim Fulton           mailto:jim@zope.com       Python Powered!
CTO                  (888) 344-4332            http://www.python.org
Zope Corporation     http://www.zope.com       http://www.zope.org