[ZODB-Dev] large C extension objects / MemoryError

Andrew Dalke <dalke@dalkescientific.com>
Fri, 26 Oct 2001 01:27:45 -0600


Hello,

  We're working with ZODB outside of Zope.  We are storing
C extension objects ("compounds") which aren't derived from
Persistence but do define how to pickle themselves, which
means they can be stored in ZODB.
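
To illustrate what I mean by "define a pickle" (a minimal
pure-Python sketch, not our actual extension code -- the
Compound class and its fields here are hypothetical): an
object that isn't Persistent can still go into ZODB as long
as pickle knows how to serialize it, e.g. via a reduce
function registered with copyreg, the way one typically
does for an extension type.

```python
import copyreg
import pickle

class Compound:
    """Stand-in for a C extension type holding molecular data."""
    def __init__(self, name, data):
        self.name = name
        self.data = data

def _reduce_compound(c):
    # Tell pickle how to rebuild a Compound from its parts.
    return (Compound, (c.name, c.data))

# Register the reducer, as one would for an extension type
# that can't define __reduce__ itself.
copyreg.pickle(Compound, _reduce_compound)

c = Compound("aspirin", b"\x00" * 16)
c2 = pickle.loads(pickle.dumps(c))
```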

Previously we didn't have problems storing them in the
database, because the 2GB limit imposed by FileStorage meant
that we kept a large part of the data in the file system and
loaded it on demand.

We've shifted to using BerkeleyDB along with Zope-2.4.1 with the
goal of putting all the data in the database rather than
scattered around the file system.  Each compound can be up
to about 4MB in size when expressed as a binary pickle, and
we have about 17K of them.  The BerkeleyDB directory contains
984 MB, so the average compound size is no more than 58K.
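
(A quick sanity check on that average, nothing more:)

```python
total_bytes = 984 * 1000 * 1000   # 984 MB on disk
num_compounds = 17_000            # roughly 17K compounds
avg_kb = total_bytes / num_compounds / 1000
print(round(avg_kb))  # ~58 KB average per compound
```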

We are able to create the database and commit the result.
We then run a very simple piece of code, which causes a
MemoryError.  It's something like this (I'm omitting the
translation layer that maps things to the ZODB level, but I
think it's apparent):

  missing_confs = []
  has_confs = []

  # 'database' holds a ZODB.DB, and the catalog stores a
  # PersistentMapping, keyed by the compound name.  The
  # 'molecules()' method returns a list of Molecule wrappers
  # which store the compound name and a reference to the
  # storage.  A Molecule is not stored in the database - but it
  # knows how to resolve __getitem__ requests from the database.
  for mol in database.molecules():
    # "num_confs" is computed via a 'Rule', which returns
    # mol["conformers"].Confs()
    if mol["num_confs"] == 0:
      missing_confs.append(mol)
    else:
      has_confs.append(mol)
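
For concreteness, here's a rough reconstruction of the layer
the comments describe -- a sketch only, with names taken from
the snippet above; the real implementation surely differs
(e.g. I've swapped Confs() for len() so it runs standalone):

```python
class Rule:
    """Maps a property name to a function that computes it."""
    def __init__(self, func):
        self.func = func

    def on_get(self, name, obj):
        return self.func(name, obj)

class Molecule:
    """Lightweight wrapper: knows its name and the backing storage."""
    rules = {}   # property name -> Rule, shared lookup table

    def __init__(self, name, storage):
        self.name = name
        self.storage = storage   # e.g. a PersistentMapping in ZODB

    def __getitem__(self, name):
        rule = self.rules.get(name)
        if rule is not None:
            return rule.on_get(name, self)   # computed property
        return self.storage[name]            # stored data

# A "num_confs" rule that pulls the stored conformers object:
def get_num_confs(name, mol):
    return len(mol["conformers"])

Molecule.rules["num_confs"] = Rule(get_num_confs)

mol = Molecule("aspirin", {"conformers": [1, 2, 3]})
```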

The exception is (modulo errors from typing, cleanup and commentary)

  File "031.check_confs.py", line 39, in ?
    main(dbname)
  File "031.check_confs.py", line 25, in main
    if mol["num_confs"] == 0:

    --- looking up the 'num_confs' rule
  File "d:\...\metrics\Database.py", line 180, in __getitem__
    return rule.on_get(name, self)

    --- the request results in calling the function 'get_num_confs'
  File "d:\...\metrics\Rule.py", line 304, in on_get
    return self.func(name, obj)

    --- which recursively asks for the "compound" property
  File "d:\...\metrics\ChemRules.py", line 373, in get_num_confs
    cmpd = mol["compound"]

    --- look up the compound
  File "d:\...\metrics\Database.py", line 180, in __getitem__
    return rule.on_get(name, self)

    --- which is stored in a PersistentMapping in the database's storage
  File "d:\...\metrics\Rule.py", line 555, in on_get
    return obj.storage[name]

    --- and enters ZODB code
  File "C:\...\ZODB\PersistentMapping.py", line 114, in __getitem__
    return self._container[key]
  File "C:\...\ZODB\Connection.py", line 519, in setstate
    p, serial = self._storage.load(oid, self._version)
  File "C:\Python21\bsddb3Storage\Full.py", line 500, in load
    return self._pickles[oid+lrevid], revid
MemoryError


I'm watching the process size over time.  At the beginning it's
flat, but after about 200 compounds it starts growing.  (I
think the first 200 compounds are small but I haven't checked.)
Then the process size starts growing for every compound.

The process size never goes down, and at compound #317 I
start hitting the disk.  After waiting a lot longer, I get
the MemoryError.

I tried tweaking the cache parameters to reduce the number
of objects in memory.  I did this with
  database.db.setCacheSize(10)
  database.db.setCacheDeactivateAfter(5)

No change in the growth.  I tried to force a cache clear
  database.db.cacheMinimize(0)  # and tried 1
  database.db.cacheFullSweep(0) # and tried 1
but that didn't affect memory usage.

The only two calls to our extension library are to pickle/unpickle
a compound and to compute the number of conformers.  I tested
those alone and there were no memory leaks.
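
The leak test was roughly of this shape (a simplified
stand-in: Compound here is a plain class, and the code only
asserts round-trip correctness -- process size was watched
externally with a system monitor):

```python
import pickle

class Compound:
    # Stand-in for the extension type; the real test used the C objects.
    def __init__(self, data):
        self.data = data

def roundtrip_many(n, payload):
    # Pickle/unpickle repeatedly; under a process monitor the
    # resident size should stay flat if nothing leaks.
    c = Compound(payload)
    for _ in range(n):
        c = pickle.loads(pickle.dumps(c, pickle.HIGHEST_PROTOCOL))
    return c

result = roundtrip_many(1000, b"x" * 4096)
```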

My current guess is that ZODB doesn't know how to work with
the C extension.  I can think of three possibilities:

  - the cache size and deactivation values are hints.  The
documentation says they are not strict limits.  Perhaps ZODB
thinks these objects are small, so doesn't think they need
to be taken out of memory?  I tried looking at the source to
see how that determination is made, but got lost.

  - the persistence mechanism is based on bits of Python
magic I don't understand, despite reading Paul's conference
paper.  Perhaps that magic doesn't work for extension types?
E.g., do I need to derive somehow from Persistent?  Are a
ghosted object's destructors called if the object is cached?

  - BSDDB3 does its own caching behind ZODB.  However, the
Sleepycat docs say that's only 256 KB.

I tried digging through the documentation, source code,
and back mailing list archives but didn't find any discussion
on this problem.

As an interesting clue (mentioned above in passing), we are
able to build the database.  We just can't visit all the
items once it's created without running out of memory.

Any suggestions on what I should do next?

                    Andrew Dalke
                    dalke@dalkescientific.com