[ZODB-Dev] large C extension objects / MemoryError

Andrew Dalke <dalke@dalkescientific.com>
Fri, 26 Oct 2001 22:42:01 -0600


Steve Alexander:
>You could try adding them to an OOBTree rather than a PersistentMapping. 
>At least with a BTree, each of its Buckets is a separate persistent 
>object. That might help you limit the amount of data you load in when an 
>object is activated.

I tried that, and rebuilt the database.  Still the same problem.
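
For reference, the change amounted to swapping the container type; the
mapping interface is the same, but an OOBTree stores its buckets as
separate persistent objects instead of one big pickle.  A simplified
sketch (in the real code the tree is the _molecule_lookup attribute of
my metrics.Database wrapper; "conn", "smiles" and "molecule" here are
just illustrative):

from BTrees.OOBTree import OOBTree

root = conn.root()                     # conn = an open ZODB connection
# was:  root['molecules'] = PersistentMapping()
root['molecules'] = OOBTree()          # buckets load independently
root['molecules'][smiles] = molecule   # same mapping interface as before
get_transaction().commit()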

Specifically, I am able to populate the database and commit
the results.  I then revisit every record to compute a value based
on a C extension.  At some point partway through I get a MemoryError.
A record is called a Molecule; the C extension object is called a
Compound.
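
The second pass looks roughly like this (a sketch only; "compute_metric"
and the attribute names are stand-ins for the real code, and "db" is my
metrics.Database wrapper as in the transcripts below):

for smi in db._molecule_lookup.keys():
    mol = db._molecule_lookup[smi]             # loads the Molecule and its Compound
    mol.value = compute_metric(mol.compound)   # the call into the C extension
# ... and somewhere in this loop the MemoryError shows up.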

I have harder numbers now.  There are 17,550 Molecules in the
database, each with a Compound.  The largest field committed
to the database (as per Barry's suggestion) is just over 2MB, and
about 15 fields fall between 1MB and 2MB.  The average size is 754 bytes,
most likely because a lot of other small chunks of data are stored.
I don't know how to tell the average size of each Compound, but
if I do

cat sizes.txt | tail -17550 | uniq | awk '{sum += $1} END {print sum}'

I get 232243740 bytes, or an upper limit of 14K per Compound.
I have a getCacheSize of 400 and the machine has either 500 or 1000
MB of memory.  So I no longer think my original assumption holds,
which was that a few large Compounds were taking up a lot of memory
whose size was disguised from ZODB.
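
To spell out the arithmetic behind that:

print 232243740 / 17550.0    # ~13233 bytes per Compound, call it 13 KB
print 400 * 13233            # ~5.3 MB if every cached object were a Compound

which is nowhere near 500-1000 MB.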

I'm concerned that the cache isn't being properly cleared.  I
reopened the database after the exception (recall that all the
data was committed safely).  I find that

>>> db = Database.open_database("CDK2.1.bf")
>>> db.db.cacheSize()
18411
>>> db.db.getCacheSize()
400
>>>

I interpret "cacheSize" as returning the number of elements
currently in the cache, while "getCacheSize" returns the requested
(soft) upper limit of the number of objects.  So I assumed
these numbers should be closer to each other.

This database build used the default values of everything in ZODB.
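
As far as I can tell, that 400 is just the default cache_size passed to
the DB constructor; I never set it myself.  Something like (a sketch;
"storage" stands for my Berkeley storage instance):

from ZODB.DB import DB

zodb = DB(storage, cache_size=400)   # per-connection soft target (the default)
print zodb.getCacheSize()            # 400, the requested upper limit
print zodb.cacheSize()               # non-ghost objects actually held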

I fiddled around a bit and somehow made the cache small again.
I don't know how (maybe by exiting without an exception?).  I started
from scratch and found the following behaviour, which I don't understand.

>>> from metrics import Database
>>> db = Database.open_database("CDK2.1.bf")
>>> db.db.cacheSize()
6
>>> db.db.getCacheSize()
400
>>> db._molecule_lookup
<OOBTree object at 01120C10>
>>> db._molecule_lookup.keys()[:2]
<OOBTreeItems object at 02320BC0>
>>> db.db.cacheSize()
236
>>> for smi in db._molecule_lookup.keys():
...     pass
...
>>> smi
'n1nn[nH] ... this is proprietary ... C2'
>>> type(smi)
<type 'string'>
>>> db.db.cacheSize()
18411
>>> db.db.cacheMinimize(0)
>>> db.db.cacheSize()
18411
>>> db.db.cacheLastGCTime()
1004156375.0
>>> import time
>>> time.time()
1004156727.906
>>> db.db.cacheFullSweep(0)
>>> db.db.cacheLastGCTime()
1004156375.0
>>> db.db.cacheFullSweep(time.time())
>>> db.db.cacheLastGCTime()
1004156375.0
>>> db.db.cacheFullSweep(1)
>>> db.db.cacheLastGCTime()
1004156375.0
>>> db.db.objectCount()
139311
>>> db.db.cacheSize()
18411
>>> dir()
['Database', '__builtins__', '__doc__', '__name__', 'db', 'smi', 'time']
>>> 

(That last dir() is to show myself that I didn't accidentally keep any
data in a local variable.)

I would have thought at least one of these calls would clear up
the cache.  Is there a way to force the cache to be cleared?  If
so, I can do that partway through my processing and see if that
gets rid of my problem.
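
Concretely, I have in mind something like this inside the second pass
(a sketch; I am guessing that committing first is what would let the
modified objects be deactivated, and the compute_metric names are
stand-ins as before):

count = 0
for smi in db._molecule_lookup.keys():
    mol = db._molecule_lookup[smi]
    mol.value = compute_metric(mol.compound)
    count = count + 1
    if count % 400 == 0:
        get_transaction().commit()   # release the changes first...
        db.db.cacheMinimize(0)       # ...then ask the cache to shrink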

I guess the next step is to try a FileStorage to see if that does
anything different.  (It shouldn't, IMO.)  I think if I keep
(re)packing the data I can keep the file under the 2GB limit.
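
The FileStorage variant would be something like this (the path and the
pack call placement are guesses):

from ZODB.FileStorage import FileStorage
from ZODB.DB import DB
import time

zodb = DB(FileStorage("CDK2.fs"))
# ... load and commit a chunk of Molecules ...
zodb.pack(time.time())     # discard old revisions to try to stay under 2GB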

Again, this is ZODB from Zope-2.4.1 with bsddbStorage taken from
CVS last week.  I'm on NT with Python 2.1.1 and using the MKS
toolkit to keep my unix fingers somewhat sane.  :)

                    Andrew
                    dalke@dalkescientific.com