[ZODB-Dev] Advice on ZODB with large datasets

Wed Jun 18 12:03:10 EDT 2008

We have a large dataset of 650,000+ records that I'd like to examine 
easily in Python.  I have figured out how to put this into a ZODB file 
that totals 4 GB in size.  But I'm new to ZODB and very large databases, 
and have a few questions.

1. The data is in an IOBTree, so I can access each item once I know the 
key, but to get the list of keys I tried:

scores = root['scores']
ids = [id for id in scores.iterkeys()]

This seems to require the entire tree to be loaded into memory, which 
takes more RAM than I have.

If I instead avoid the list comprehension and use an actual loop, I can 
explicitly call cacheMinimize every n records, and keep the memory 
reasonable.
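
A minimal sketch of the kind of loop described above, not the exact code 
in use.  The function name collect_keys and the batch_size parameter are 
made up for illustration; `tree` is assumed to be the IOBTree 
(root['scores']) and `connection` the ZODB connection it was loaded from.

```python
def collect_keys(tree, connection, batch_size=10000):
    """Return all keys of `tree`, trimming the ZODB cache along the way.

    Assumptions: `tree` is a BTree such as root['scores'] and `connection`
    is the ZODB connection it was loaded from. Both are placeholders here.
    """
    keys = []
    # BTree.keys() is lazy, so buckets are loaded only as iteration reaches them.
    for count, key in enumerate(tree.keys(), 1):
        keys.append(key)
        if count % batch_size == 0:
            # Turn unmodified cached objects back into ghosts to free memory.
            connection.cacheMinimize()
    return keys
```

It would be called as collect_keys(root['scores'], connection).  The key 
list itself still has to fit in memory; only the tree's buckets are 
released between batches.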

So, how and when does the cache normally get minimized?  Should I just 
avoid list comprehensions and explicitly clean the cache the way I'm 
doing, or are there any tricks to minimize RAM usage?

2. Obviously I should save my list of keys in the database.  I'd also 
like to have other indexes.  It appears the usual technique is to use 
ZCatalog <http://www.blazingthings.com/dev/zcatalog.html>.  Am I 
correct?  Is there any good documentation on how to use that with ZODB? 
(All the examples I can find are either about using the catalog from 
within Zope or about using it in a purely standalone manner.)  Are 
there any concerns I should be aware of when using it with large datasets?

3. Are there any guides on how to tune my ZODB usage?  I had to dig 
around for a while to realize I should be using BTrees and the 
cacheMinimize method.  Are there any other knobs I should know about?

So far, I've simply read the data from an XML file and converted it. 
I've set the cache size to 1000, and every 10000 entries, I commit the 
transaction, and minimize the caches.  The conversion takes about 60 
hours to run and uses roughly half my memory, which is acceptable, but 
if I can tune it to be faster at the cost of slightly more memory, I'd 
be happier.  (The performance is roughly O(N^2), although halfway 
through it's closer to O(N^2.7).)
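
A rough sketch of the batching pattern described above, under stated 
assumptions rather than the actual conversion script.  The names 
load_in_batches, records and commit are hypothetical: `records` stands 
for whatever yields (integer key, value) pairs from the XML parser, and 
`commit` would be transaction.commit in real use.

```python
def load_in_batches(records, tree, connection, commit, batch_size=10000):
    """Store (key, value) pairs in `tree`, committing every `batch_size` items.

    Assumptions: `records` is any iterable of (int_key, value) pairs,
    `tree` is the target BTree, `connection` is its ZODB connection, and
    `commit` is a zero-argument callable such as transaction.commit.
    """
    count = 0
    for count, (key, value) in enumerate(records, 1):
        tree[key] = value
        if count % batch_size == 0:
            commit()                    # write the batch to storage
            connection.cacheMinimize()  # then release the cached objects
    # Flush whatever is left in the final, partial batch.
    commit()
    connection.cacheMinimize()
    return count
```

Passing the commit callable in keeps the sketch independent of how the 
transaction is managed; batch_size is the knob that trades memory for 
fewer commits.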

Thanks in advance.