[ZODB-Dev] BTree memory bomb

Tim Peters tim at zope.com
Tue Jan 18 10:35:31 EST 2005


[Simon Burton]
> I did a test (below) to see if BTree would unload its objects as it grew
> large.

Generally speaking, unloading happens only at transaction boundaries.
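
A minimal sketch of what that means in practice (reusing the names from your
script below; the cacheMinimize() call is just my suggestion for forcing the
issue, the cache manager normally does this for you at transaction
boundaries):

    # Every persistent object modified inside a transaction is pinned in RAM
    # until the transaction commits; only then can it be ghosted ("unloaded").
    for i in xrange(100000):
        data[i] = 'x' * 128          # all of these stay live in memory...
    get_transaction().commit()       # ...until this point
    connection.cacheMinimize()       # optionally force unloading right away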

> No luck, I killed the script once it had taken 80% of memory.

Reading the code, I'm not surprised -- though note that you didn't say how
much memory you have, so "80%" doesn't tell us much <wink>.

> How do I make a mapping object that can grow arbitrarily large (limited
> by disk space) ?

If this test program is faithful to details of your real application, you've
got real problems.

> This is primarily why I chose ZODB.

...

> from ZODB import FileStorage, DB
> from BTrees.OOBTree import OOBTree
>
> from time import sleep
>
> def main():
>   storage=FileStorage.FileStorage('test.fs')
>   db = DB(storage, cache_size=400)
>   connection = db.open()
>   root = connection.root()
>
>   data = OOBTree()
>
>   root[0] = data
>   print "data:", len(data)
>
>   f = open('/dev/zero')
>
>   for i in xrange(10000):
>     for j in xrange(10000):
>       data[i*10000+j] = f.read(i*128)
>     get_transaction().commit()
>     print "data:", len(data)
>
>   print "sleep"
>   sleep(100)

It's good that you commit after each pass of the inner loop (once per
outer-loop iteration).  That's on the right track.  But note that when the
outer loop counter has value i, a single pass of the inner loop creates
strings of total size 10000 * 128 * i = 1.28e6 * i bytes, or roughly
1.28 * i megabytes.  So by the time i reaches 1000, you need over a gigabyte
of RAM _just_ to hold the raw string data created by that pass of the inner
loop, and the requirement keeps growing with i.  None of it can be unloaded
before the transaction commits.

Do you really intend to commit multi-gigabyte transactions?  If so, you're
going to need a lot of RAM, or use subtransactions to break them into sane
sizes.  If not, you should change your test driver to model what you do
intend.
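
If you do go the subtransaction route, here's a rough sketch of the inner
loop (the commit(1) form is the subtransaction commit as I understand the
current API -- check it against the ZODB release you're running; the
fixed-size 128-byte reads are just to make the point):

    for i in xrange(10000):
        for j in xrange(10000):
            data[i*10000+j] = f.read(128)
            if j % 1000 == 999:
                # Subtransaction commit:  pending changes are pushed out to
                # temporary storage, so the modified objects can be ghosted
                # and RAM use stays bounded.
                get_transaction().commit(1)
        # The real commit makes the whole batch permanent.
        get_transaction().commit()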

There's another detail biting you here:

    print "data:", len(data)

Precisely in order to support scalability across multiple writers, BTrees do
not store their length.  Finding the length of a BTree actually requires
reading the entire thing into memory -- the "len(data)" here is extremely
expensive, in both time and RAM.  For this reason, even if you
changed the inner loop to create an amount of new data independent of the
outer loop iteration, the "len(data)" part alone requires an amount of RAM
(and time) proportional to the number of entries in `data`.

Most apps don't really care how many entries are in their BTrees, so don't
apply len() to them.  For example, your test driver could replace
"len(data)" with "(i+1)*10000" and get the same output.  If your app does
require finding the size of a BTree frequently, then you could investigate
Zope's BTrees.Length class.
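
A minimal sketch of that approach (the `insert` helper and the `size` name
are mine, but BTrees.Length.Length is the real class; remember to store the
Length object persistently, e.g. in the root alongside the tree):

    from BTrees.OOBTree import OOBTree
    from BTrees.Length import Length

    data = OOBTree()
    size = Length()          # persistent counter with conflict resolution

    def insert(key, value):
        # Only count keys that are genuinely new to the tree.
        if not data.has_key(key):
            size.change(1)
        data[key] = value

    # Later, instead of the expensive len(data):
    print "data:", size()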


