[ZODB-Dev] ZODB Ever-Increasing Memory Usage (even with cache-size-bytes)

Ryan Noon rmnoon at gmail.com
Mon May 10 20:16:20 EDT 2010


Hi all,

I've incorporated everybody's advice, but I still can't get memory to obey
cache-size-bytes.  I'm using the new 3.10 from PyPI (but the same behavior
happens on the server where I was using 3.10 from the new Lucid apt repos).

I'm going through a mapping that takes each long-integer "docid" to a
collection of long integers (its "wordset"), and I'm trying to invert it
into a mapping from each "wordid" in those wordsets to a set of the
original docids (its "docset").

I've even tried calling cacheMinimize after every single docset append, but
the memory reported to the OS never goes down and the process keeps
allocating like crazy.

I'm wrapping ZODB in a "ZMap" class that just forwards all the dictionary
methods to the ZODB root and allows easy interchangeability with my old
sqlite OODB abstraction.
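
(The real class has a few extra helpers like garbage_collect, but the idea is
roughly this stripped-down sketch, not the exact code:)

    # Stripped-down sketch of the ZMap idea (illustrative, not the real class).
    # It opens a FileStorage-backed DB and forwards dict-style access to the
    # connection's root mapping.
    import transaction
    from ZODB import DB
    from ZODB.FileStorage import FileStorage

    class ZMap(object):

        def __init__(self, path, cache_size_bytes=64*1024*1024):
            self.db = DB(FileStorage(path),
                         cache_size_bytes=cache_size_bytes)
            self.connection = self.db.open()
            self.root = self.connection.root()

        def __getitem__(self, key):
            return self.root[key]

        def __setitem__(self, key, value):
            self.root[key] = value

        def has_key(self, key):
            return key in self.root

        def commit(self):
            transaction.commit()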

Here's the latest version of my code (lightly instrumented; see below):

        try:
            max_docset_size = 0
            for docid, wordset in docid_to_wordset.iteritems():
                for wordid in wordset:
                    if wordid_to_docset.has_key(wordid):
                        docset = wordid_to_docset[wordid]
                    else:
                        docset = array('L')
                    docset.append(docid)
                    if len(docset) > max_docset_size:
                        max_docset_size = len(docset)
                        print 'Max docset is now %d (owned by wordid %d)' % \
                            (max_docset_size, wordid)
                    wordid_to_docset[wordid] = docset
                    wordid_to_docset.garbage_collect()
                    wordid_to_docset.connection.cacheMinimize()

                n_docs_traversed += 1


                if n_docs_traversed % 100 == 1:
                    status_tick()
                if n_docs_traversed % 50000 == 1:
                    self.do_commit()

            self.do_commit()
        except KeyboardInterrupt, ex:
            self.log_write('Caught keyboard interrupt, committing...')
            self.do_commit()
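
(For what it's worth, if I switch to the TreeSet approach Jim suggests below,
I think the inner loop becomes something like the untested sketch that
follows; I'm guessing at an LOBTree for the outer mapping since it has to
hold persistent sets as values.  A TreeSet should also sidestep the array
issue Jim mentions, since insert() marks the set itself as changed:)

    # Untested sketch of the TreeSet variant (my guess at the types).
    from BTrees.LOBTree import LOBTree
    from BTrees.LLBTree import LLTreeSet

    # wordid_to_docset as an LOBTree: wordid (long) -> LLTreeSet of docids.
    wordid_to_docset = LOBTree()

    for docid, wordset in docid_to_wordset.iteritems():  # same loop as above
        for wordid in wordset:
            docset = wordid_to_docset.get(wordid)
            if docset is None:
                docset = LLTreeSet()
                wordid_to_docset[wordid] = docset
            docset.insert(docid)   # insert() marks the TreeSet itself changed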

I'm keeping track of the largest docset (the biggest single object that
couldn't be paged out), and it's only 10,152 longs (at 8 bytes each
according to the array module's documentation, roughly 80 KB) at the point
75 seconds into the run when the process has allocated 224 MB (with
cache_size_bytes set to 64*1024*1024).


On a lark I just made an empty ZMap in the interpreter and filled it with 1M
unique strings.  It took up something like 190 MB.  I committed it and memory
usage went up to 420 MB.  I then ran cacheMinimize (memory stayed at 420 MB).
 Then I inserted another 1M entries (strings keyed on ints) and memory usage
went up to 820 MB.  Then I committed and memory usage dropped to ~400 MB and
went back up to 833 MB.  Then I ran cacheMinimize again and memory usage
stayed there.  Does this example (totally decoupled from any of my other
operations) make sense to experienced ZODB people?  I really have no working
mental model of ZODB's memory usage patterns.  I love using it, but I really
want to find some way to get its allocations under control.  I'm currently
running this on a MacBook Pro, but it seems to behave the same way on
Windows and Linux.
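
(For reference, that interpreter session was essentially the following,
watching RSS in Activity Monitor; ZMap just forwards to the root, so I'm
showing it with a raw connection:)

    # Rough reconstruction of the interpreter experiment (illustrative).
    import transaction
    from ZODB import DB
    from ZODB.FileStorage import FileStorage

    db = DB(FileStorage('scratch.fs'), cache_size_bytes=64*1024*1024)
    conn = db.open()
    root = conn.root()

    for i in xrange(1000000):               # 1M unique strings
        root[i] = 'unique-string-%d' % i    # RSS ends up around 190 MB
    transaction.commit()                    # RSS climbs to ~420 MB
    conn.cacheMinimize()                    # RSS stays at ~420 MB

    for i in xrange(1000000, 2000000):      # another 1M entries keyed on ints
        root[i] = 'unique-string-%d' % i    # RSS reaches ~820 MB
    transaction.commit()                    # dips to ~400 MB, back to ~833 MB
    conn.cacheMinimize()                    # RSS stays there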

I really appreciate all of the help so far, and if there are any other
pieces of my code that would help, please let me know.

Cheers,
Ryan

On Mon, May 10, 2010 at 3:18 PM, Jim Fulton <jim at zope.com> wrote:

> On Mon, May 10, 2010 at 5:39 PM, Ryan Noon <rmnoon at gmail.com> wrote:
> > First off, thanks everybody.  I'm implementing and testing the
> suggestions
> > now.  When I said ZODB was more complicated than my solution I meant that
> > the system was abstracting a lot more from me than my old code (because I
> > wrote it and knew exactly how to make the cache enforce its limits!).
> >
> >> > The first thing to understand is that options like cache-size and
> >> > cache-size bytes are suggestions, not limits. :)  In particular, they
> >> > are only enforced:
> >> >
> >> > - at transaction boundaries,
> >
> > If it's already being called at transaction boundaries how come memory
> usage
> > doesn't go back down to the quota after the commit (which is only every
> 25k
> > documents?).
>
> Because Python generally doesn't return memory back to the OS. :)
>
> It's also possible you have a problem with one of your data
> structures.  For example if you have an array that grows effectively
> without bound, the array will have to be in memory, no matter how big
> it is.  Also, if the persistent object holding the array isn't seen as
> changed, because you're appending to the array, then the size of the
> array won't be reflected in the cache size. (The size of objects in
> the cache is estimated from their pickle sizes.)
>
> I assume you're using ZODB 3.9.5 or later. If not, there's a bug in
> handling new objects that prevents cache suggestions from working
> properly.
>
> If you don't need list semantics, and set semantics will do, you might
> consider using a BTrees.LLBTree.LLTreeSet, which provides compact,
> scalable persistent sets.  (If your word ids can be signed, you could
> use the IIBTree variety, which is more compact.) Given that the variable
> name is wordset, I assume you're dealing with sets. :)
>
> What is wordid_to_docset? You don't show its creation.
>
> Jim
>
> --
> Jim Fulton
>



-- 
Ryan Noon
Stanford Computer Science
BS '09, MS '10