[ZODB-Dev] ZODB Ever-Increasing Memory Usage (even with cache-size-bytes)

Jim Fulton jim at zope.com
Mon May 10 16:58:36 EDT 2010


On Mon, May 10, 2010 at 3:27 PM, Ryan Noon <rmnoon at gmail.com> wrote:
> Hi everyone,
> I recently switched over some of my home-rolled sqlite backed object
> databases into ZODB based on what I'd read and some cool performance numbers
> I'd seen.  I'm really happy with the entire system so far except for one
> really irritating problem: memory usage.
> I'm doing a rather intensive operation where I'm inverting a mapping of the
> form (docid => [wordid]) for about 3 million documents (for about 8 million
> unique words).  I thought about doing it on hadoop, but it's a one time
> thing and it'd be nice if I didn't have to load the data back into an object
> database for my application at the end anyway.
> Anyhoo, in the process of this operation (which performs much faster than my
> sqlite+python cache solution) memory usage never really drops.  I'm
> currently doing a commit every 25k documents.   The python process just
> gobbles up RAM, though.  I made it through 750k documents before my 8GB
> Ubuntu 10.04 server choked and killed the process (at about 80 percent mem
> usage).  (The same thing happens on Windows and OSX, btw).
> I figure either there's a really tremendous bug in ZODB (unlikely given its
> age and venerability) or I'm really doing it wrong.  Here's my code:
>
>         self.storage = FileStorage(self.dbfile, pack_keep_old=False)
>         cache_size = 512 * 1024 * 1024
>
>         self.db = DB(self.storage, pool_size=1, cache_size_bytes=cache_size,
>                      historical_cache_size_bytes=cache_size,
>                      database_name=self.name)
>         self.connection = self.db.open()
>         self.root = self.connection.root()
>
> and the actual insertions...
>             # (i can be kinda pathological with loop operations)
>             set_default = wordid_to_docset.root.setdefault
>             array_append = array.append
>             # docid_to_wordset is one of my older sqlite oodb's, not
>             # maintaining a cache... just iterating (small constant mem usage)
>             for docid, wordset in docid_to_wordset.iteritems():
>                 for wordid in wordset:
>                     docset = set_default(wordid, array('L'))
>                     array_append(docset, docid)
>
>                 n_docs_traversed += 1
>                 if n_docs_traversed % 1000 == 1:
>                     status_tick()
>                 if n_docs_traversed % 25000 == 1:
>                     # just commits the oodb by calling transaction.commit()
>                     self.do_commit()
> The DB on the choked process is perfectly good up to the last commit when it
> choked, and I've even tried extremely small values of cache_size_bytes and
> cache_size, just to see if I can get it to stop allocating memory and
> nothing seems to work.  I've also used string values ('128mb') for
> cache-size-bytes, etc.
>
> Can somebody help me out?

The first thing to understand is that options like cache-size and
cache-size-bytes are suggestions, not limits. :)  In particular, they
are only enforced:

- at transaction boundaries,

- when an application creates a savepoint,

- or when an application invokes garbage collection explicitly via the
  cacheGC or cacheMinimize methods.

Note that objects that have been modified but not committed won't be
freed even if the suggestions are exceeded.
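
To make those enforcement points concrete, here's a minimal sketch of
the ways you can trigger cache trimming yourself, assuming `connection`
is the Connection you opened in your setup code:

    import transaction

    # 1. Transaction boundary: the cache is trimmed after the commit.
    transaction.commit()

    # 2. Savepoint: changed objects are written to a temporary file,
    #    so they become eligible for trimming even before the commit.
    transaction.savepoint(True)

    # 3. Explicit garbage collection on the connection:
    connection.cacheGC()        # trim toward cache-size / cache-size-bytes
    connection.cacheMinimize()  # ghostify everything that can be ghosted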

The reason that ZODB never frees objects on its own is that doing so
could lead to surprising changes to object state and subtle
bugs. Consider:

    def append(self, item):
        self._data.append(item) # self._data is just a plain Python list
        # At this point, ZODB doesn't know that self has changed.
        # If ZODB was willing to free an object whenever it wanted to,
        # self could be freed here, losing the change to self._data.
        self._length += 1
        # Now self is marked as changed, but too late if self was
        # freed above.

Also note that memory allocated by Python is generally not returned to
the OS when freed.

Calling cacheGC at transaction boundaries won't buy you anything.
It's already called then. :)

In your script, I'd recommend calling cacheGC after processing each
document:

   root._p_jar.cacheGC()

This will keep the cache full, which will hopefully help performance
without letting it grow far out of bounds.
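
For example, here's a rough sketch based on the loop you posted (not a
drop-in replacement; `wordid_to_docset`, `docid_to_wordset`, and
`self.do_commit` are your names, not ZODB API):

    from array import array

    root = wordid_to_docset.root
    set_default = root.setdefault
    n_docs_traversed = 0

    for docid, wordset in docid_to_wordset.iteritems():
        for wordid in wordset:
            docset = set_default(wordid, array('L'))
            docset.append(docid)

        n_docs_traversed += 1
        if n_docs_traversed % 25000 == 1:
            self.do_commit()       # transaction.commit(); trims the cache
        else:
            root._p_jar.cacheGC()  # trim toward the configured cache size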

Jim

--
Jim Fulton

