[ZODB-Dev] ZODB Ever-Increasing Memory Usage (even with cache-size-bytes)

Tue May 11 19:37:20 EDT 2010

Hi Jim,

I'm really sorry for the miscommunication; I thought I had made that clear in
my last email:

"I'm wrapping ZODB in a 'ZMap' class that just forwards all the dictionary
methods to the ZODB root and allows easy interchangeability with my old
sqlite OODB abstraction."

wordid_to_docset is a "ZMap", which just wraps the ZODB
boilerplate/connection and forwards dictionary methods to the root.  If this
seems superfluous, it was just to maintain backwards compatibility with all
of the code I'd already written for the sqlite OODB I was using before I
switched to ZODB.  Whenever you see something like wordid_to_docset[id] it's
just doing self.root[id] behind the scenes in a __getitem__ or __setitem__
call inside the ZMap class, which I've pasted below.

The db is just storing longs mapped to array('L')'s with a few thousand
longs in em.  I'm going to try switching to the persistent data structure
that Laurence suggested (a pointer to relevant documentation would be really
useful), but I'm still sorta worried because in my experimentation with ZODB
so far I've never been able to observe it sticking to any cache limits, no
matter how often I tell it to garbage collect (even when storing very small
values that should give it adequate granularity...see my experiment at the
end of my last email).  If the memory reported to the OS by Python 2.6 is
the problem I'd understand, but memory usage goes up the second I start
adding new things (which indicates that Python is asking for more and not
actually freeing internally, no?).
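
A rough sketch of the granularity question, using only the standard library. It assumes that a mapping whose values are plain (non-persistent) objects such as array('L') is written as one pickled record, which is how ZODB stores the root mapping; a cache that evicts whole records then has nothing smaller than the entire mapping to drop. `record_size` and `make_docsets` are hypothetical helpers for illustration, not ZODB API, and the sizes are made-up stand-ins for the real data.

```python
import pickle
from array import array


def record_size(mapping):
    """Size in bytes of a single pickle that holds the whole mapping."""
    return len(pickle.dumps(mapping))


def make_docsets(n_words, docs_per_word=2000):
    """Hypothetical data: integer word ids mapped to array('L') docsets."""
    return dict((wordid, array('L', range(docs_per_word)))
                for wordid in range(n_words))


# One record for 10 words versus one record for 100 words.
small = record_size(make_docsets(10))
large = record_size(make_docsets(100))
print('%d bytes for 10 words, %d bytes for 100 words' % (small, large))
```

If that assumption holds, the single record keeps growing with every key added, so the number of keys, not the size of each value, decides how much has to stay in memory together.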

If you feel there's something pathological about my memory access patterns
in this operation I can just do the actual inversion step in Hadoop and load
the output into ZODB for my application later; I was just hoping to keep all
of my data in OODBs the entire time.

Thanks again all of you for your collective time.  I really like ZODB so
far, and it bugs me that I'm likely screwing it up somewhere.

Cheers,
Ryan

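import transaction
from os import sep  # assumed: `sep` below looks like the path separator
from ZODB import DB
from ZODB.FileStorage import FileStorage

# make_up_name() is not shown in this post.

class ZMap(object):

    __hash__ = None  # can't hash this; must be set on the class to take effect

    def __init__(self, name=None, dbfile=None, cache_size_mb=512,
                 autocommit=True):
        self.name = name
        self.dbfile = dbfile
        self.autocommit = autocommit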

        #first things first, figure out if we need to make up a name
        if self.name is None:
            self.name = make_up_name()
        if sep in self.name:
            if self.name[-1] == sep:
                self.name = self.name[:-1]
            self.name = self.name.split(sep)[-1]

        if self.dbfile is None:
            self.dbfile = self.name + '.zdb'

        self.storage = FileStorage(self.dbfile, pack_keep_old=False)
        self.cache_size = cache_size_mb * 1024 * 1024

        self.db = DB(self.storage, pool_size=1,
                     cache_size_bytes=self.cache_size,
                     historical_cache_size_bytes=self.cache_size,
                     database_name=self.name)
        self.connection = self.db.open()
        self.root = self.connection.root()

        print ('Initializing ZMap "%s" in file "%s" with %dmb cache. '
               'Current %d items'
               % (self.name, self.dbfile, cache_size_mb, len(self.root)))

    # basic operators
    def __eq__(self, y): # x == y
        return self.root.__eq__(y)
    def __ge__(self, y): # x >= y
        return len(self) >= len(y)
    def __gt__(self, y): # x > y
        return len(self) > len(y)
    def __le__(self, y): # x <= y
        return not self.__gt__(y)
    def __lt__(self, y): # x < y
        return not self.__ge__(y)
    def __len__(self): # len(x)
        return len(self.root)

    # dictionary stuff
    def __getitem__(self, key): # x[key]
        return self.root[key]

    def __setitem__(self, key, value): # x[key] = value
        self.root[key] = value
        self.__commit_check() # write back if necessary

    def __delitem__(self, key): # del x[key]
        del self.root[key]

    def get(self, key, default=None): # x[key] if key in x, else default
        return self.root.get(key, default)

    def has_key(self, key): # True if x has key, else False
        return self.root.has_key(key)

    def items(self): # list of key/val pairs
        return self.root.items()

    def keys(self):
        return self.root.keys()

    def pop(self, key, default=None): # remove key and return its value, else default
        return self.root.pop(key, default)

    def popitem(self): #remove and return an arbitrary key/val pair
        return self.root.popitem()

    def setdefault(self, key, default=None):
        #D.setdefault(k[,d]) -> D.get(k,d), also set D[k]=d if k not in D
        return self.root.setdefault(key, default)

    def values(self):
        return self.root.values()

    def copy(self): #copy it? dubiously necessary at the moment
        raise NotImplementedError('copy')

    # iteration
    def __iter__(self): # iter(x)
        return self.root.iterkeys()

    def iteritems(self): #iterator over items, this can be hellaoptimized
        return self.root.iteritems()

    def itervalues(self):
        return self.root.itervalues()

    def iterkeys(self):
        return self.root.iterkeys()

    # practical realities of the abstraction
    def garbage_collect(self):
        self.root._p_jar.cacheGC()
        #self.connection.cacheGC()

    def commit(self):
        return self.__commit_check(force=True)

    def __commit_check(self, force=False):
        if self.autocommit or force:
            transaction.commit()
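
With autocommit on, the commit check above runs one transaction per assignment. A minimal sketch of one way to commit less often follows, assuming a hypothetical `commit_every` threshold; `CommitBatcher` is not part of ZODB or of the class above, and the `commit` callable here is a stand-in where `transaction.commit` would go.

```python
class CommitBatcher(object):
    """Call commit() once every `commit_every` writes instead of on each one."""

    def __init__(self, commit, commit_every=1000):
        self.commit = commit              # e.g. transaction.commit
        self.commit_every = commit_every  # hypothetical threshold
        self.pending = 0                  # writes since the last commit

    def note_write(self):
        """Record one write and commit if the threshold has been reached."""
        self.pending += 1
        if self.pending >= self.commit_every:
            self.flush()

    def flush(self):
        """Commit any writes that are still pending."""
        if self.pending:
            self.commit()
            self.pending = 0


# Stand-in commit function that just counts how often it is called.
commits = []
batcher = CommitBatcher(lambda: commits.append(1), commit_every=100)
for _ in range(250):
    batcher.note_write()
batcher.flush()      # commit the 50 writes left over
print(len(commits))  # 3
```

In a wrapper like ZMap, `note_write()` would take the place of the per-assignment commit check and `flush()` would back the explicit `commit()` method.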

On Tue, May 11, 2010 at 3:50 AM, Jim Fulton <jim at zope.com> wrote:

> On Mon, May 10, 2010 at 8:20 PM, Ryan Noon <rmnoon at gmail.com> wrote:
> > P.S. About the data structures:
> > wordset is a freshly unpickled python set from my old sqlite oodb thingy.
> > The new docsets I'm keeping are 'L' arrays from the stdlib array module.
> > I'm up for using ZODB's builtin persistent data structures if it makes a
> > lot of sense to do so, but it sorta breaks my abstraction a bit and I feel
> > like the memory issues I'm having are somewhat independent of the container
> > data structures (as I'm having the same issue just with fixed size strings).
>
> This is getting tiresome.  We can't really advise you because we can't
> see what data structures you're using and we're wasting too much time
> guessing. We wouldn't have to guess and grill you if you showed a
> complete demonstration program, or at least one that showed what the
> heck you're doing.
>
> The program you've shown so far is so incomplete, perhaps we're
> missing the obvious.
>
> In your original program, you never actually store anything in the
> database. You assign the database root to self.root, but never use
> self.root. (The variable self is not defined and we're left to assume
> that this disembodied code is part of a method definition.) In your
> most recent snippet, you don't show any database access. If you
> never actually store anything in the database, then nothing will be
> removed from memory.
>
> You're inserting data into wordid_to_docset, but you don't show its
> definition and won't tell us what it is.
>
> Jim
>
> --
> Jim Fulton
>

-- 
Ryan Noon
Stanford Computer Science
BS '09, MS '10