Thanks Laurence, this looks really helpful.  The simplicity of ZODB&#39;s concept and the joy of using it apparently hide some of the complexity necessary to use it efficiently.  I&#39;ll check this out when I circle back to data stuff tomorrow.<div>


<br></div><div>Have a great morning/day/evening!</div><div>-Ryan<br><br><div class="gmail_quote">On Tue, May 11, 2010 at 5:44 PM, Laurence Rowe <span dir="ltr">&lt;<a href="mailto:l@lrowe.co.uk">l@lrowe.co.uk</a>&gt;</span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">I think this means that you are storing all of your data in a single<br>

persistent object, the database root PersistentMapping. You need to<br>

break up your data into persistent objects (instances of objects that<br>

inherit from persistent.Persistent) for the ZODB to have a chance of<br>

performing memory mapping. You want to do something like:<br>

<br>

import transaction<br>

from ZODB import FileStorage, DB<br>

from BTrees.LOBTree import BTree, TreeSet<br>

storage = FileStorage.FileStorage(&#39;/tmp/test-filestorage.fs&#39;)<br>

db = DB(storage)<br>

conn = db.open()<br>

root = conn.root()<br>

transaction.begin()<br>

index = root[&#39;index&#39;] = BTree()<br>

values = index[1] = TreeSet()<br>

values.add(42)<br>

transaction.commit()<br>
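
As a rough stdlib-only analogy (this uses pickle directly rather than ZODB, and the bucket layout and sizes are made up for illustration): ZODB writes each persistent object as its own pickled record, and plain dicts, lists and arrays are pickled inline with the persistent object that holds them. The sketch below compares how many bytes get rewritten for a one-entry change when everything is one record versus when the data is split into ten records.<br>

```python
import pickle

# Hypothetical data: 1000 word ids, each mapped to a small list of doc ids.
data = {word_id: list(range(10)) for word_id in range(1000)}

# Case 1: everything lives in one object, so it serializes as one record.
# Changing a single entry means writing the whole record again.
single_record = pickle.dumps(data)

# Case 2: split the data into 10 buckets of 100 keys, one record per bucket.
# Changing a single entry only rewrites the bucket that holds it.
buckets = [{} for _ in range(10)]
for word_id, doc_ids in data.items():
    buckets[word_id // 100][word_id] = doc_ids
bucket_records = [pickle.dumps(bucket) for bucket in buckets]

rewritten_single = len(single_record)
rewritten_bucketed = len(bucket_records[0])
print("bytes rewritten, one big object:", rewritten_single)
print("bytes rewritten, one bucket:", rewritten_bucketed)
```

BTrees do this kind of splitting for you, which is why the snippet above stores a BTree in the root instead of putting every key directly into the root mapping.<br>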

<br>

You should probably read:<br>

<a href="http://www.zodb.org/documentation/guide/modules.html#btrees-package" target="_blank">http://www.zodb.org/documentation/guide/modules.html#btrees-package</a>.<br>

Since that was written, L variants of the BTree types have been<br>

introduced for storing 64-bit integers. I&#39;m using an LOBTree because<br>

that maps 64-bit integers to Python objects. For values I&#39;m using an<br>

LOTreeSet, though you could also use an LLTreeSet (which has larger<br>

buckets).<br>
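
To illustrate why the L variants matter for large ids, here is a small stdlib-only sketch (it uses struct rather than BTrees, and big_id is a made-up value): the I variants hold 32-bit signed integers, so a key of 2**31 is already out of range for them, while a 64-bit slot holds it without trouble.<br>

```python
import struct

# Hypothetical key: an id just past the 32-bit signed range.
big_id = 2**31

# "=q" is a signed 64-bit integer in standard size (8 bytes).
packed = struct.pack("=q", big_id)
roundtrip = struct.unpack("=q", packed)[0]
print(len(packed), roundtrip)

# "=i" is a signed 32-bit integer; the same value does not fit.
try:
    struct.pack("=i", big_id)
    fits_in_32 = True
except struct.error:
    fits_in_32 = False
print("fits in 32 bits:", fits_in_32)
```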

<font color="#888888"><br>

Laurence<br>

</font><div><div></div><div class="h5"><br>

On 12 May 2010 00:37, Ryan Noon &lt;<a href="mailto:rmnoon@gmail.com">rmnoon@gmail.com</a>&gt; wrote:<br>

&gt; Hi Jim,<br>

&gt; I&#39;m really sorry for the miscommunication, I thought I made that clear in my<br>

&gt; last email:<br>

&gt; &quot;I&#39;m wrapping ZODB in a &#39;ZMap&#39; class that just forwards all the dictionary<br>

&gt; methods to the ZODB root and allows easy interchangeability with my old<br>

&gt; sqlite OODB abstraction.&quot;<br>

&gt; wordid_to_docset is a &quot;ZMap&quot;, which just wraps the ZODB<br>

&gt; boilerplate/connection and forwards dictionary methods to the root.  If this<br>

&gt; seems superfluous, it was just to maintain backwards compatibility with all<br>

&gt; of the code I&#39;d already written for the sqlite OODB I was using before I<br>

&gt; switched to ZODB.  Whenever you see something like wordid_to_docset[id] it&#39;s<br>

&gt; just doing self.root[id] behind the scenes in a __setitem__ call inside the<br>

&gt; ZMap class, which I&#39;ve pasted below.<br>

&gt; The db is just storing longs mapped to array(&#39;L&#39;)&#39;s with a few thousand<br>

&gt; longs in em.  I&#39;m going to try switching to the persistent data structure<br>

&gt; that Laurence suggested (a pointer to relevant documentation would be really<br>

&gt; useful), but I&#39;m still sorta worried because in my experimentation with ZODB<br>

&gt; so far I&#39;ve never been able to observe it sticking to any cache limits, no<br>

&gt; matter how often I tell it to garbage collect (even when storing very small<br>

&gt; values that should give it adequate granularity...see my experiment at the<br>

&gt; end of my last email).  If the memory reported to the OS by Python 2.6 is<br>

&gt; the problem I&#39;d understand, but memory usage goes up the second I start<br>

&gt; adding new things (which indicates that Python is asking for more and not<br>

&gt; actually freeing internally, no?).<br>

&gt; If you feel there&#39;s something pathological about my memory access patterns<br>

&gt; in this operation I can just do the actual inversion step in Hadoop and load<br>

&gt; the output into ZODB for my application later, I was just hoping to keep all<br>

&gt; of my data in OODB&#39;s the entire time.<br>

&gt; Thanks again all of you for your collective time.  I really like ZODB so<br>

&gt; far, and it bugs me that I&#39;m likely screwing it up somewhere.<br>

&gt; Cheers,<br>

&gt; Ryan<br>

&gt;<br>

&gt;<br>

&gt; class ZMap(object):<br>

&gt;<br>

&gt;     def __init__(self, name=None, dbfile=None, cache_size_mb=512,<br>

&gt; autocommit=True):<br>

&gt;         self.name = name<br>

&gt;         self.dbfile = dbfile<br>

&gt;         self.autocommit = autocommit<br>

&gt;<br>

&gt;         self.__hash__ = None #can&#39;t hash this<br>

&gt;<br>

&gt;         #first things first, figure out if we need to make up a name<br>

&gt;         if self.name == None:<br>

&gt;             self.name = make_up_name()<br>

&gt;         if sep in self.name:<br>

&gt;             if self.name[-1] == sep:<br>

&gt;                 self.name = self.name[:-1]<br>

&gt;             self.name = self.name.split(sep)[-1]<br>

&gt;<br>

&gt;<br>

&gt;         if self.dbfile == None:<br>

&gt;             self.dbfile = self.name + &#39;.zdb&#39;<br>

&gt;<br>

&gt;         self.storage = FileStorage(self.dbfile, pack_keep_old=False)<br>

&gt;         self.cache_size = cache_size_mb * 1024 * 1024<br>

&gt;<br>

&gt;         self.db = DB(self.storage, pool_size=1,<br>

&gt; cache_size_bytes=self.cache_size,<br>

&gt; historical_cache_size_bytes=self.cache_size, database_name=self.name)<br>

&gt;         self.connection = self.db.open()<br>

&gt;         self.root = self.connection.root()<br>

&gt;<br>

&gt;         print &#39;Initializing ZMap &quot;%s&quot; in file &quot;%s&quot; with %dmb cache. Current<br>

&gt; %d items&#39; % (self.name, self.dbfile, cache_size_mb, len(self.root))<br>

&gt;<br>

&gt;     # basic operators<br>

&gt;     def __eq__(self, y): # x == y<br>

&gt;         return self.root.__eq__(y)<br>

&gt;     def __ge__(self, y): # x &gt;= y<br>

&gt;         return len(self) &gt;= len(y)<br>

&gt;     def __gt__(self, y): # x &gt; y<br>

&gt;         return len(self) &gt; len(y)<br>

&gt;     def __le__(self, y): # x &lt;= y<br>

&gt;         return not self.__gt__(y)<br>

&gt;     def __lt__(self, y): # x &lt; y<br>

&gt;         return not self.__ge__(y)<br>

&gt;     def __len__(self): # len(x)<br>

&gt;         return len(self.root)<br>

&gt;<br>

&gt;<br>

&gt;     # dictionary stuff<br>

&gt;     def __getitem__(self, key): # x[key]<br>

&gt;         return self.root[key]<br>

&gt;     def __setitem__(self, key, value): # x[key] = value<br>

&gt;         self.root[key] = value<br>

&gt;         self.__commit_check() # write back if necessary<br>

&gt;<br>

&gt;     def __delitem__(self, key): # del x[key]<br>

&gt;         del self.root[key]<br>

&gt;<br>

&gt;     def get(self, key, default=None): # x[key] if key in x, else default<br>

&gt;         return self.root.get(key, default)<br>

&gt;     def has_key(self, key): # True if x has key, else False<br>

&gt;         return self.root.has_key(key)<br>

&gt;     def items(self): # list of key/val pairs<br>

&gt;         return self.root.items()<br>

&gt;     def keys(self):<br>

&gt;         return self.root.keys()<br>

&gt;     def pop(self, key, default=None):<br>

&gt;         return self.root.pop(key, default)<br>

&gt;     def popitem(self): #remove and return an arbitrary key/val pair<br>

&gt;         return self.root.popitem()<br>

&gt;     def setdefault(self, key, default=None):<br>

&gt;         #D.setdefault(k[,d]) -&gt; D.get(k,d), also set D[k]=d if k not in D<br>

&gt;         return self.root.setdefault(key, default)<br>

&gt;     def values(self):<br>

&gt;         return self.root.values()<br>

&gt;<br>

&gt;     def copy(self): #copy it? dubiously necessary at the moment<br>

&gt;         NOT_IMPLEMENTED(&#39;copy&#39;)<br>

&gt;<br>

&gt;<br>

&gt;     # iteration<br>

&gt;     def __iter__(self): # iter(x)<br>

&gt;         return self.root.iterkeys()<br>

&gt;<br>

&gt;     def iteritems(self): #iterator over items, this can be hellaoptimized<br>

&gt;         return self.root.iteritems()<br>

&gt;<br>

&gt;     def itervalues(self):<br>

&gt;         return self.root.itervalues()<br>

&gt;     def iterkeys(self):<br>

&gt;         return self.root.iterkeys()<br>

&gt;<br>

&gt;     # practical realities of the abstraction<br>

&gt;     def garbage_collect(self):<br>

&gt;         self.root._p_jar.cacheGC()<br>

&gt;         #self.connection.cacheGC()<br>

&gt;<br>

&gt;     def commit(self):<br>

&gt;         return self.__commit_check(force=True)<br>

&gt;<br>

&gt;     def __commit_check(self, force=False):<br>

&gt;         if self.autocommit or force:<br>

&gt;             transaction.commit()<br>

&gt;<br>

&gt;<br>

&gt; On Tue, May 11, 2010 at 3:50 AM, Jim Fulton &lt;<a href="mailto:jim@zope.com">jim@zope.com</a>&gt; wrote:<br>

&gt;&gt;<br>

&gt;&gt; On Mon, May 10, 2010 at 8:20 PM, Ryan Noon &lt;<a href="mailto:rmnoon@gmail.com">rmnoon@gmail.com</a>&gt; wrote:<br>

&gt;&gt; &gt; P.S. About the data structures:<br>

&gt;&gt; &gt; wordset is a freshly unpickled python set from my old sqlite oodb<br>

&gt;&gt; &gt; thingy.<br>

&gt;&gt; &gt; The new docsets I&#39;m keeping are &#39;L&#39; arrays from the stdlib array module.<br>

&gt;&gt; &gt;  I&#39;m up for using ZODB&#39;s builtin persistent data structures if it makes<br>

&gt;&gt; &gt; a<br>

&gt;&gt; &gt; lot of sense to do so, but it sorta breaks my abstraction a bit and I<br>

&gt;&gt; &gt; feel<br>

&gt;&gt; &gt; like the memory issues I&#39;m having are somewhat independent of the<br>

&gt;&gt; &gt; container<br>

&gt;&gt; &gt; data structures (as I&#39;m having the same issue just with fixed size<br>

&gt;&gt; &gt; strings).<br>

&gt;&gt;<br>

&gt;&gt; This is getting tiresome.  We can&#39;t really advise you because we can&#39;t<br>

&gt;&gt; see what data structures you&#39;re using and we&#39;re wasting too much time<br>

&gt;&gt; guessing. We wouldn&#39;t have to guess and grill you if you showed a<br>

&gt;&gt; complete demonstration program, or at least one that showed what the<br>

&gt;&gt; heck you&#39;re doing.<br>

&gt;&gt;<br>

&gt;&gt; The program you&#39;ve shown so far is so incomplete, perhaps we&#39;re<br>

&gt;&gt; missing the obvious.<br>

&gt;&gt;<br>

&gt;&gt; In your original program, you never actually store anything in the<br>

&gt;&gt; database. You assign the database root to self.root, but never use<br>

&gt;&gt; self.root. (The variable self is not defined and we&#39;re left to assume<br>

&gt;&gt; that this disembodied code is part of a method definition.) In your<br>

&gt;&gt; most recent snippet, you don&#39;t show any database access. If you<br>

&gt;&gt; never actually store anything in the database, then nothing will be<br>

&gt;&gt; removed from memory.<br>

&gt;&gt;<br>

&gt;&gt; You&#39;re inserting data into wordid_to_docset, but you don&#39;t show its<br>

&gt;&gt; definition and won&#39;t tell us what it is.<br>

&gt;&gt;<br>

&gt;&gt; Jim<br>

&gt;&gt;<br>

&gt;&gt; --<br>

&gt;&gt; Jim Fulton<br>

&gt;<br>

&gt;<br>

&gt;<br>

&gt; --<br>

&gt; Ryan Noon<br>

&gt; Stanford Computer Science<br>

&gt; BS &#39;09, MS &#39;10<br>

&gt;<br>

</div></div><div><div></div><div class="h5">&gt; _______________________________________________<br>

&gt; For more information about ZODB, see the ZODB Wiki:<br>

&gt; <a href="http://www.zope.org/Wikis/ZODB/" target="_blank">http://www.zope.org/Wikis/ZODB/</a><br>

&gt;<br>

&gt; ZODB-Dev mailing list  -  <a href="mailto:ZODB-Dev@zope.org">ZODB-Dev@zope.org</a><br>

&gt; <a href="https://mail.zope.org/mailman/listinfo/zodb-dev" target="_blank">https://mail.zope.org/mailman/listinfo/zodb-dev</a><br>

&gt;<br>

&gt;<br>

</div></div></blockquote></div><br><br clear="all"><br>-- <br>Ryan Noon<br>Stanford Computer Science<br>BS &#39;09, MS &#39;10<br>

</div>