P.S. About the data structures:

wordset is a freshly unpickled Python set from my old sqlite OODB thingy.

The new docsets I'm keeping are 'L' arrays from the stdlib array module. I'm up for using ZODB's builtin persistent data structures if it makes a lot of sense to do so, but it sorta breaks my abstraction a bit, and I feel like the memory issues I'm having are somewhat independent of the container data structures (as I'm having the same issue just with fixed-size strings).

Thanks!
-Ryan

On Mon, May 10, 2010 at 5:16 PM, Ryan Noon <rmnoon@gmail.com> wrote:

Hi all,

I've incorporated everybody's advice, but I still can't get memory to obey cache-size-bytes. I'm using the new 3.10 from PyPI (but the same behavior happens on the server where I was using 3.10 from the new lucid apt repos).

I'm going through a mapping where we take one long integer ("docid") and map it to a collection of long integers (a "wordset" of "wordids"), and trying to invert it into a mapping from each wordid in those wordsets to a set of the original docids (a "docset").

I've even tried calling cacheMinimize after every single docset append, but reported memory to the OS never goes down and the process continues to allocate like crazy.

I'm wrapping ZODB in a "ZMap" class that just forwards all the dictionary methods to the ZODB root and allows easy interchangeability with my old sqlite OODB abstraction.
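
Roughly, the wrapper looks like this (a minimal sketch, not the real class — the FileStorage setup and the garbage_collect helper are assumptions based on how it's called below):

    # Sketch of the kind of dict-forwarding wrapper described above (not the real ZMap).
    from ZODB import DB
    from ZODB.FileStorage import FileStorage
    import transaction

    class ZMap(object):
        def __init__(self, path, cache_size_bytes=64*1024*1024):
            storage = FileStorage(path)
            self.db = DB(storage, cache_size_bytes=cache_size_bytes)
            self.connection = self.db.open()
            self.root = self.connection.root()

        def __getitem__(self, key):
            return self.root[key]

        def __setitem__(self, key, value):
            self.root[key] = value

        def has_key(self, key):
            return key in self.root

        def garbage_collect(self):
            # assumption: nudge the pickle cache back toward its target size
            self.connection.cacheGC()

        def commit(self):
            transaction.commit()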

Here's the latest version of my code (minorly instrumented... see below):

    try:
        max_docset_size = 0
        for docid, wordset in docid_to_wordset.iteritems():
            for wordid in wordset:
                if wordid_to_docset.has_key(wordid):
                    docset = wordid_to_docset[wordid]
                else:
                    docset = array('L')
                docset.append(docid)
                if len(docset) > max_docset_size:
                    max_docset_size = len(docset)
                    print 'Max docset is now %d (owned by wordid %d)' % (max_docset_size, wordid)
                wordid_to_docset[wordid] = docset
                wordid_to_docset.garbage_collect()
                wordid_to_docset.connection.cacheMinimize()

            n_docs_traversed += 1

            if n_docs_traversed % 100 == 1:
                status_tick()
            if n_docs_traversed % 50000 == 1:
                self.do_commit()

        self.do_commit()
    except KeyboardInterrupt, ex:
        self.log_write('Caught keyboard interrupt, committing...')
        self.do_commit()

I'm keeping track of the greatest docset (which would be the largest possible thing not able to be paged out), and it's only 10,152 longs (at 8 bytes each according to the array module's documentation, so roughly 80 KB) at the point 75 seconds into the operation when the process has allocated 224 MB (on a cache_size_bytes of 64*1024*1024).

On a lark I just made an empty ZMap in the interpreter and filled it with 1M unique strings. It took up something like 190 MB. I committed it and mem usage went up to 420 MB. I then ran cacheMinimize (memory stayed at 420 MB). Then I inserted another 1M entries (strings keyed on ints) and mem usage went up to 820 MB. Then I committed, and memory usage dropped to ~400 MB and went back up to 833 MB. Then I ran cacheMinimize again and memory usage stayed there. Does this example (totally decoupled from any other operations by me) make sense to experienced ZODB people? I have really no functional mental model of ZODB's memory usage patterns. I love using it, but I really want to find some way to get its allocations under control. I'm currently running this on a MacBook Pro, but it seems to be behaving the same way on Windows and Linux.
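
In rough code, the session was something like this (a reconstruction, not a transcript; the constructor arguments and key formats are placeholders):

    # Rough reconstruction of the interpreter test described above.
    import transaction

    zm = ZMap('string_test.fs', cache_size_bytes=64*1024*1024)

    for i in xrange(1000000):                 # 1M unique strings
        zm['key-%d' % i] = 'value-%d' % i     # resident memory climbs to ~190 MB

    transaction.commit()                      # jumps to ~420 MB
    zm.connection.cacheMinimize()             # stays at ~420 MB

    for i in xrange(1000000):                 # another 1M entries, strings keyed on ints
        zm[i] = 'value-%d' % i                # climbs to ~820 MB

    transaction.commit()                      # drops to ~400 MB, then back up to ~833 MB
    zm.connection.cacheMinimize()             # memory stays there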

I really appreciate all of the help so far, and if there are any other pieces of my code that might help, please let me know.

Cheers,
Ryan
<br><div class="gmail_quote">
On Mon, May 10, 2010 at 3:18 PM, Jim Fulton <span dir="ltr"><<a href="mailto:jim@zope.com" target="_blank">jim@zope.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div>On Mon, May 10, 2010 at 5:39 PM, Ryan Noon <<a href="mailto:rmnoon@gmail.com" target="_blank">rmnoon@gmail.com</a>> wrote:<br>
> First off, thanks everybody. I'm implementing and testing the suggestions
> now. When I said ZODB was more complicated than my solution I meant that
> the system was abstracting a lot more from me than my old code (because I
> wrote it and knew exactly how to make the cache enforce its limits!).
>
>> > The first thing to understand is that options like cache-size and
>> > cache-size-bytes are suggestions, not limits. :)  In particular, they
>> > are only enforced:
>> >
>> > - at transaction boundaries,
>
> If it's already being called at transaction boundaries, how come memory usage
> doesn't go back down to the quota after the commit (which is only every 25k
> documents)?

Because Python generally doesn't return memory back to the OS. :)

It's also possible you have a problem with one of your data
structures. For example if you have an array that grows effectively
without bound, the array will have to be in memory, no matter how big
it is. Also, if the persistent object holding the array isn't seen as
changed, because you're appending to the array, then the size of the
array won't be reflected in the cache size. (The size of objects in
the cache is estimated from their pickle sizes.)
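
A minimal sketch of that pitfall, with a made-up holder class (the names are illustrative):

    # Appending to a plain (non-persistent) array does not mark the owning
    # persistent object as changed, so ZODB neither saves the new data at
    # commit nor re-estimates the object's size in the cache.
    from array import array
    from persistent import Persistent

    class DocsetHolder(Persistent):      # illustrative name
        def __init__(self):
            self.docids = array('L')

    holder = DocsetHolder()
    # ... holder is added to the root and committed ...
    holder.docids.append(12345)          # in-place mutation; ZODB doesn't notice
    holder._p_changed = True             # explicitly mark the object dirty so the
                                         # append is persisted and its size updated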

I assume you're using ZODB 3.9.5 or later. If not, there's a bug in
handling new objects that prevents cache suggestions from working
properly.

If you don't need list semantics, and set semantics will do, you might
consider using a BTrees.LLBTree.TreeSet, which provides compact,
scalable persistent sets. (If your word ids can be signed, you could
use the IIBTree variety, which is more compact.) Given that the variable
name is wordset, I assume you're dealing with sets. :)
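
For instance, a docset built that way might look roughly like this (the surrounding names mirror the loop above and are illustrative):

    # Sketch: a scalable persistent TreeSet per docset instead of array('L').
    from BTrees.LLBTree import LLTreeSet

    if wordid_to_docset.has_key(wordid):
        docset = wordid_to_docset[wordid]
    else:
        docset = LLTreeSet()
        wordid_to_docset[wordid] = docset
    # Set semantics: duplicate docids are ignored, and storage stays compact.
    # Because the TreeSet is itself a persistent object, inserts register their
    # own changes; no manual _p_changed or reassignment into the mapping needed.
    docset.insert(docid)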

What is wordid_to_docset? You don't show its creation.

Jim

--
Jim Fulton

--
Ryan Noon
Stanford Computer Science
BS '09, MS '10

--
Ryan Noon
Stanford Computer Science
BS '09, MS '10