[Zope-dev] Re: Zcatalog bloat problem (berkeleydb is a solution?)

Giovanni Maruzzelli maruzz@open4.it
Tue, 26 Jun 2001 18:33:43 +0200


We think that Abel is absolutely right:

If, in an almost empty folder, we add and catalog an object containing a
single word (we have now optimized and reduced the number of indexes to
11), it produces a transaction of 73K. If the object contains 300 words,
with the same indexes and properties, the transaction is 224K; and if
everything else is the same but the object contains 535 words, the
transaction is 331K.

And we are currently using a catalog with only some 3000 documents
indexed, with a mean document length of around 1K.
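A back-of-the-envelope check of these figures (the measurements are the ones reported above; the per-word arithmetic is my own) suggests the transaction size grows roughly linearly with the word count:

```python
# Reported transaction sizes (word count, size in KB) from the
# measurements above.
measurements = [(1, 73), (300, 224), (535, 331)]

# Incremental cost between successive measurements, in KB per word:
for (w1, s1), (w2, s2) in zip(measurements, measurements[1:]):
    per_word = (s2 - s1) / (w2 - w1)
    print(f"{w1} -> {w2} words: ~{per_word:.2f} KB per extra word")
# Both intervals come out at roughly half a kilobyte per extra word,
# on top of a fixed overhead of about 73 KB per transaction.
```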

-giovanni

> Well, I'm not very familiar with the details about the sub-object
> management of ObjectManager and friends. Moreover, I had yet a closer
> look only into UnTextIndex, but not into UnIndex or UnKeywordIndex. So
> take my comments with a grain of salt.
>
> A text index (class SearchIndex.UnTextIndex) is definitely a cause of
> bloating, if you use CatalogAware objects. An UnTextIndex maintains for
> each word a list of documents, where this word appears. So, if a
> document to be indexed contains, say, 100 words, 100 IIBTrees
> (containing mappings documentId -> word score) will be updated. (see
> UnTextIndex.insertForwardIndexEntry) If you have a larger number of
> documents, these mappings may be quite large: Assume 10,000 documents,
> and assume that you have 10 words which appear in 30% of all documents.
> Hence, each of the IIBTrees for these words contains 3000 entries. (Ok,
> one can try to keep this number of frequent words low by using a "good"
> stop word list, but at least for German, such a list is quite difficult
> to build. And one can argue that many "not too really frequent" words
> should be indexed in order to allow more precise phrase searches.) I
> don't know the details of how data is stored inside the BTrees, so I
> can give only a rough estimate of the memory requirements: with 32-bit
> integers, we have at least 8 bytes per IIBTree entry (documentId and
> score), so each of the 10 BTrees for the "frequent words" has a minimum
> length of 3000*8 = 24000 bytes.
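A minimal sketch of the forward-index shape Abel describes, using plain dicts as stand-ins for IIBTrees (the real code lives in SearchIndex.UnTextIndex.insertForwardIndexEntry; the function name and sizes here are illustrative):

```python
# Toy forward index: word -> {documentId: score}. Plain dicts stand in
# for the IIBTrees used by SearchIndex.UnTextIndex.
forward_index = {}

def insert_forward_entry(word, doc_id, score, index=forward_index):
    # Each word maps to its own documentId -> score table; indexing a
    # document with N distinct words touches N of these tables.
    index.setdefault(word, {})[doc_id] = score

# 10,000 documents, with one frequent word appearing in 30% of them:
for doc_id in range(10000):
    if doc_id % 10 < 3:                 # ~30% of documents
        insert_forward_entry("frequent", doc_id, 1)

entries = len(forward_index["frequent"])
print(entries, "entries,", entries * 8, "bytes minimum at 8 bytes/entry")
# -> 3000 entries, 24000 bytes minimum
```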
>
> If you now add a new document containing 5 of these frequent words, 5
> larger BTrees will be updated. [Chris, let me know, if I'm now going to
> tell nonsense...] I assume that the entire updated BTrees (120000
> bytes in total) will be appended to the ZODB (ignoring the less
> frequent words), even if the document contains only 1 kB of text.
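The arithmetic behind that estimate, spelled out (this is my reading of Abel's assumption that each changed tree is appended whole):

```python
# A new document touching 5 "frequent word" trees of ~24000 bytes each:
frequent_words_touched = 5
bytes_per_tree = 3000 * 8            # 3000 entries * 8 bytes/entry
appended = frequent_words_touched * bytes_per_tree
print(appended)                      # -> 120000 bytes for a ~1 kB document
```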
>
> This is the reason why I'm working on some kind of "lazy cataloging".
> My approach is to use a Python class (or Base class, if ZClasses are
> involved), which has a method manage_afterAdd. This method looks for
> superValues of a type like "lazyCatalog" (derived from ZCatalog), and
> inserts self.getPhysicalPath() into the update list of each found
> "lazyCatalog".
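A self-contained sketch of that registration step. Everything here is a plain-Python stand-in: `manage_afterAdd`, `superValues`, and `getPhysicalPath` are the real Zope hooks, but the container traversal and class names are simplified for illustration:

```python
class LazyCatalog:
    """Stand-in for a ZCatalog subclass with a deferred-update list."""
    meta_type = "lazyCatalog"

    def __init__(self):
        self.update_list = []

class LazyCataloged:
    """Stand-in for the class whose manage_afterAdd registers the
    object's path with every enclosing lazyCatalog."""

    def __init__(self, path, containers):
        self.path = path
        self.containers = containers   # enclosing folders, innermost first

    def getPhysicalPath(self):
        return self.path

    def manage_afterAdd(self):
        # Stand-in for superValues("lazyCatalog"): look through the
        # containment hierarchy for catalogs and queue our path there.
        for container in self.containers:
            for obj in container:
                if getattr(obj, "meta_type", None) == "lazyCatalog":
                    obj.update_list.append(self.getPhysicalPath())

catalog = LazyCatalog()
doc = LazyCataloged(("", "folder", "doc1"), containers=[[catalog]])
doc.manage_afterAdd()
print(catalog.update_list)  # -> [('', 'folder', 'doc1')]
```

The point of the design is that adding the object only appends one small path to a list; the expensive BTree updates are postponed.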
>
> Later, a "lazyCatalog" can index all objects in this list. Then the
> bloating happens either in RAM (without subtransactions), or in a
> temporary file, if you use subtransactions.
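The deferred pass could then drain the list in batches, committing a subtransaction every N objects so the work-in-progress state is bounded. A sketch under stated assumptions: `index_object` and `commit_subtransaction` are placeholder callables, not the real ZCatalog or transaction APIs:

```python
def drain_update_list(update_list, index_object, commit_subtransaction,
                      batch_size=100):
    """Index all queued paths, committing a subtransaction per batch
    (placeholder callables stand in for the real Zope machinery)."""
    for i, path in enumerate(update_list, start=1):
        index_object(path)
        if i % batch_size == 0:
            commit_subtransaction()
    commit_subtransaction()          # flush the final partial batch
    del update_list[:]               # queue fully processed

indexed, commits = [], []
drain_update_list(
    ["/doc%d" % n for n in range(250)],
    index_object=indexed.append,
    commit_subtransaction=lambda: commits.append(1),
    batch_size=100,
)
print(len(indexed), "indexed,", len(commits), "subtransaction commits")
# -> 250 indexed, 3 subtransaction commits
```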
>
> OK, another approach which fits your (Giovanni's) needs better might
> be to use a database other than ZODB, but I'm afraid that even then
> "instant indexing" will be an expensive process, if you have a large
> number of documents.
>
> Abel