[Zope] Re: [Zope-dev] Re: Zcatalog bloat problem (berkeleydb is a solution?)

Tue, 26 Jun 2001 09:31:02 -0400

abel deuring wrote:
> A text index (class SearchIndex.UnTextIndex) is definitely a cause of
> bloating, if you use CatalogAware objects. An UnTextIndex maintains for

Right. If you don't use CatalogAware, however, and don't unindex before
reindexing an object, you should see much less bloat, because the only
things that are supposed to be updated then are the indexes and metadata
whose data has actually changed.
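
To illustrate the idea (not the actual ZCatalog code), here is a minimal
sketch in plain Python. The dicts and the changed_indexes helper are
hypothetical stand-ins for an object's old and new indexed values:

```python
_MISSING = object()  # sentinel so a missing key differs from a stored None


def changed_indexes(old_values, new_values):
    """Return the sorted names of indexes whose value differs."""
    names = set(old_values) | set(new_values)
    return sorted(
        name
        for name in names
        if old_values.get(name, _MISSING) != new_values.get(name, _MISSING)
    )


# Hypothetical snapshots of one object's indexed attributes.
old = {"title": "Report", "body": "first draft", "modified": 1}
new = {"title": "Report", "body": "second draft", "modified": 2}

# Only these indexes would need to be written again.
to_update = changed_indexes(old, new)
```

In this sketch the unchanged "title" index is never touched, so nothing
would be written for it.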

> each word a list of the documents in which this word appears. So, if a
> document to be indexed contains, say, 100 words, 100 IIBTrees
> (containing mappings documentId -> word score) will be updated. (see
> UnTextIndex.insertForwardIndexEntry) If you have a larger number of
> documents, these mappings may be quite large: Assume 10,000 documents,
> and assume that you have 10 words which appear in 30% of all documents.
> Hence, each of the IIBTrees for these words contains 3000 entries. (Ok,
> one can try to keep this number of frequent words low by using a "good"
> stop word list, but at least for German, such a list is quite difficult
> to build. And one can argue that many "not too really frequent" words
> should be indexed in order to allow more precise phrase searches.) I don't
> know the details of how data is stored inside the BTrees, so I can give
> only a rough estimate of the memory requirements: With 32 bit integers,
> we have at least 8 bytes per IIBTree entry (documentId and score), so
> each of the 10 BTrees for the "frequent words" has a minimum length of
> 3000*8 = 24000 bytes.
> 
> If you now add a new document containing 5 of these frequent words, 5
> larger BTrees will be updated. [Chris, let me know if I'm talking
> nonsense here...] I assume that the entire updated BTrees (5 * 24000 =
> 120000 bytes) will be appended to the ZODB (ignoring the less frequent
> words) -- even if the document contains only 1 kB of text.

Nah... I don't think so.  At least I hope not!  Each bucket in a BTree
is a separate persistent object.  So only the sum of the data in the
updated buckets will be appended to the ZODB.  So if you add an item to
a BTree, you don't add 24000+ bytes for each update.  You just add the
amount of space taken up by the bucket... unfortunately I don't know
exactly how much this is, but I'd imagine it's pretty close to the
datasize with only a little overhead.
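
As a rough back-of-the-envelope comparison, the sketch below puts the two
estimates side by side. The 8 bytes per entry come from the estimate
quoted above; BUCKET_CAPACITY is purely an assumed number for
illustration (I don't know the real one), and pickle overhead and bucket
splits are ignored:

```python
ENTRY_BYTES = 8        # 32-bit documentId + 32-bit score, as estimated above
BUCKET_CAPACITY = 60   # ASSUMPTION: hypothetical maximum entries per bucket


def whole_tree_bytes(entries):
    """Bytes written if the entire tree were rewritten on every update."""
    return entries * ENTRY_BYTES


def one_bucket_bytes(capacity=BUCKET_CAPACITY):
    """Upper bound on bytes written if only one full bucket is rewritten."""
    return capacity * ENTRY_BYTES


frequent_words_in_doc = 5   # frequent words in the newly added document
entries_per_tree = 3000     # 30% of 10,000 documents

whole = frequent_words_in_doc * whole_tree_bytes(entries_per_tree)
bucketed = frequent_words_in_doc * one_bucket_bytes()

print("whole trees rewritten:", whole, "bytes")
print("one bucket per tree:  ", bucketed, "bytes")
```

Whatever the real bucket size turns out to be, the per-update cost should
scale with the bucket, not with the number of documents containing the
word.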

> This is why I'm working on some kind of "lazy cataloging".
> My approach is to use a Python class (or base class, if ZClasses are
> involved), which has a method manage_afterAdd. This method looks for
> superValues of a type like "lazyCatalog" (derived from ZCatalog), and
> inserts self.getPhysicalPath() into the update list of each found
> "lazyCatalog".
> 
> Later, a "lazyCatalog" can index all objects in this list. Then the
> bloating happens either in RAM (without subtransactions), or in a
> temporary file, if you use subtransactions.
> 
> OK, another approach which might fit your (Giovanni's) needs better is
> to use a database other than ZODB, but I'm afraid that even then
> "instant indexing" will be an expensive process, if you have a large
> number of documents.

Another option is to use a session manager, and update the catalog at
session-end.
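
Both the "lazy catalog" idea and the session-end update boil down to
queueing object paths and indexing them later in one batch. A minimal
sketch of that pattern in plain Python follows; DeferredIndexQueue and
index_func are hypothetical names, not Zope or ZCatalog APIs, and the
paths are tuples like the ones getPhysicalPath() returns:

```python
class DeferredIndexQueue:
    """Collect object paths now, index them later in a single batch."""

    def __init__(self, index_func):
        self._index_func = index_func  # called once per path on flush()
        self._pending = []             # paths in first-seen order
        self._seen = set()             # avoids queueing a path twice

    def enqueue(self, path):
        """Remember a path for later indexing; duplicates are ignored."""
        path = tuple(path)
        if path not in self._seen:
            self._seen.add(path)
            self._pending.append(path)

    def flush(self):
        """Index every pending path and return how many were processed."""
        batch, self._pending = self._pending, []
        self._seen.clear()
        for path in batch:
            self._index_func(path)
        return len(batch)


# Usage: a list's append method stands in for the real indexing call.
indexed = []
queue = DeferredIndexQueue(indexed.append)
queue.enqueue(("", "docs", "a"))
queue.enqueue(("", "docs", "b"))
queue.enqueue(("", "docs", "a"))  # already queued, ignored
flushed = queue.flush()
```

The flush could be triggered from a scheduled job or at session end; the
queue itself doesn't care which.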

- C