[Zope] Re: [Zope-dev] Re: Zcatalog bloat problem (berkeleydb is a solution?)

Tue, 26 Jun 2001 14:52:17 +0200

Hi Giovanni, Chris and all others,

Chris McDonough wrote:
> 
> Hi Giovanni,
> 
> How many indexes do you have, what are the index types, and what do they
> index?  Likewise, what about metadata?  In your last message, you said
> there's about 20.  That's a heck of a lot of indexes.  Do you need them
> all?
> 
> I can see a potential reason for the problem you explain as "and I
> remind you that as the folder get populated, the size that is added to
> each transaction grows, a folder with one hundred objects adds some
> 100K"... It's true that "normal" folders (most ObjectManager-derived
> containers actually) cause database bloat within undoing storages when
> an object is added or removed from it.  This is because it keeps a list
> of contained subobject names in an "_objects" attribute, which is a
> tuple.  When an object is added, the tuple is rewritten in entirety.  So
> for instance, if you've got 100 items in your folder, and you add one
> more, you rewrite all the instance data for the folder itself, which
> includes the (large) _objects tuple (and of course, any other raw
> attributes, like properties).  Over time, this can be problematic.
> 
> Shane's BTreeFolder Product attempts to ameliorate this problem a bit by
> keeping the data that is normally stored in the _objects tuple in its
> own persistent object (a btree).
> 
> Are you breaking the content up into subfolders?  This is recommended.
> 
> I'm temped to postulate that perhaps your problem isn't as much ZCatalog
> as it is ObjectManager overhead.

Well, I'm not very familiar with the details of the sub-object
management of ObjectManager and friends. Moreover, so far I have had a
closer look only at UnTextIndex, not at UnIndex or UnKeywordIndex. So
take my comments with a grain of salt.

A text index (class SearchIndex.UnTextIndex) is definitely a cause of
bloating if you use CatalogAware objects. An UnTextIndex maintains, for
each word, a list of the documents in which this word appears. So, if a
document to be indexed contains, say, 100 different words, 100 IIBTrees
(containing mappings documentId -> word score) will be updated (see
UnTextIndex.insertForwardIndexEntry). If you have a larger number of
documents, these mappings may be quite large: assume 10,000 documents,
and assume that you have 10 words which each appear in 30% of all
documents. Hence, each of the IIBTrees for these words contains 3000
entries. (OK, one can try to keep the number of such frequent words low
by using a "good" stop word list, but at least for German, such a list
is quite difficult to build. And one can argue that many "not really too
frequent" words should be indexed in order to allow more precise phrase
searches.) I don't know the details of how data is stored inside the
BTrees, so I can give only a rough estimate of the memory requirements:
with 32 bit integers, we have at least 8 bytes per IIBTree entry
(documentId and score), so each of the 10 BTrees for the "frequent
words" has a minimum size of 3000*8 = 24000 bytes.

If you now add a new document containing 5 of these frequent words, 5
of the larger BTrees will be updated. [Chris, let me know if I'm
talking nonsense here...] I assume that each updated BTree will be
appended to the ZODB in its entirety, i.e. 5 * 24000 = 120000 bytes
(ignoring the less frequent words) -- even if the document contains
only 1 kB of text.
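
The estimate above can be written down as a few lines of Python. This is
only the arithmetic from the text; the constant and function names are
made up for illustration, and the assumption that a touched mapping is
rewritten as a whole is exactly the unverified assumption stated above,
not a statement about how BTrees are really stored.

```python
# Back-of-the-envelope estimate; every figure is an assumption from the text.
BYTES_PER_ENTRY = 8            # 32 bit documentId + 32 bit score, no overhead
N_DOCUMENTS = 10000
FREQUENT_WORD_SHARE = 0.30     # a "frequent" word appears in 30% of documents
FREQUENT_WORDS_IN_NEW_DOC = 5


def mapping_bytes(n_documents, share, bytes_per_entry=BYTES_PER_ENTRY):
    """Minimum size of one word's documentId -> score mapping."""
    entries = int(round(n_documents * share))
    return entries * bytes_per_entry


def bytes_appended(words_touched, n_documents, share):
    """Worst case: every touched mapping is written out again as a whole."""
    return words_touched * mapping_bytes(n_documents, share)


per_word = mapping_bytes(N_DOCUMENTS, FREQUENT_WORD_SHARE)
appended = bytes_appended(FREQUENT_WORDS_IN_NEW_DOC,
                          N_DOCUMENTS, FREQUENT_WORD_SHARE)
print(per_word)    # 24000
print(appended)    # 120000
```

Changing N_DOCUMENTS shows that, under this assumption, the cost per
added document grows linearly with the size of the catalog.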

This is the reason why I'm working on some kind of "lazy cataloging".
My approach is to use a Python class (or base class, if ZClasses are
involved) which has a method manage_afterAdd. This method looks for
superValues of a type like "lazyCatalog" (derived from ZCatalog) and
inserts self.getPhysicalPath() into the update list of each
"lazyCatalog" found.

Later, a "lazyCatalog" can index all objects in this list. The bloating
then happens either in RAM (without subtransactions) or in a temporary
file, if you use subtransactions.
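
A minimal sketch of this idea in plain Python, without any Zope imports,
might look as follows. All class and method names except
manage_afterAdd and getPhysicalPath are hypothetical, the Folder class
only stands in for a real container, and the list of catalogs on the
container replaces the superValues lookup. It shows the queue-then-flush
pattern, not the real implementation.

```python
class LazyCatalog:
    """Stand-in for a ZCatalog subclass that defers indexing (hypothetical)."""

    def __init__(self):
        self.pending = []   # physical paths waiting to be indexed
        self.indexed = {}   # path -> text; stands in for the real indexes

    def queue(self, path):
        # Only the update list changes here; no per-word mapping is touched.
        if path not in self.pending:
            self.pending.append(path)

    def flush(self, resolve):
        # The expensive indexing runs once per batch, not once per object.
        count = 0
        for path in self.pending:
            self.indexed[path] = resolve(path).text
            count += 1
        self.pending = []
        return count


class LazyCatalogAware:
    """Mixin whose manage_afterAdd queues the object instead of indexing it."""

    def manage_afterAdd(self, item, container):
        # Assumption: the container knows its catalogs. In Zope they would
        # be looked up with superValues instead.
        for catalog in container.catalogs:
            catalog.queue(self.getPhysicalPath())


class Document(LazyCatalogAware):
    def __init__(self, doc_id, text):
        self.id = doc_id
        self.text = text
        self.path = None

    def getPhysicalPath(self):
        return self.path


class Folder:
    """Toy container that calls the manage_afterAdd hook on every add."""

    def __init__(self, path, catalogs):
        self.path = path
        self.catalogs = catalogs
        self.objects = {}

    def add(self, obj):
        obj.path = self.path + (obj.id,)
        self.objects[obj.id] = obj
        obj.manage_afterAdd(obj, self)


catalog = LazyCatalog()
folder = Folder(("", "docs"), [catalog])
folder.add(Document("a", "erste Seite"))
folder.add(Document("b", "zweite Seite"))
queued = len(catalog.pending)

by_path = dict((obj.path, obj) for obj in folder.objects.values())
n_indexed = catalog.flush(by_path.__getitem__)
```

Adding an object only appends one path to the update list; the index
structures are touched when flush is called, for the whole batch at once.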

OK, another approach which might better fit your (Giovanni's) needs
would be to use a database other than ZODB, but I'm afraid that even
then "instant indexing" will be an expensive process if you have a
large number of documents.

Abel