[Zope-dev] ZCatalog, import errors, and indexing errors

Michel Pelletier michel@digicool.com
Mon, 06 Mar 2000 13:17:26 -0800


"R. David Murray" wrote:
> 
> Well, I may have been foolhardy.  Based on various comments on this list
> I did not think I would have any trouble importing a 60,000 record
> database into Zope and ZCataloging it.  Having the records in Zope
> as objects made a lot more sense for this project than putting them
> into a backend RDBMS.
> 
> The first problem I ran into was during the import.  I wrote an
> external method that read the data from a tab delimited file
> and used the data to build ZClass instances.  I tried to create
> them all in one folder in one transaction.  This failed miserably
> at somewhere around 1500 records.  I got an error about an
> 'frexp' call being out of range, somewhere in ZCatalog's BTree
> methods (my apologies for not having captured the exact error; maybe
> I can reproduce it once I've finished building my new test system).

I've never heard of frexp; perhaps it's a C call.  In any case, BTrees
should scale up to 1500 records easily.  If you can reproduce it, I'd
like to see it.
 
> So I tried loading the records in batches, and at first that seemed
> to work.  Then I got an out of memory error.  After noticing that
> trying to view that directory in the management interface threw both
> my browser and Zope into fits, I decided to try loading the records
> into multiple folders. 

The current implementation of ObjectManager is not designed to scale
above the order of hundreds of objects.  It is based on a Python
dictionary, and the entire dictionary must be loaded into memory before
you can work with the data structure.  This is why you got the memory
error.  To go beyond that, you would need an ObjectManager based on a
more scalable data structure, like a BTree.  A BTree does not need to be
entirely loaded into memory to work with it.
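
To illustrate the difference in access pattern, here is a toy sketch in
plain Python.  It is not Zope's BTree implementation; the class name
BucketedMapping, the bucket_size parameter and the touched set are all
made up for this example.  Keys live in many small sorted buckets, and a
lookup only has to touch the one bucket whose key range covers the key,
where a single dictionary would have to be loaded whole.

```python
import bisect


class BucketedMapping:
    """Toy container that splits its keys into small sorted buckets.

    Hypothetical illustration only; this is not Zope's BTree.
    """

    def __init__(self, bucket_size=4):
        self.bucket_size = bucket_size
        self.first_keys = []   # smallest key of each bucket, kept sorted
        self.buckets = []      # each bucket is a small dict
        self.touched = set()   # bucket positions that lookups had to read

    def _position(self, key):
        # Find the bucket whose key range would contain `key`.
        pos = bisect.bisect_right(self.first_keys, key) - 1
        return max(pos, 0)

    def insert(self, key, value):
        if not self.buckets:
            self.first_keys.append(key)
            self.buckets.append({})
        pos = self._position(key)
        bucket = self.buckets[pos]
        bucket[key] = value
        self.first_keys[pos] = min(self.first_keys[pos], key)
        if len(bucket) > self.bucket_size:
            # Split an over-full bucket: the upper half moves to a new one.
            keys = sorted(bucket)
            upper = keys[len(keys) // 2:]
            self.buckets.insert(pos + 1, {k: bucket.pop(k) for k in upper})
            self.first_keys.insert(pos + 1, upper[0])

    def get(self, key):
        pos = self._position(key)
        self.touched.add(pos)  # only this one bucket is read
        return self.buckets[pos].get(key)
```

In a persistent store each bucket could be saved and loaded as a
separate object, which is the property a large folder would need.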

> This also seemed to work.  I used 1000
> record batches.  But occasionally I would get the frexp error.  If
> I tried the load several times, it would eventually complete without
> error.  Loading other batches in between tries seemed to help, but
> that may be an illusion.  My load method ended up with the occasional
> short file, and I got this frexp error with batches as short as
> 300 records.  So I think the error has something to do with the
> catalog machinery and not the batch size.
> 
> So I finally got the database loaded, and everything seemed to be
> working.  However, we have just discovered that certain keywords
> do not appear to be yielding the expected results on searches.
> The ZClass is catalog aware, and there are a few fields being indexed.
> The one of concern is just called 'keywords'.  Despite its name it
> is a text index, so that we can take advantage of the ability to
> do 'AND'ed searches.

Text indexes split, stem and stop (remove) words to conserve space. 
This behavior is probably throwing you off.  If you want to index purely
'split' atomic values (a concept that is still very language-specific),
then you have three options.  You can use field indexes and do your own
intersections (ANDs).  You can wait for the next major release of Zope,
which will let you subclass your own kind of Vocabulary object to
control the splitting, stemming, stopword and synonym behavior of your
catalogs.  Or you can check out a CVS version and begin playing with it
now and helping me debug these new features.  The more people help, the
faster they will become stable.
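
As a rough illustration of how stop words interact with ANDed searches,
here is a small self-contained sketch in plain Python.  It is not the
ZCatalog code: the STOP_WORDS list and the function names are invented
for the example, and it leaves out stemming and synonyms entirely.

```python
# Hypothetical stop list for the example; the real one is different.
STOP_WORDS = {"well", "the", "and", "of"}


def split_words(text):
    """Lower-case, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def build_index(records):
    """Map each surviving word to the set of record ids containing it."""
    index = {}
    for rid, text in records.items():
        for word in split_words(text):
            index.setdefault(word, set()).add(rid)
    return index


def search_all(index, query):
    """AND search: intersect the id sets of every query word."""
    words = split_words(query)
    if not words:
        return set()  # the query consisted only of stop words
    result = set(index.get(words[0], ()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

With a stop list like this one, a record whose keywords are 'well fire'
is never indexed under 'well', so a search for 'well' alone finds
nothing while a search for 'fire' still finds the record.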

> One more piece of info that may or may not be important.  There are
> actually two ZClasses: the one holding the database records, and another
> class with a property of the same name (keywords).  Instances of this
> second class get added by hand.
> 
> Now we find that when we enter certain keywords (the examples we have
> found so far are 'well' and 'fire', which you probably don't need
> to know)

'well' is a stop word and is removed by the splitter.  'fire' is not, so
it should be found.

> on one of these by-hand ZClass instances, they are *not* found
> by a ZCatalog search on the keywords field index.  Other words
> entered into the keywords field do cause the record to be found
> ('wells', for example).

> Now, I'm guessing no one is going to have an answer for me.  What I'm
> hoping for is some tips for how to go about debugging this.  Actually,
> what I'm really hoping is that someone from DC will view this as
> an important enough bug that they'll ask for a login on my system
> to check it out through <grin>. 

I'd like to research it, but certainly don't have the time for that.

> I have a gut level feeling that
> the frexp error and the index failure are related, so I also may
> try a total reload of the database, if someone can answer question
> (2) below.
> 
> To summarize, this experience has raised several questions in my
> mind:
> 
> 1) what is the practical limit on the number of entries in a folder,
>         and is there some way to get around this for instances where
>         you want to use the ZODB as a database for a large number
>         of records?

You need a more scalable kind of ObjectManager, like one based on a
BTree.

> 2) is there a practical limit on the number of changes that can be
>         part of a transaction (eg: *should* I be able to add
>         60K objects in one transaction?)

Yes, all changes in a transaction are kept in virtual memory.  This is
your practical limit.

Otherwise, you can use subtransactions (which the Catalog already uses);
then your practical limit is your amount of virtual memory + your amount
of temp space.
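
A minimal sketch of a batched load, in plain Python so that it runs
outside Zope.  The names read_tab_delimited, load_in_batches, add_record
and commit are all invented for the example; add_record stands in for
whatever creates one ZClass instance, and commit for whatever commits
the work so far.  As I understand it, a subtransaction commit in current
Zope is spelled get_transaction().commit(1), but treat that as an
assumption to check against your version.

```python
import csv


def read_tab_delimited(lines):
    """Yield each tab-delimited line as a list of field strings."""
    for row in csv.reader(lines, delimiter="\t"):
        yield row


def load_in_batches(rows, add_record, commit, batch_size=1000):
    """Call add_record for each row, committing after every full batch.

    add_record and commit are caller-supplied callables (hypothetical
    names); nothing here depends on Zope itself.
    """
    count = 0
    for row in rows:
        add_record(row)
        count += 1
        if count % batch_size == 0:
            commit()
    if count % batch_size != 0:
        commit()  # flush the final partial batch
    return count
```

Keeping the commit step as a parameter makes it easy to try full
commits and subtransaction commits with the same loading loop.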

> 3) what is the best way to do a massive data load?

That depends on where and how you load it; I think you've discovered all
the tricks.

> 4) how does one go about debugging ZCatalog?  (I've read the debugger
>         doc posted here, and I'll see if that is enough to allow me to
>         get started on this)

Yes, it should help.  Try setting a breakpoint right at the
catalog_object method and step through that procedure.  It's a bit
complex but you'll get the idea after a few hundred iterations. ;)

-Michel