[ZODB-Dev] repoze.catalog.query very slow

Thu Jan 3 22:25:32 UTC 2013

Dear gods, I hope you get an answer to this question, as I've noticed the same thing with very large indexes (using zc.catalog). I believe that at the root layers repoze.catalog is built around the same concepts and structures.

When I tried to trace down the problems with a profiler, they all revolved around loading the relevant portions of the indexes into memory. It had nothing to do with the final results of the query; had nothing to do with waking up the 'result' objects; all of the slowness seemed to be in loading the indexes themselves into memory. In one case, we were only using one index (a SetIndex), with about 150000 document ids.

This is from my own profiling. All I know is that this is very slow, and then very fast, and the object cache and the relevant indexes ability to keep all of their little BTrees or Buckets or Sets or whatever in that object cache seem to have a tremendous impact on query and response time - far more than is taken up by then waking up the content objects in your result set.

When the indexes aren't in memory, in my case I found the slowness to be in BTrees's 'multiunion' function; the real slowness was in calling ZODB's setstate (which is loading into memory). This is just BTree (catalog index) data being loaded at this point:

Profiling a fresh site (no object cache / memory population yet)
================================================================
winterhome-firstload.profile% callees multiunion
Function                      called...
                                  ncalls  tottime  cumtime
{BTrees._IFBTree.multiunion}  ->   65980    0.132   57.891  Connection.py:848(setstate)

winterhome-firstload.profile% callers multiunion
Function                      was called by...
                                  ncalls  tottime  cumtime
{BTrees._IFBTree.multiunion}  <-      27    0.348   58.239  index.py:203(apply)

(yep, 58 seconds; very slow ZEO network load in a demostorage setup where ZEO cannot update its client cache, which makes these setstate problems very exaggerated). 'multiunion' is called 27 times, but one of those times takes 58 seconds).

Profiling the same page again with everything all loaded
========================================================
winterhome-secondload.profile% callees multiunion
Function                      called...
                                  ncalls  tottime  cumtime
{BTrees._IFBTree.multiunion}  ->

winterhome-secondload.profile% callers multiunion
Function                      was called by...
                                  ncalls  tottime  cumtime
{BTrees._IFBTree.multiunion}  <-      27    0.193    0.193  index.py:203(apply)

(this time, multiunion doesn't require any ZODB loads, and its 27 calls internal time and cumulative time are relatively speedy)

If there's a good strategy for getting and keeping these things in memory, I'd love to know it; but when the catalog indexes are competing with all of the content objects that make up a site, it's hard to know what to do or even how to configure the object cache counts well without running into serious memory problems.

On Jan 3, 2013, at 2:50 PM, Claudiu Saftoiu <csaftoiu at gmail.com> wrote:

> Hello all,
> 
> Am I doing something wrong with my queries, or is repoze.catalog.query very slow?
> 
> I have a `Catalog` with ~320,000 objects and 17 `CatalogFieldIndex`es. All the objects are indexed and up to date. This is the query I ran (field names renamed):
> 
>     And(InRange('float_field', 0.01, 0.04),
>         InRange('datetime_field', seven_days_ago, today),
>         Eq('str1', str1),
>         Eq('str2', str2),
>         Eq('str3', str3),
>         Eq('str4', str4))
> 
> It returned 15 results so it's not a large result set by any means. The strings are like labels - there are <20 things any one of the string fields can be.
> 
> This query took a few minutes to run the first time. Re-running it again in the same session took <1 second each time. When I restarted the session it took only 30 seconds, and again 1 second each subsequent time.
> 
> What makes it run so slow? Is it that the catalog isn't fully in memory? If so, is there any way I can guarantee the catalog will be in memory given that my entire database doesn't fit in memory all at once?
> 
> Thanks,
> - Claudiu
> _______________________________________________
> For more information about ZODB, see http://zodb.org/
> 
> ZODB-Dev mailing list  -  ZODB-Dev at zope.org
> https://mail.zope.org/mailman/listinfo/zodb-dev

Thanks,
Jeff Shell
jeff at bottlerocket.net