[Zope] ZCatalog performance issues - cataloging objects takes ages

Casey Duncan casey@zope.com
Mon, 31 Mar 2003 10:28:53 -0500


On Monday 31 March 2003 05:47 am, Wankyu Choi wrote:
> Dear All,
>
> May I have your expertise on this? ;-)
>
> As much as I'm new to Zope/Python, ZCatalog (Catalog) internals vex me
> even more.
>
> I have a message board product called NeoBoard, as some of you might know.
> Recently I rewrote its core to have a built-in catalog for indexing
> articles and displaying them automatically sorted on thread keys. It
> showed quite a boost in performance. Previous versions without the
> built-in catalog used to ramrod all article objects into/out of memory
> whenever they needed to display them. What a waste of memory and CPU
> power, as Toby Dickenson suggested.
>
> Here's what I did to solve this problem:
>
> - Rewrote the parent class of NeoBoard/NeoBoardArticle ( the article
> container/article objects themselves ), NeoPortalElementContainer, to
> inherit from ZCatalog. Basically NeoPortalElementContainer automatically
> natural-sorts/numbers objects (elements) when they're added to the
> container: page_1, page_2, ... etc.

Subclassing ZCatalog can be a maintenance headache. I did it for
DocumentLibrary and regretted it.
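The alternative I'd reach for now is composition: keep a plain ZCatalog
inside the container and delegate to it. Roughly like this (untested, and
the class and id names are made up for illustration):

from OFS.Folder import Folder
from Products.ZCatalog.ZCatalog import ZCatalog

class NeoBoardContainer(Folder):
    # stand-in for NeoPortalElementContainer

    def _initCatalog(self):
        # a private catalog the container owns, rather than is
        self._setObject('internal_catalog', ZCatalog('internal_catalog'))

    def getInternalCatalog(self):
        return self._getOb('internal_catalog')

    def searchArticles(self, **query):
        # expose only the delegation you need, not the whole ZCatalog API
        return self.getInternalCatalog().searchResults(**query)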

> - NeoBoardArticle looks toward NeoBoard when the catalog methods defined
> in NeoPortalElementContainer are called. So NeoBoard's catalog methods
> are always used no matter where you are in the path hierarchy.
> - When you call a NeoBoard instance, it calls ZCatalog's searchResults(),
> which returns brain objects. A threaded (expanded) view requires one
> further step: NeoBoard sorts a pageful of threads and their replies
> before returning them; it doesn't care about the other threads that are
> not displayed in the current request.
>=20
> Performance? Not as fast as the SQL-backed PHP version ( displaying a
> pageful of threads takes only a fraction of a second ), but not bad.

Is this Zope 2.6.1? What do the queries look like?

> Okay, I partially solved one problem ( wasting memory/horsepower, etc. -
> I'm still not satisfied with the performance, though ) but created
> another set of problems while doing so. I could display 5,000 threads
> ( about 20,000 article objects including all replies to the threads ) in
> less than a second ( it takes a bit more when you load the board for the
> first time ). The problems are...

I would be interested in using this data as a benchmark for improvements
in 2.7...

> - It takes ages when cataloging even a small number of articles. 18
> seconds for cataloging 50 or so article objects with so little to index?
> Is it normal? Can't imagine recataloging 20,000 objects. For example, if
> you move a thread from one NeoBoard instance to another, you have to
> uncatalog the thread, including all its replies, in NeoBoard A and
> catalog them in NeoBoard B: cataloging a single article object takes
> more than 1 second. Don't think it's normal... Or is it?

Profiling may be necessary to pin this down. Likely culprits are the
TextIndexes, but it's hard to say. Are you sure you are doing a minimum of
work (i.e., only indexing each message once)?
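Something like this (an untested sketch; run it from an external method or
a debug session, handing it your catalog and a handful of article objects)
will show where the time goes:

import profile, pstats

def profile_cataloging(zcatalog, articles):
    # catalog a batch of articles under the profiler and print the
    # 20 most expensive calls by cumulative time
    def run():
        for article in articles:
            path = '/'.join(article.getPhysicalPath())
            zcatalog.catalog_object(article, path)
    p = profile.Profile()
    p.runcall(run)
    stats = pstats.Stats(p)
    stats.sort_stats('cumulative').print_stats(20)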

> - When I attempt to uncatalog an object that's not been cataloged, Zope
> spews out errors in the log. Can I suppress the errors in code? In my
> application they are meaningless.

These errors are harmless. It might be better to check if they are cataloged
first before uncataloging them.
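One way to do that check (a rough sketch; it assumes the standard Catalog
internals, where _catalog.uids maps physical paths to record ids):

def safe_uncatalog(zcatalog, obj):
    # only uncatalog objects the catalog actually knows about
    path = '/'.join(obj.getPhysicalPath())
    if zcatalog._catalog.uids.get(path) is not None:
        zcatalog.uncatalog_object(path)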

> - Catalogs sometimes do get corrupted, so recataloging is required from
> time to time. Is that also normal? All of my article objects are
> catalog-aware and they catalog/uncatalog/recatalog themselves when
> getting added, deleted, or modified, using manage_afterAdd(),
> manage_beforeDelete() and a CMF'ish _edit() method. When a missing
> article (a ghost catalog entry) causes a KeyError, NeoBoard attempts to
> refresh the catalog: well, that takes too much time. But manually
> recreating its catalog is not an alternative. Any ideas why this would
> happen? Any tips on maintaining catalog integrity?

Although there have historically been BTree bugs that can cause KeyErrors,
they have slowly been stamped out. It would be helpful to find a test case
that causes these KeyErrors. Do they happen at search time?
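If the KeyErrors come from stale (ghost) entries rather than BTree bugs,
something along these lines can hunt them down without a full refresh. It
is a rough sketch and pokes at Catalog internals (_catalog.paths maps
record ids to paths), so treat it as a starting point:

def find_ghost_entries(zcatalog):
    # return (rid, path) pairs whose objects can no longer be resolved
    ghosts = []
    for rid, path in zcatalog._catalog.paths.items():
        obj = zcatalog.unrestrictedTraverse(path, None)
        if obj is None:
            ghosts.append((rid, path))
    return ghosts

The offending paths can then be passed to uncatalog_object() individually,
which is much cheaper than recataloging everything.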

> - Here're the indexes NeoBoard uses:
>
>
>     security.declarePublic( 'enumerateIndexes' )
>     def enumerateIndexes( self ):
>         """
>             Return a list of ( index_name, type ) pairs for the initial
>             index set.
>         """
>         return ( ('Title', 'TextIndex')
>                , ('meta_type', 'FieldIndex')
>                , ('getSortKey', 'FieldIndex')
>                , ('getThreadSortKey', 'FieldIndex')
>                , ('isThreadParent', 'FieldIndex')
>                , ('creation_date', 'FieldIndex')
>                , ('Creator', 'FieldIndex')
>                , ('CreatorEmail', 'FieldIndex')
>                , ('getArticleCategory', 'FieldIndex')
>                , ('getNeoPortalContentSearchText', 'TextIndex')
>                , ('getInlineCommentsSearchText', 'TextIndex')
>                , ('getInlineCommentCreators', 'TextIndex')
>                , ('getAttachmentsSearchText', 'TextIndex')
>                , ('getNeoPortalReadCount', 'FieldIndex')
>                , ('getNeoPortalNumContentRatings', 'FieldIndex')
>                , ('getNeoPortalElementNumber', 'FieldIndex')
>                , ('isTempNeoBoardArticle', 'FieldIndex')
>                )

I'm concerned that the CommentsSearchText and AttachmentsSearchText indexes
are arbitrarily expensive. Maybe as a test try removing one index at a time
to see if any one of them is causing a noticeable performance decrease.
Start with the TextIndexes.
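Or, instead of removing them one at a time, time them individually. A rough
sketch (it reaches into _catalog.indexes and uses the pluggable-index API,
so run it against a test copy of the board rather than the live one; docid
0 is just an arbitrary scratch id):

import time

def time_indexes(zcatalog, article):
    # index one sample article into each index separately and report
    # how long each one takes, slowest first
    timings = []
    for name, index in zcatalog._catalog.indexes.items():
        start = time.time()
        index.index_object(0, article)
        timings.append((time.time() - start, name))
        index.unindex_object(0)   # clean up the scratch entry
    timings.sort()
    timings.reverse()
    for elapsed, name in timings:
        print '%-35s %.3f seconds' % (name, elapsed)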

> I came to know that 'TextIndex' is deprecated. I have yet to try
> ZCTextIndex or TextIndexNG ( the latter seems like overkill ). I found
> 'TopicIndex' very interesting. Would they make much difference?
> Especially, I was surprised to find that the simple 'Title' index takes
> almost one full second when applied to an object: the getIndex( name )
> call alone in Catalog.py takes this much. So I suspect it's not about
> Catalog, but that I'm doing something very stupid in setting up this
> built-in catalog.

That delay may be exposing an index bug. getIndex just does a single
dictionary lookup and wraps the result, so I'm not sure why it should take
a long time, unless the TextIndex object is taking a *long* time to load
from the database. But its main ZODB record should not be very big.

I would definitely try ZCTextIndex, just because its searching works so
much better.
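The switch looks roughly like this (a sketch from memory, not tested
against your product; the lexicon id is made up, and the existing articles
still have to be recataloged afterwards so the new index gets populated):

from Products.ZCTextIndex.ZCTextIndex import PLexicon
from Products.ZCTextIndex.Lexicon import Splitter, CaseNormalizer, \
     StopWordRemover

def convert_to_zctextindex(zcatalog, index_name):
    # one shared lexicon for all ZCTextIndexes in this catalog
    if 'neoboard_lexicon' not in zcatalog.objectIds():
        lexicon = PLexicon('neoboard_lexicon', '', Splitter(),
                           CaseNormalizer(), StopWordRemover())
        zcatalog._setObject('neoboard_lexicon', lexicon)

    # ZCTextIndex reads its settings from an "extra" record
    class Extra:
        pass
    extra = Extra()
    extra.doc_attr = index_name          # attribute/method to index
    extra.lexicon_id = 'neoboard_lexicon'
    extra.index_type = 'Okapi BM25 Rank'

    zcatalog.delIndex(index_name)        # drop the old TextIndex
    zcatalog.addIndex(index_name, 'ZCTextIndex', extra)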

> ONE FINAL QUESTION: I strongly suspect I wouldn't be able to get any
> faster using ZCatalog. At least not as fast as using an RDBMS. I'm
> thinking... "Not fast enough, not flexible enough since I can't perform
> sophisticated queries on ZCatalog and stuff... why not revert to MySQL?"
> Got any thoughts on this? How does ZCatalog compare to a reasonably fast
> RDBMS?

One general suggestion: What is your ZODB cache set to? The default of 400
is *way* too small for heavy ZCatalog use. I would try upping it to 2000,
maybe higher (depending on RAM). Use the activity monitor to see how much
reading happens when you query and index. Upping the cache size can
dramatically reduce reading from disk. Going from 400 to 2000 gave me
roughly a factor of 10 improvement in one test I had querying ZCTextIndex.
It also can dramatically help index time, since more of the lexicon and
index BTrees can remain in memory.
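The persistent setting lives in the ZMI under Control_Panel > Database
Management > Cache Parameters ("Target number of objects in memory"). From
a debug session you can also inspect and bump it for the running process; a
quick sketch, assuming app is the root application object:

# only affects the current process; use the Control Panel setting
# to make it stick across restarts
db = app._p_jar.db()
print db.getCacheSize()    # 400 by default
db.setCacheSize(2000)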

> NeoBoard (1.1) will be taken out of its beta phase when I solve this
> cataloging weirdness, and I might start working on 1.2 using MySQL or
> SAPDB as the backend. I hope somebody can persuade me out of this path...
> just the thought of having to rewrite the core to use SQL makes me
> shudder... arrrrrrgh...
>
> Any help, hints or comments would be much appreciated. I do need to move
> on with this project :-( It's been almost a year now... ouch. Weeks
> became months; months became a whole year... whew.

Yup, been there ;^)

-Casey