[Zope] ZCatalog performance issues - catalogging objects takes ages

Wankyu Choi wankyu@neoqst.com
Mon, 31 Mar 2003 19:47:09 +0900

Dear All,

May I have your expertise on this? ;-)

New as I am to Zope/Python, the ZCatalog (Catalog) internals vex me.

I have a message board product called NeoBoard, which some of you might
know. Recently I rewrote its core to have a built-in catalog for indexing
articles and displaying them automatically sorted on thread keys. It
showed quite a boost in performance. Previous versions without the
built-in catalog used to ramrod all article objects into and out of
memory whenever they needed them. What a waste of memory and CPU power,
as Toby Dickenson suggested.

Here's what I did to solve this problem:

- Rewrote the parent class of NeoBoard/NeoBoardArticle (the article
container and the article objects themselves), NeoPortalElementContainer,
to inherit from ZCatalog. Basically, NeoPortalElementContainer
automatically natural-sorts and numbers objects (elements) when they're
added to the container: page_1, page_2, ... etc.
- NeoBoardArticle looks toward NeoBoard when the catalog methods defined
in NeoPortalElementContainer are called. So NeoBoard's catalog methods
are always used no matter where you are in the path hierarchy.
- When you call a NeoBoard instance, it calls ZCatalog's search method,
which returns brain objects. A threaded (expanded) look does require a
further step: NeoBoard sorts a pageful of threads and their replies
before returning them; it doesn't care about the other threads that are
not displayed in the current request.
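
The page-at-a-time sorting described in the last item can be sketched
roughly as follows. This is only an illustration under assumptions, not
NeoBoard's actual code: plain dictionaries stand in for catalog brains,
the function name `page_of_threads` and the `page`/`page_size` parameters
are made up, and the two sort keys are assumed to be integers.

```python
def page_of_threads(records, page, page_size):
    """Return the records for one page of threads, sorted within each thread."""
    # Distinct thread keys, largest (assumed newest) first. Only the keys
    # are sorted here, not the article records themselves.
    thread_keys = sorted({r["getThreadSortKey"] for r in records}, reverse=True)
    start = page * page_size
    wanted = set(thread_keys[start:start + page_size])
    # Only records that belong to the visible threads are collected and sorted.
    visible = [r for r in records if r["getThreadSortKey"] in wanted]
    visible.sort(key=lambda r: (-r["getThreadSortKey"], r["getSortKey"]))
    return visible


# Made-up sample data: three threads (keys 1, 2, 3) with a few replies.
records = [
    {"id": "a1", "getThreadSortKey": 1, "getSortKey": 0},
    {"id": "a2", "getThreadSortKey": 1, "getSortKey": 1},
    {"id": "b1", "getThreadSortKey": 2, "getSortKey": 0},
    {"id": "c1", "getThreadSortKey": 3, "getSortKey": 0},
    {"id": "b2", "getThreadSortKey": 2, "getSortKey": 1},
]

first_page = [r["id"] for r in page_of_threads(records, page=0, page_size=2)]
second_page = [r["id"] for r in page_of_threads(records, page=1, page_size=2)]
```

Threads outside the requested page never reach the per-article sort,
which is the point of the approach: the cost of a request depends on the
page size rather than on the size of the whole board.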

Performance? Not as fast as the SQL-backed PHP version (displaying a
pageful of threads takes only a fraction of a second), but not bad.

Okay, I partially solved one problem (wasting memory/horsepower, etc. -
still not satisfied with the performance, though) but created another
set of problems while doing so. I could display 5,000 threads (about
[...] article objects, including all replies to the threads) in less
than a second (it takes a bit more when you load the board for the first
time). The problems are...

- It takes ages when cataloging even a small number of articles. 18 [...]
for cataloging 50 or so article objects with so little to index? Is it
normal? Can't imagine recataloging 20,000 objects. For example, if you
move a thread from one NeoBoard instance to another, you have to
uncatalog the thread including all its replies in NeoBoard A and catalog
them in NeoBoard B: cataloging a single article object takes more than 1
second. Don't think it's normal... Or is it?

- When I attempt to uncatalog an object that's not been cataloged, Zope
spews out errors in the log. Can I suppress these errors in code? In
[...] applications they are meaningless.

- Catalogs sometimes do get corrupted, so recataloging is required from
time to time. Is that also normal? All of my article objects are
catalog-aware: they catalog/uncatalog/recatalog themselves when getting
added, deleted, or modified, using manage_afterAdd(),
manage_beforeDelete() and a CMF'ish [...] method. When a missing article
(ghost catalog entry) causes a KeyError, NeoBoard attempts to refresh the
catalog: well, that takes too much time. But manually recreating its
catalog is not an alternative. Any ideas why this might happen? Any tips
on maintaining catalog integrity?

- Here're the indexes NeoBoard uses:

    security.declarePublic( 'enumerateIndexes' )
    def enumerateIndexes( self ):
        """
            Return a list of ( index_name, type ) pairs for the initial
            index set.
        """
        return ( ('Title', 'TextIndex')
               , ('meta_type', 'FieldIndex')
               , ('getSortKey', 'FieldIndex')
               , ('getThreadSortKey', 'FieldIndex')
               , ('isThreadParent', 'FieldIndex')
               , ('creation_date', 'FieldIndex')
               , ('Creator', 'FieldIndex')
               , ('CreatorEmail', 'FieldIndex')
               , ('getArticleCategory', 'FieldIndex')
               , ('getNeoPortalContentSearchText', 'TextIndex')
               , ('getInlineCommentsSearchText', 'TextIndex')
               , ('getInlineCommentCreators', 'TextIndex')
               , ('getAttachmentsSearchText', 'TextIndex')
               , ('getNeoPortalReadCount', 'FieldIndex')
               , ('getNeoPortalNumContentRatings', 'FieldIndex')
               , ('getNeoPortalElementNumber', 'FieldIndex')
               , ('isTempNeoBoardArticle', 'FieldIndex')
               )

I came to know that 'TextIndex' is deprecated. Have yet to try [...] or
TextIndexNG (the latter seems like overkill). Found 'TopicIndex'
interesting. Would they make much difference? In particular, I was
surprised to find that the simple 'Title' index takes almost one full
second when applied to an object: that getIndex( name ) call alone in
Catalog.py takes this much. So I suspect it's not about Catalog, but that
I'm doing something very stupid in setting up this built-in catalog.
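
One way to narrow down where the time goes is to time each index on its
own. The following is only a rough sketch of that idea under
assumptions: `time_each_index`, `toy_text_index` and `toy_field_index`
are hypothetical stand-ins written for this example, not ZCatalog
objects or calls, and the article values are made up.

```python
import time


def time_each_index(index_funcs, obj):
    """Time each indexing callable separately; return {name: seconds}."""
    timings = {}
    for name, index_func in index_funcs.items():
        start = time.perf_counter()
        index_func(obj)
        timings[name] = time.perf_counter() - start
    return timings


# Hypothetical stand-ins for real index objects.
def toy_text_index(obj):
    # Split the title into lowercase words, roughly what a text index stores.
    return obj["Title"].lower().split()


def toy_field_index(obj):
    # A field index stores the attribute value as-is.
    return obj["meta_type"]


# Made-up sample object.
article = {"Title": "ZCatalog performance issues", "meta_type": "Article"}

timings = time_each_index(
    {"Title": toy_text_index, "meta_type": toy_field_index}, article
)

# Print the slowest index first.
for name, seconds in sorted(timings.items(), key=lambda item: -item[1]):
    print(f"{name}: {seconds:.6f}s")
```

If one entry dominates the output, the problem is in that index or in
the attribute it reads from the object, rather than in the catalog as a
whole.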

ONE FINAL QUESTION: I strongly suspect I wouldn't be able to get any
faster using ZCatalog. At least not as fast as using an RDBMS. I'm
thinking... "Not fast enough, not flexible enough since I can't perform
sophisticated queries on ZCatalog and stuff... why not revert to MySQL?"
Got any thoughts on this? How does ZCatalog compare to a reasonably fast
RDBMS?

NeoBoard (1.1) will be taken out of its beta phase when I solve this
cataloging weirdness, and I might start working on 1.2 using MySQL or
[...] as the backend. Hope somebody can talk me out of this path... just
the thought of having to rewrite the core to use SQL makes me

Any help, hints or comments would be much appreciated. I do need to move
on with this project :-( It's been almost a year now...ouch. Weeks became
months; months became a whole year... whew.

Thanks in advance.

  Wankyu Choi
  NeoQuest Communications, Inc.
---------------------------------------------------------------