[Zope] ZCatalog performance issues - catalogging objects takes ages

Wankyu Choi wankyu@neoqst.com
Tue, 1 Apr 2003 02:02:29 +0900


So glad to catch your attention :-)

>Subclassing ZCatalog can be a maintenance headache. I did it for=20
DocumentLibrary and regretted it.

Can you expound on this? In fact, I did just the opposite. First, I =
tried
"self._np_catalog =3D NeoPortalCatalog()" where NeoPortalCatalog is a =
subclass
of ZCatalog. Thought better ( or worse should I say :-) of it, and =
inherited
directly from NeoPortalCatalog. Thought it would be easier in terms of
maintenance.

>> Performance? Not so fast as SQL-backed PHP version ( displaying a=20
>> pageful of threads takes only a fraction of a second ), but not bad.

> Is this Zope 2.6.1? What do the queries look like?

Zope 2.6.1 with Python 2.2. Heard Python 2.2 works just fine with Zope =
2.6.1
and has had no obvious problems yet.

Here's the code block that returns the query results and sorts a portion =
of
them if necessary ( a non-threaded look doesn't require this sort. )

security.declarePrivate( '_getArticles' )               =20
    def _getArticles( self
                    , top=3DNone
                    , expand_all=3D0
                    , skey=3DNone
                    , reverse=3D0
                    , default_reverse=3D0
                    , search_fields=3D''
                    , search_keywords=3D''
                    , b_start=3D0
                    , limit=3DNone ):
        """
            Return a complete list of brains objects from the Catalog
        """
                  =20
        if search_fields and search_keywords:

            index_map =3D {'title': 'Title'
                    , 'creator':'Creator'
                    , 'creator_email':'CreatorEmail'
                    , 'body':'getNeoPortalContentSearchText'
                    , 'category':'getArticleCategory'
                    , 'comments':'getInlineCommentsSearchText'
                    , 'comment_creators':'getInlineCommentCreators'
                    , 'attachments': 'getAttachmentsSearchText'}   =20
               =20
            new_indexes =3D []
            for index in search_fields.keys():
                new_indexes.append( index_map[index] )

            results =3D self.search( indexes=3Dnew_indexes,
keywords=3Dsearch_keywords.split( ' ' )  )

        else:
            # threads are automatically reverse-sorted on sort_keys

            reverse =3D int( reverse )
            result_limit =3D None
            b_start =3D int( b_start )
            if limit is not None:
                limit =3D int( limit )
                result_limit =3D b_start + limit

            if skey:
                if skey=3D=3D'np_read_count': skey =3D =
'getNeoPortalReadCount'
                elif skey=3D=3D'np_num_ratings': skey =3D
'getNeoPortalNumContentRatings'
                else: skey =3D 'creation_date'
                sort_order =3D ''               =20
                if reverse: sort_order =3D 'reverse'
                results =3D self.searchResults(
meta_type=3DNeoBoardArticle.meta_type, isTempNeoBoardArticle=3D0, =
sort_on=3Dskey,
sort_order=3Dsort_order, limit =3D result_limit )
               =20
            else:
                expand_all =3D int( expand_all )
                current_thread =3D None
                if expand_all:
                    results =3D self.searchResults(
meta_type=3DNeoBoardArticle.meta_type, isTempNeoBoardArticle=3D0,
sort_on=3D'getSortKey', limit =3D result_limit )       =20
                else:
                    results =3D self.searchResults(
meta_type=3DNeoBoardArticle.meta_type, isTempNeoBoardArticle=3D0,
isThreadParent=3D( not expand_all ), sort_on=3D'getSortKey', limit =3D
result_limit )                   =20
                    # also pull all the replies to the current article
                    if top is not None and top.meta_type =3D=3D
NeoBoardArticle.meta_type:
                        top =3D top.getThreadParent()
                        sort_key =3D top.getSortKey()
                        current_thread =3D self.searchResults(
meta_type=3DNeoBoardArticle.meta_type, isTempNeoBoardArticle=3D0,
getSortKey=3Dsort_key, isThreadParent=3D0, sort_on=3D'getThreadSortKey' =
)
                       =20
                if limit is not None:
               =20
                    # sort only the meaningful portion of the results
specified by 'limit'
   =20
                    first_half =3D []
                    second_half =3D []
                   =20
                    first_half =3D results[: b_start ]
                                   =20
                    middle =3D results[ b_start : ( b_start + limit ) ]
                   =20
                    if not expand_all and current_thread is not None:=20
                        middle[:] =3D middle[:] + current_thread[:]
                   =20
                    second_half =3D results[ b_start + limit : ]
                   =20
                    middle =3D self.sortArticleThreads( middle )
                               =20
                    results =3D first_half + middle + second_half
           =20
        notices =3D self.getNotices()
        if len( notices ) > 0: results =3D notices[:] + results[:]
        return results
           =20
    security.declarePublic( 'sortAritlceThreads' )               =20
    def sortArticleThreads( self, brains ):
        """
            Sort a list of brains=20
        """

        import operator=20

        temp_list =3D map( lambda x: ( getattr( x.getObject(), =
'_sort_key' ),
getattr( x.getObject(), '_thread_sort_key' ), x ), brains )
              =20
        temp_list.sort()
        brains[:] =3D map( operator.getitem, temp_list, ( -1, ) * len(
temp_list ) )
         =20
        return brains

- isTempNeoBoardArticle: NeoBoard does what CMF does. When a user posts =
an
article, it first creates a temp article and examines it. If =
unacceptable,
deletes it. The method tells if the article is a temporary one or not.
- isThreadParent: tells if the article is the top-most one in the =
thread.
- expand_all: Boolean value to tell if we need a threaded-look.
- getSortKey: returns thread sorting key. It's the inversed article =
number:
article 50's sort key  becomes -50 when added to the board for automatic
reverse-sorting. ( Tried creation_date once, but it turned out to be a
disaster when you do import/export.)
- result_limit: calculated on the basis of the current batch.

If you want to take a look at it in context, I've got a viewCVS set up =
here:
http://cvs.zoper.net:3333/cgi-bin/viewcvs.cgi/NeoBoard/NeoBoard.py

And you can see the board in action here:
http://www.zoper.net/Boards/qa/view

I learned while reading docs on ZCatalog that I'd get better results by
adding meta data to brain objects. Will remove that expensive sorting =
method
soon.

>> another set of problems while so doing.  I could display 5,000 =
threads=20
>> ( about 20,000 article obects incuding all replies to the threads) in =

>> less than a second ( it takes a bit more when you load the board for=20
>> the first time. ) The problems are...

> I would be interested in using this data as a benchmark for =
improvements
in=20
2.7...

Took me a whole day to generate these articles; had fun with them for =
about
a week and lost them last night when the board's catalog went crazy with
missing keys; I had to remove the board and the data went with it :-(

On a different note. Creating an article object doesn't require that =
much
computation power. Just a bunch of init values for its properties. But
instantiating an article in a for loop, for example, takes more than a
second and it gets worse as the loop goes on. Is it because ZODB's
transaction/version support? Normally, how long should it take to =
generate
20,000 not-so-heavy objects? Taking more than an hour seems not right =
with
enough horsepower. When creating those test data, I had to take a long =
nap
:-(
=20
>> cataloging a single article object takes more than 1 second. Don't =
think
>> it's normal... Or is it?

>Profiling may be necessary to pin this down. Likely culprets are
textindexes,=20
>but its hard to say. Are you sure you are doing a minimum of work =
(i.e.,
only=20
>indexing each message once)?

I used to re-render the contents of articles when copying/moving them to
another instance of NeoBoard, but removed the code since it took too =
long.
It doesn't do anything now except cataloging/uncatalogging moved/copied
articles. Excerpts from NeoBoardArticle's manage_afterAdd():

if item is self and item.isThreadParent():       =20
            item._setSortKey()
            item._setThreadSortKey()
            neoboard =3D item.getNeoBoard()
            neoboard.addToNeoPortalElementContainerCatalog( item )
           =20
            articles =3D [ obj for id, obj in item.ZopeFind( item,
obj_metatypes =3D [ NeoBoardArticle.meta_type, ], search_sub=3D1 ) ]

            for article in articles:
                article._setSortKey()
                article._setThreadSortKey()
                neoboard.addToNeoPortalElementContainerCatalog( article =
)

I had to rename the indexObject method to
'addToNeoPortalElementContainerCatalog()' to prevent confusion with CMF.
Nothing fancy there; an exact copy of ZCatalog's.

>> - When I attempt to uncatalog an object that's not been catalogged,=20
>> Zope spews out errors in the log. Can I supress the errors in code,=20
>> which, in my applications, are meaningless.

> These errors are harmless. It might be better to check if they are
cataloged=20
> first before uncataloging them.

Guess I was stupid in thinking I'd save some time by skipping the check =
;-)
=20
> Although there are have historically been BTree bugs that can cause
KeyErrors,=20
> they have slowly been stamped out. It would be helpful to find a test =
case

> that causes these key errors. Do these keyerrors happen at search =
time?

No, I meant a key error in 'mybrain.getObject()', that is, a ghost entry =
in
the Catalog without the corresponding object. Guess it happens after a
massive set of additions or deletions. Can't pinpoint a case. Fast =
reloads
sometimes do generate ZODB conflict errors. If you reload while =
reindexing
everything with heavy disk I/O, you usually get these ZODB conflicts. =
Maybe
I should do some work on conflict resolutions?

> I'm concerned that the CommentsSearchText and AttachmentsSearchText =
are=20
> arbitrarily expensive. Maybe as a test try removing one index at a =
time to

> see if any one is causing a noticable performance decrease. Start with =
the

> TextIndexes.

Actually not. My test article objects didn't have any inline comments, =
for
example. I tried a single index after removing all the other indexes,
getNeoPortalSearchText, which returns title + body text. No obvious
improvements in performance. The funny thing is the test articles didn't
have much body: just a couple of words ( 'test artice' ).

> That delay may be exposing an index bug. getIndex just does a single=20
> dictionary lookup and wraps it, so I'm not sure why this should take a
long=20
> time, unless the TextIndex object is taking a *long* time to load from =
the

> database. But its main ZODB record should not be very big.

I was wrong. I looked into the Catalog.py more closely and it was not =
the
getIndex() call that was taking too long, but each index was. For =
example,
'Title' is a textindex and 'creation_date' is a DateIndex. Why would a
TextIndex take that long to index an article with a simple title 'test'? =
The
DateIndex is also painfully slow.=20

> I would definitely Try ZCTextIndex, just because its searching works =
so
much better.

Will try :-)

> One general suggestion: What is your ZODB cache set to?=20

I'm running these tests both on my desktop Linux box and on a set of
enterprise servers.

Desktop: Pentium 4, RH Linux 8.0 with all the latest errata applied, =
512M
RAM, cache set to 8,000, FileStorage

Enterprise servers:=20

- ZEO Storage Server: dual Xeon P4, RH Linux 8.0 with all the latest =
errata
applied, 4G RAM, 430G SCSI HDDs with RAID 5, DirectoryStorage on =
ReiserFS
with noatime on
- ZEO Client: dual P3 tualatin, RH Linux 8.0 with all the latest errata
applied, 2G RAM with ZODB cache set to 20,000.

Both my desktop and the ZEO client show the same symptoms. The ZEO =
servers
render CMF/Plone + NeoBoard pages in an average of 0.3 ~ 0.5 second, so =
I
don't think there's any hardware/cache problems.

>> Any help, hints or comments would be much appreciated.  I do need to=20
>> move on with this project :-( It's been almost a year now...ouch.=20
>> Weeks became months; months became a whole year... whew.

> Yup, been there ;^)

Been there too many times with other tools. Just hoped this time would =
be
different with Zope :-)=20

Thanks for your help.

---------------------------------------------------------------
  Wankyu Choi
  CEO/President
  NeoQuest Communications, Inc.
  http://www.zoper.net
  http://www.neoboard.net
---------------------------------------------------------------  =20