[Zope-dev] 100k+ objects, or...Improving Performance of BTreeFolder...

Chris McDonough chrism@zope.com
Mon, 10 Dec 2001 10:21:23 -0500


Excellent analysis, many thanks Sean!  This is much-needed info for
people who are attempting to scale.

----- Original Message -----
From: <sean.upton@uniontrib.com>
To: <zope-dev@zope.org>
Sent: Sunday, December 09, 2001 10:36 PM
Subject: [Zope-dev] 100k+ objects, or...Improving Performance of
BTreeFolder...


> Interesting FYI for those looking to support lots of cataloged objects in
> ZODB and Zope (Chris W., et al)... I'm working on a project to put ~350k
> Cataloged objects (customer database) in a single BTreeFolder-derived
> container; these objects are 'proxy' objects which each expose a single
> record in a relational dataset, and allow about 8 fields to be indexed
> (2 of which are TextIndexes).
>
> Some informal stress tests using 100k+ _Cataloged_ objects in a BTreeFolder
> in Zope 2.3.3 on my PIII/500/256mb laptop are proving to be successful, but
> not without some stubborn investigation and a few caveats.
>
> BTreeFolder, using ObjectManager APIs, frankly, just won't scale for
> bulk-adds of objects to folders.  I was adding CatalogAware objects to my
> folder (including index_object()).  After waiting for bulk-add processes to
> finish after running for 2 days, I killed Zope and started trying to
> optimize, figuring that the problem was related to Catalog and my own RDB
> access code, and got nowhere (well, I tuned my app, but this didn't solve
> my problem).  I then went to #zope, got a few ideas, and ended up with the
> conclusion that my problem was not Catalog-related, but related to
> BTreeFolder.  I initially thought it was a problem with the C-based generic
> BTree implementation scaling well past 10k objects, but felt I couldn't
> point the finger at that before some more basic stuff was ruled out.
>
> The easiest thing to do in this case was to figure out what was heavily
> accessing the BTree via its dictionary-like interface, and the thought
> occurred to me that there might be multiple has_key checks, security stuff,
> and the like called by ObjectManager._setObject(), and I was right.  I
> figured a switch to use the simple BasicBTreeFolder._setOb() for my stress
> tests might reveal an increase in speed, and...
>
> ...it works, acceptably, no less, on my slow laptop for 100,000 objects.  It
> took ~50 minutes to do this on meager hardware with a 4200 RPM IDE disk, and
> I figure a bulk add process like this on fast, new hardware (i.e. something
> with upwards of 22k pystones and lots of RAM) with a dedicated server for my
> RDB would likely take 1/5th this time, or about 10 minutes (by increasing
> both MySQL performance and Zope performance); combine this with ZEO and
> have a dedicated node do this, and I think this is a small amount of proof
> of Zope's ability to scale to many objects.  (See my caveats at the bottom
> of this message, though.)
>
> After days of frustration, I'm actually impressed by what I found: my
> data-access APIs are very computationally expensive, since they establish a
> MySQLdb cursor object for each call and execute a query, and these
> data-access methods are still exercised while bulk-adding 100k objects,
> because each _setOb() is followed by Cataloging via index_object() (the
> transaction is done all in memory for now, but will likely move to
> subtransactions soon to support up to 4x that data).
>
> So far, the moral of the story: use _setOb(), not _setObject(), for this
> many objects!
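>
> In case it's useful, here is roughly what my bulk-add loop boils down to,
> run from an External Method (or a debugging session where 'app' is the
> Zope root).  It's a simplified sketch: customer_ids() and make_proxy()
> stand in for my app-specific RDB code, the folder name is made up, and
> the subtransaction commit is the part I haven't actually tried yet:
>
>   folder = app.customers            # a BTreeFolder-derived container
>   count = 0
>   for cust_id in customer_ids():    # ids pulled from the RDB
>       ob = make_proxy(cust_id)      # build the lightweight proxy object
>       folder._setOb(cust_id, ob)    # skip _setObject()'s extra work
>       ob = folder._getOb(cust_id)   # re-fetch; should be acq-wrapped
>       ob.index_object()             # CatalogAware indexing
>       count = count + 1
>       if count % 1000 == 0:
>           get_transaction().commit(1)   # subtransaction, bounds memory
>   get_transaction().commit()            # final commit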
>
> I haven't seen any material documenting anything like this for BTreeFolder,
> so I figured I would share with zope-dev what I found, in the hopes that
> developers creating products with BTreeFolder and/or future implementations
> of BTreeFolder might take this into account, in docs, if nothing else.
>
> Caveats:
> - I'm using FileStorage and an old version of Zope (2.3.3).  I can't say how
> this will perform with Python 2.1/Zope 2.[4/5].  I imagine that one would
> want to pack the storage between full rebuilds, or have very, very fast
> storage hardware.
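>
> For what it's worth, the pack between rebuilds can be scripted from a
> debugging session rather than done through the management UI.  Something
> like this (a sketch; the exact pack() arguments may differ between ZODB
> versions):
>
>   import Zope
>   app = Zope.app()          # open the root from an interactive session
>   app._p_jar.db().pack()    # pack away old revisions left by the rebuild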
>
> - Catalog searches without any limiting queries to indexes will simply be
> too slow for practical use with this many objects, so they need to be
> forbidden with a permission to prevent accidental over-utilization of
> system resources or DOS-style attacks.  Otherwise, Catalog searches on my
> slow hard drive seem acceptable.
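>
> Besides the permission, I'm also considering a code-level guard in my own
> search method, so an empty query can never reach the Catalog.  An untested
> sketch (the index names are placeholders for my app's indexes):
>
>   def guarded_search(catalog, query):
>       # 'query' is a mapping of index name -> value; refuse searches
>       # that don't constrain at least one of the real indexes
>       allowed = ('full_name', 'address', 'account_id')
>       ok = 0
>       for name in allowed:
>           if query.has_key(name):
>               ok = 1
>       if not ok:
>           raise ValueError, 'unconstrained Catalog searches are forbidden'
>       return apply(catalog.searchResults, (), query)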
>
> - I'm not too concerned with BTreeFolder __getattr__() performance
> penalties, though I modified BTreeFolder.__getattr__ just in case, removing
> the 'if tree and tree.has_key(name)' check and replacing it with
> try/except; I'm not sure if this helps or hinders, because my stress-test
> code uses _getOb() instead.
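>
> The replacement looks roughly like this (paraphrased from memory, and
> '_tree' stands in for whatever the BTree attribute is really called):
>
>   def __getattr__(self, name):
>       # optimistic lookup: try the BTree directly and translate a miss
>       # into AttributeError, instead of paying for has_key() every time
>       try:
>           return self._tree[name]
>       except (KeyError, TypeError):   # TypeError covers tree being None
>           raise AttributeError, name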
>
> - objectIds() doesn't work; or, more accurately, at first glance, <dtml-var
> "_.len(objectIds())"> doesn't work; I haven't tested anything else.  I would
> like to find out why this is, and fix it.  I suppose that there is something
> done in ObjectManager that BTreeFolder's simple _setOb() doesn't do.  If
> anyone wants to help me figure out the obvious here, I'd appreciate it. ;)
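>
> My guess, just from a quick skim of OFS/ObjectManager.py and from memory
> (so please check the real source), is that _setObject() does bookkeeping
> along these lines that a bare _setOb() call skips:
>
>   # condensed paraphrase, not the verbatim Zope 2.3 code
>   def _setObject(self, id, object, roles=None, user=None, set_owner=1):
>       self._checkId(id)                       # id sanity/security checks
>       self._objects = self._objects + (       # per-object metadata tuple
>           {'id': id, 'meta_type': object.meta_type},)
>       self._setOb(id, object)
>       object = self._getOb(id)
>       object.manage_afterAdd(object, self)    # add-time hooks
>       return id
>
> If the stock objectIds() reads self._objects rather than the BTree itself,
> that would explain why it comes back empty after bare _setOb() calls, but
> I'd want to verify that against the source.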
>
> - I don't think un-indexed access of records is likely to be very practical
> with this many objects, especially if things like objectIds() are broken,
> which increases the value of Catalog; and I think that what my experiences
> with this project are showing is that Catalog indexing isn't as
> expensive/slow as I initially thought it would be.  That said, I'm sure
> there can be improvements in Catalog, as has often been discussed here
> recently, but for now, I think I'm happy. :)
>
> - I haven't compared these results with OFS.Folder.Folder yet.  I'm too
> lazy/busy to comparison-test.
>
> - I'm relatively sure that, in my app, the text index BTrees in the Catalog
> are very 'bushy' (more so than normal) because I am indexing people's full
> names and street addresses, which means there are fewer common words than
> when indexing, say, an everyday document.
>
> - Also, I want to make it clear that if I had a data access API that needed
> more than simple information about my datasets (i.e. I was trying to do
> reporting on patterns, as in CRM-ish types of applications), I would likely
> wrap a function around indexes kept in the RDB, not in Catalog.  My app
> requires no reporting functionality, and thus really needs no indexes other
> than for finding a record for customer service and account validation
> purposes.  The reason I chose ZCatalog, however, was for full text indexing
> that I could control/hack/customize easily.  My slightly uninformed belief
> now is that for big datasets or "enterprise" applications (whatever that
> means), I would use a hybrid set of (faster) indexes: the RDB's indexes
> where appropriate (heavily queried fields), and ZCatalog for TextIndexes
> (convenient).  I'm sure inevitable improvements to ZCatalog (there seems to
> be community interest in such) will help here.
>
> - I wonder if "directory-storage" combined with ReiserFS might make for an
> interesting future ZODB choice for this sort of app.
>
> Sean
>
> =========================
> Sean Upton
> Senior Programmer/Analyst
> SignOnSanDiego.com
> The San Diego Union-Tribune
> 619.718.5241
> sean.upton@uniontrib.com
> =========================
>
> _______________________________________________
> Zope-Dev maillist  -  Zope-Dev@zope.org
> http://lists.zope.org/mailman/listinfo/zope-dev
> **  No cross posts or HTML encoding!  **
> (Related lists -
>  http://lists.zope.org/mailman/listinfo/zope-announce
>  http://lists.zope.org/mailman/listinfo/zope )
>