[ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about
jim at zope.com
Mon Mar 26 15:55:33 EDT 2007
On Mar 26, 2007, at 3:28 PM, Dieter Maurer wrote:
> Jim Fulton wrote at 2007-3-25 09:53 -0400:
>> On Mar 25, 2007, at 3:01 AM, Adam Groszer wrote:
>>> MF> I think one of the main limitations of the current catalog (and
>>> MF> hurry.query) is efficient support for sorting and batching the
>>> MF> results. The Zope 3 catalog returns all matching results, which
>>> can then
>>> MF> be sorted and batched. This will stop being scalable for large
>>> MF> collections. A relational database is able to do this
>>> internally, and is
>>> MF> potentially able to use optimizations there.
>> What evidence to you have to support this assertion? We did some
>> literature search on this a few years ago and found no special trick
>> to avoid sorting costs.
>> I know of 2 approaches to reducing sort cost:
>> 1. Sort your results based on the "primary key" and therefore, pick
>> your primary key to match your sort results. In terms of the Zope
>> catalog framework, the primary keys are the document IDs, which are
>> traditionally chosen randomly. You can pick your primary keys based
>> on a desired sort order instead. A variation on this theme is to use
>> multiple sets of document ids, storing multiple sets of ids in each
>> index. Of course, this approach doesn't help with something like
>> relevance ranks.
>> 2. Use an N-best algorithm. If N is the size of the batch and M is
>> the corpus size, then this is O(M*ln(N)) rather than O(M*ln(M)) which
>> is a significant improvement if N << M, but still quite expensive.
> The major costs in sorting are usually not the "log(n)" but
> the very high linear costs fetching the sort keys (although for
> huge n,
> we will reach the asymptotic limits).
Right. The problem is the "N" not the log(N). :)
> Under normal conditions, a relational database can be far more
> to fetch values either from index structures or the data records
> than Zope -- as
> * its data representation is much more compact
> * it often supports direct access
> * the server itself can access and process all data.
> With the ZODB, the data is hidden in pickles (less compact), there is
> no direct access (instead the complete pickle need to be decoded)
The catalog sort index mechanism uses the un-index data structure in
the sort index to get sort keys. This is a pretty compact data
> all operations are done in the client (rather than in the server).
Which is often fine if the desired data are in the client cache. It
avoids making the storage a bottleneck.
Jim Fulton mailto:jim at zope.com Python Powered!
CTO (540) 361-1714 http://www.python.org
Zope Corporation http://www.zope.com http://www.zope.org
More information about the ZODB-Dev