[ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter

Mon Mar 26 15:28:23 EDT 2007

Jim Fulton wrote at 2007-3-25 09:53 -0400:
>
>On Mar 25, 2007, at 3:01 AM, Adam Groszer wrote:
>> MF> I think one of the main limitations of the current catalog (and
>> MF> hurry.query) is efficient support for sorting and batching the  
>> query
>> MF> results. The Zope 3 catalog returns all matching results, which  
>> can then
>> MF> be sorted and batched. This will stop being scalable for large
>> MF> collections. A relational database is able to do this  
>> internally, and is
>> MF> potentially able to use optimizations there.
>
>What evidence to you have to support this assertion?  We did some  
>literature search on this a few years ago and found no special trick  
>to avoid sorting costs.
>
>I know of 2 approaches to reducing sort cost:
>
>1. Sort your results based on the "primary key" and therefore, pick  
>your primary key to match your sort results.  In terms of the Zope  
>catalog framework, the primary keys are the document IDs, which are  
>traditionally chosen randomly.  You can pick your primary keys based  
>on a desired sort order instead. A variation on this theme is to use  
>multiple sets of document ids,  storing multiple sets of ids in each  
>index.  Of course, this approach doesn't help with something like  
>relevance ranks.
>
>2. Use an N-best algorithm.  If N is the size of the batch and M is  
>the corpus size, then this is O(M*ln(N)) rather than O(M*ln(M)) which  
>is a significant improvement if N << M, but still quite expensive.

The major costs in sorting are usually not the "log(n)" but
the very high linear costs fetching the sort keys (although for huge n,
we will reach the asymptotic limits).

Under normal conditions, a relational database can be far more efficient
to fetch values either from index structures or the data records
than Zope -- as

  * its data representation is much more compact

  * it often supports direct access

  * the server itself can access and process all data.

With the ZODB, the data is hidden in pickles (less compact), there is
no direct access (instead the complete pickle need to be decoded) and
all operations are done in the client (rather than in the server).

That said, I fear we can nothing do in ZODB land to significantly
improve on this. We would instead need something like relational
based catalogs.

And of course, relational databases, too, cannot beat the asymptotic
complexity of sorting -- they can only improve the linear factor.

-- 
Dieter