[ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter

Mon Mar 26 15:55:33 EDT 2007

On Mar 26, 2007, at 3:28 PM, Dieter Maurer wrote:

> Jim Fulton wrote at 2007-3-25 09:53 -0400:
>>
>> On Mar 25, 2007, at 3:01 AM, Adam Groszer wrote:
>>> MF> I think one of the main limitations of the current catalog (and
>>> MF> hurry.query) is efficient support for sorting and batching the
>>> query
>>> MF> results. The Zope 3 catalog returns all matching results, which
>>> can then
>>> MF> be sorted and batched. This will stop being scalable for large
>>> MF> collections. A relational database is able to do this
>>> internally, and is
>>> MF> potentially able to use optimizations there.
>>
>> What evidence to you have to support this assertion?  We did some
>> literature search on this a few years ago and found no special trick
>> to avoid sorting costs.
>>
>> I know of 2 approaches to reducing sort cost:
>>
>> 1. Sort your results based on the "primary key" and therefore, pick
>> your primary key to match your sort results.  In terms of the Zope
>> catalog framework, the primary keys are the document IDs, which are
>> traditionally chosen randomly.  You can pick your primary keys based
>> on a desired sort order instead. A variation on this theme is to use
>> multiple sets of document ids,  storing multiple sets of ids in each
>> index.  Of course, this approach doesn't help with something like
>> relevance ranks.
>>
>> 2. Use an N-best algorithm.  If N is the size of the batch and M is
>> the corpus size, then this is O(M*ln(N)) rather than O(M*ln(M)) which
>> is a significant improvement if N << M, but still quite expensive.
>
> The major costs in sorting are usually not the "log(n)" but
> the very high linear costs fetching the sort keys (although for  
> huge n,
> we will reach the asymptotic limits).

Right. The problem is the "N" not the log(N). :)

> Under normal conditions, a relational database can be far more  
> efficient
> to fetch values either from index structures or the data records
> than Zope -- as
>
>   * its data representation is much more compact
>
>   * it often supports direct access
>
>   * the server itself can access and process all data.
>
>
> With the ZODB, the data is hidden in pickles (less compact), there is
> no direct access (instead the complete pickle need to be decoded)

The catalog sort index mechanism uses the un-index data structure in  
the sort index to get sort keys. This is a pretty compact data  
structure.

> and
> all operations are done in the client (rather than in the server).

Which is often fine if the desired data are in the client cache.  It  
avoids making the storage a bottleneck.

Jim

--
Jim Fulton			mailto:jim at zope.com		Python Powered!
CTO 				(540) 361-1714			http://www.python.org
Zope Corporation	http://www.zope.com		http://www.zope.org