[ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter

Sun Mar 25 09:53:07 EDT 2007

On Mar 25, 2007, at 3:01 AM, Adam Groszer wrote:
> MF> I think one of the main limitations of the current catalog (and
> MF> hurry.query) is efficient support for sorting and batching the  
> query
> MF> results. The Zope 3 catalog returns all matching results, which  
> can then
> MF> be sorted and batched. This will stop being scalable for large
> MF> collections. A relational database is able to do this  
> internally, and is
> MF> potentially able to use optimizations there.

What evidence to you have to support this assertion?  We did some  
literature search on this a few years ago and found no special trick  
to avoid sorting costs.

I know of 2 approaches to reducing sort cost:

1. Sort your results based on the "primary key" and therefore, pick  
your primary key to match your sort results.  In terms of the Zope  
catalog framework, the primary keys are the document IDs, which are  
traditionally chosen randomly.  You can pick your primary keys based  
on a desired sort order instead. A variation on this theme is to use  
multiple sets of document ids,  storing multiple sets of ids in each  
index.  Of course, this approach doesn't help with something like  
relevance ranks.

2. Use an N-best algorithm.  If N is the size of the batch and M is  
the corpus size, then this is O(M*ln(N)) rather than O(M*ln(M)) which  
is a significant improvement if N << M, but still quite expensive.

I don't think relational databases have any magic bullet to get  
around sorting costs.  Sorting is expensive.  In many ways, I think  
the sorting support in the catalog gave people a false sense of  
security.

Jim

--
Jim Fulton			mailto:jim at zope.com		Python Powered!
CTO 				(540) 361-1714			http://www.python.org
Zope Corporation	http://www.zope.com		http://www.zope.org