[ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter

Jim Fulton jim at zope.com
Sun Mar 25 13:50:12 EDT 2007


On Mar 25, 2007, at 12:33 PM, Martijn Faassen wrote:

> Hey Jim,
>
> Jim Fulton wrote:
>
>> On Mar 25, 2007, at 3:01 AM, Adam Groszer wrote:
>>> MF> I think one of the main limitations of the current catalog (and
>>> MF> hurry.query) is efficient support for sorting and batching  
>>> the query
>>> MF> results. The Zope 3 catalog returns all matching results,  
>>> which can then
>>> MF> be sorted and batched. This will stop being scalable for large
>>> MF> collections. A relational database is able to do this  
>>> internally, and is
>>> MF> potentially able to use optimizations there.
>> What evidence to you have to support this assertion?
>
> I have the strong suspicion that modern relational databases are  
> currently better able to scale at queries using LIMIT and ORDER BY  
> than the Zope 3 catalog.

I had a similar suspicion.  I assigned the Python Labs team the task  
of finding out through literature search the approaches used.  They  
found that there were none other than the sorts of things I've  
mentioned.

> I cannot back this up as I haven't done measurements. Perhaps you  
> have done so?

We did a literature search.

> * Do you estimate the performance of the Zope 3 catalog to be  
> equivalent to the performance of a modern relational database  
> system for queries that need to sort and batch their results?

I estimate that the same issues apply to both.

> * If so, do you think it's just as easy for a developer to  
> accomplish such equivalent performance with the Zope 3 catalog as  
> it is with a relational database?

It depends on who the developer is.  :)  Using an alternative primary  
key isn't obvious.  Keeping extra sets of keys isn't supported in the  
current index implementations.


> I've made a number of assertions:
>
> a) one of the main limitations of the current catalog and  
> hurry.query is efficient support for sorting and batching.
>
> b) the Zope 3 catalog returns all matching results, which can then  
> be sorted and batched. This will stop being scalable for large  
> collections.
>
> I'll amend b) by saying 'This will stop being scalable for large  
> result sets'. I agree that b) as stated above is incorrect as the  
> result set might be small, but I intended the amendment.

And I assumed it.  I'll assert that even an N-best approach isn't  
really scalable either.  The only way I know of do scalable result  
sorts is to use the primary keys or keep supplemental sort key data  
in each of the indexes.

> c) A relational database is able to do sorting and batching (limit  
> queries) internally, and is potentially able to use optimizations  
> here.

Except that there aren't really any optimizations to use.

>
> Which of these assertions are false?

The implied assertion that relational databases can do sorts scalably  
on anything but a primary key.

> Don't you think relational database system that has support for  
> sorting and batching built into its query API can at the very least  
> more easily use approaches to reduce sorting cost, by rewriting the  
> query, caching, and potentially employing special indexes?

I think it's conceivable, and with careful database design, even likely.

>
>> We did some literature search on this a few years ago and found no  
>> special trick to avoid sorting costs.
>
> I am at least cursorily aware of challenges surrounding efficient  
> querying and batching. I am not looking for a special trick or  
> magic bullet. I'd just like more help in avoiding sorting cost in a  
> typical situation where results are displayed in a batched format.

But the primary issue isn't really batching.  An N-best algorithm cab  
reduce sort cost a lot, but not enough to be scalable.


> If a catalog query returns 1 million results, which I want to show  
> in batches of 10, sorted by some property of the results, I would  
> like to reduce the costs. Currently the pattern I (and I imagine  
> others) employ is to re-execute the query and then sort these  
> results in memory for each batch, for each request.

No, the current state of the art is to do an N-best operation on the  
result. I'm 99% some Zope catalog code does this.  My point is that  
this isn't good enough.

> [you list some approaches to reduce sorting cost]
>
> I would like some system that helps me reduce some of these costs,  
> using the approaches you list, or at least some caching somewhere.  
> I would imagine a relational database for instance can employ  
> caching of result sets, so that if no writes occurred, a second  
> LIMIT query asking for a different range will return results a lot  
> faster.
>
> Apparently the catalog does support N-best, you state later in this  
> thread.

I'm pretty sure that that is the case.  I know we built it into  
ZCTextIndex.

> How does one use this support? Can I add it to hurry.query somehow?

Heck if I know. I haven't worked with catalog code for a long time.  :)

Look at zope.index.nbest.

> Perhaps all this is not the reponsibility of the catalog itself,  
> but a system surrounding it. As long as it's obviously there for  
> people to use.

Certainly, n-best should generally be integrated into searching code.  
An n-best algorithm helps, but it doesn't really achieve scalability.

> Perhaps however I am seeing problems that aren't there?

I think you are seeing solutions that aren't there, or at least  
aren't as straightforward as you (or people misinterpreting your  
statements) might expect.

> Do you think there is no problem and we have parity with relational  
> database implementations here?

Obviously, in general, we don't have parity with RDBMSs on query  
automation.

My main point is that there is no simple way to do sorting scalably.   
Sorting is a trap that I've seen many projects fall into.  People  
expect that because the catalog has some sort of sorting capability  
or because SQL has sorting keywords that somehow the "experts" have  
figured out how to make it go fast "automatically".

> Do you think the current situation cannot be improved much further?

For some definition of "much" yes, I think it can't be improved much  
further wrt sorting.  Scalable sorting is going to have a major  
impact on an application design.  It will be disruptive.  We can  
better educate people about that.

> Do you think any further improvements are not worth the costs?

I think further improvements are warrented, but they will not achieve  
the goal that many people expect.

Jim

--
Jim Fulton			mailto:jim at zope.com		Python Powered!
CTO 				(540) 361-1714			http://www.python.org
Zope Corporation	http://www.zope.com		http://www.zope.org





More information about the ZODB-Dev mailing list