[ZODB-Dev] RFE: Spec for ZODB Indexing

Christian Reis kiko@async.com.br
Wed, 18 Sep 2002 16:48:10 -0300


On Fri, Jun 07, 2002 at 09:55:56AM -0400, Casey Duncan wrote:
> The ObjectHub, which is in development for Zope3, looks to tackle many
> of these issues (object relations, indexing, information retrieval). I
> might suggest taking a look at this before starting from scratch.

We did consider this and looked at the code, but our requirements are
for basic collections, with indexing, queries and aggregation. The
ObjectHub provides much more than we are looking for, since it also
tackles locating objects in complex hierarchies and joins, neither of
which is part of our main usage patterns.

> I'll have to say that I disagree that there should be a particular query
> language at the root of a ZODB indexing system (other than just pure
> Python). The QueryObject proposal and pythonindexer promote an object
> based query representation.

This is no issue with IndexedCatalog, at any rate, since the parser and
the indexing mechanism are quite separate. Even the query engine is
simple enough to allow having another query language attached.

We'd be more than happy to see a QueryObject front end for
IndexedCatalog, and would be willing to work with anyone interested in
developing it with us. At the moment, the query string is both simple
and comprehensive enough to just "workforme (tm)" :-)

> Now that's not to say that someone won't want OQL (which is not really
> very standardized AFAICT), or some other language to query with. It
> should be straightforward to write a parser that converts some OQL
> dialect to QueryObjects.

That's conceptually our approach with IC - the parser just needs to
return blocks that are sent off to the respective indexes.
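
As a rough sketch of that split (the names FieldIndex, parse and run are
made up for illustration and are not IndexedCatalog's actual API, and
the tiny grammar is an assumption), a parser can turn the query string
into (field, operator, value) blocks and hand each block to the index
for that field:

```python
import re


class FieldIndex:
    """Hypothetical per-field index: maps a value to the ids of objects holding it."""

    def __init__(self):
        self._values = {}

    def add(self, oid, value):
        self._values.setdefault(value, set()).add(oid)

    def search(self, op, value):
        if op == "==":
            return set(self._values.get(value, set()))
        if op == "!=":
            return {oid for v, oids in self._values.items()
                    if v != value for oid in oids}
        raise ValueError("unsupported operator: %r" % op)


# One block: a field name, an operator, and a quoted string or an integer.
_BLOCK = re.compile(r"\s*(\w+)\s*(==|!=)\s*(?:'([^']*)'|(\d+))\s*$")


def parse(query):
    """Split a query string into (field, op, value) blocks.

    Deliberately naive: blocks are joined by 'and' only, and a quoted
    value must not itself contain ' and '.
    """
    blocks = []
    for part in re.split(r"\s+and\s+", query):
        match = _BLOCK.match(part)
        if match is None:
            raise ValueError("cannot parse: %r" % part)
        field, op, text, number = match.groups()
        blocks.append((field, op, text if text is not None else int(number)))
    return blocks


def run(query, indexes):
    """Send each block to its field's index and intersect the results."""
    result = None
    for field, op, value in parse(query):
        hits = indexes[field].search(op, value)
        result = hits if result is None else result & hits
    return result if result is not None else set()


indexes = {"name": FieldIndex(), "age": FieldIndex()}
for oid, name, age in [(1, "kiko", 30), (2, "casey", 30)]:
    indexes["name"].add(oid, name)
    indexes["age"].add(oid, age)

matches = run("name == 'kiko' and age == 30", indexes)
```

Since run() only ever sees the blocks, a different front end (an OQL
dialect, or QueryObjects) would only need to produce the same blocks.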

>   - Indexing can be expensive, and most applications won't want to
>     index all the objects in the ZODB. Those objects that are indexed
>     won't be indexed in the same way. Perhaps Interfaces should dictate
>     if/how an object is indexed.

We can't rely on Zope, so our implementation of this is a class
attribute, _ic_ignore, which lists the fields that shouldn't be
indexed. These fields can't be queried, and an exception is raised when
a query string refers to one of them.
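
A minimal sketch of the idea, assuming _ic_ignore holds a sequence of
attribute names (the helper functions and the exception class below are
hypothetical, not IndexedCatalog's real code):

```python
class IgnoredFieldError(Exception):
    """Hypothetical error for a query that names a field listed in _ic_ignore."""


class Person:
    # Fields listed here are skipped when the object is indexed.
    _ic_ignore = ("password",)

    def __init__(self, name, password):
        self.name = name
        self.password = password


def indexable_fields(obj):
    """Return the instance attributes that should be indexed."""
    ignored = set(getattr(type(obj), "_ic_ignore", ()))
    return {key: value for key, value in vars(obj).items()
            if key not in ignored and not key.startswith("_")}


def check_query_fields(cls, fields):
    """Raise if a query refers to a field that was never indexed."""
    ignored = set(getattr(cls, "_ic_ignore", ()))
    for field in fields:
        if field in ignored:
            raise IgnoredFieldError(
                "%s.%s is not indexed" % (cls.__name__, field))


fields = indexable_fields(Person("kiko", "secret"))
```

Failing loudly on an ignored field avoids silently returning an empty
result for a query that could never match.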

>   - To compensate for the indexing expense, there is development of
>     the idea of asynchronous indexing which does indexing in batches
>     which increases efficiency.

When processing large amounts of data, I agree this might be a problem.
At the moment, for our application (pretty much an OLTP-type app), we
get away with the fact that database imports don't happen very often,
and most of the data is entered piecemeal. Performance is also not high
on our priority list, since the current speed with indexing is more
than acceptable (brute force, of course, is miserable).

In any case, we haven't started handling concurrency inside IC, which
we would have to do if the simple "use-another-thread-for-that"
approach is wanted.

>   - Brute force searching is very expensive (in terms of time and 
>     memory) and usually not very useful. Zope has a ZopeFind function
>     that does this and it is useful really only for database management.

And yet, for substring searches, there are not many known alternatives.
Queries for words could use TextIndex, but beyond that, does anyone have
suggestions of ways to implement this?
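
To make the gap concrete, here is a small sketch in plain Python (this
is not how TextIndex is implemented; the function names and sample data
are illustrative only). A word index answers whole-word queries with a
lookup, while a substring query has to fall back to scanning every
value:

```python
def build_word_index(docs):
    """Map each lowercased word to the ids of the documents containing it."""
    index = {}
    for oid, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(oid)
    return index


def word_search(index, word):
    # A single dictionary lookup; no document is scanned.
    return set(index.get(word.lower(), set()))


def substring_search(docs, fragment):
    # Brute force: every document is examined on every query.
    fragment = fragment.lower()
    return {oid for oid, text in docs.items() if fragment in text.lower()}


docs = {
    1: "Indexed catalog for ZODB",
    2: "Object hub in Zope3",
    3: "Catalog queries",
}
index = build_word_index(docs)

whole_word = word_search(index, "catalog")     # found through the index
fragment_hit = substring_search(docs, "atal")  # found only by scanning
fragment_miss = word_search(index, "atal")     # the word index cannot help
```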

Take care,
--
Christian Reis, Senior Engineer, Async Open Source, Brazil.
http://async.com.br/~kiko/ | [+55 16] 261 2331 | NMFL