[ZODB-Dev] RFE: Spec for ZODB Indexing

Casey Duncan casey@zope.com
07 Jun 2002 09:55:56 -0400


See:

http://www.zope.org/Members/Kaivo/CatalogQuery
http://www.sf.net/projects/pythonindexer
http://dev.zope.org/Wikis/DevSite/Proposals/ObjectHub
http://dev.zope.org/Wikis/DevSite/Proposals/QueryObject
http://cvs.zope.org/Zope3/lib/python/Zope/ObjectHub/?only_with_tag=Zope-3x-branch

The ObjectHub, which is in development for Zope3 looks to tackle many on
these issues (object relations, indexing, information retrieval). I
might suggest taking a look at this before starting from scratch.

I'll have to say that I disagree that there should be a particular query
language at the root of a ZODB indexing system (other than just pure
Python). The QueryObject proposal and pythonindexer promote an object
based query representation.

Now that's not to say that someone won't want OQL (which is not really
very standardized AFAICT), or some other language to query with. It
should be straightforward to write a parser that converts some OQL
dialect to QueryObjects.

Other things to consider:

  - Indexing can be expensive, and most applications won't want to
    index all the objects in the ZODB. Those objects that are indexed
    won't be indexed in the same way. Perhaps Interfaces should dictate
    if/how an object is indexed.

  - To compensate for the indexing expense, there is development of
    the idea of asynchronous indexing which does indexing in batches
    which increases efficiency.

  - Brute force searching is very expensive (in terms of time and 
    memory) and usually not very useful. Zope has a ZopeFind function
    that does this and it is useful really only for database management.

This is a complex problem that is difficult to solve generically. I wish
you the best of luck!

-Casey 

On Thu, 2002-06-06 at 17:24, Christian Reis wrote:
> 
> I've discussed with Patrick O'Brien and Johan Dahlin over the past week
> on and off something that has turned into this specification, and I
> would like to share it with the other list members so they can make a
> critique of our intention. We already have some code that shows that it
> can be done, but we're not very experienced with the ZODB, and it will
> probably show :-)
> 
> 
> ZODB Indexing DRAFT 1
> Christian Reis for Async Open Source
> kiko@async.com.br
> June 2002
> 
> 1. Introduction
> 
>     ZODB is an object-oriented persistence mechanism for Python.  In the
>     relational world, a database provides both persistence and retrieval
>     means; the ZODB has very simplified provisions for retrieval, and
>     this makes applications always harder to create when they need to
>     access data without a very structured path.
> 
>     This proposal intends to uncover the key aspects of a retrieval and
>     indexing mechanism that would work externally from the ZODB, and
>     independently from Zope, the most common application to use the
>     ZODB. In other words, it intends to specify a set of classes that
>     could be used directly from any application that desires to retrieve
>     objects easily from a ZODB store.
> 
> 2. Key concepts
> 
>     The ZODB provides more than decent persistence. This means that we
>     can trust the objects are saved accordingly, and that they can be
>     retrieved through a graph of object references, including extended
>     types which allow the use of lists and mappings of objects. It also
>     provides transactional mechanisms that allow application-database
>     consistency to be maintained.
> 
>     Our mechanism intends to store and manage these references, so it
>     provides a front-end for getting objects out of the database. 
> 
>     For retrieval to work, there must be a way to specify which
>     object(s) we want. There are a couple of scenarios here:
> 
>         1. We want all objects in the database
>         2. We want a subset of objects in the database, which possess 
>            attributes according to a specified condition. 
>         3. We want a specific object, identified by a reference number
>            which would be something like an oid. XXX: does such a thing
>            exist for OO programming?
> 
>     The first is easily provided by a dump()-like call. The last can be
>     provided by the existing mechanisms for ZODB retrieval. However, the
>     most common form of request is the 2nd, and for that an API for
>     specifying the instances desired must be created.
> 
>     One way to implement the API for retrievals is to provide a
>     simplified query language, where the attributes and conditions could
>     be specified. OQL is quite probably overkill for this, but a
>     simplified, pythonesque language could be devised.
> 
> 2.1. Common conditions
>     
>     When retrieving objects, we'd like to be able to specify the
>     attributes that identify the object in a certain way. The basic
>     Python conditional operators provide a mecahnism we could use:
> 
>     - Equality, and the "is" operator
>     - Greater/Less than comparisons
>     - The in operator for lists
>     - XXX: anything else?
> 
>     We would also like to support boolean operations upon these
>     conditionals - and, or and not.
> 
> 2.2. Joins and totals
> 
>     The SQL join concept is probably not immediately necessary here; it can
>     be argued that joins should really be, in OO modelling, replaced by
>     designing a good network of object references between instances and
>     the other data they will commonly be associated with.
> 
>     However, in the case of making summaries of information stored in
>     the ZODB, there is still need for the suggestion of a good solution.
>     For a simple example, we'll have a database with a collection of
>     Product instances, each with individual stock quantities. How do we
>     provide a means to discover the total stock in the collection?
>     Should the application handle this by caching this information based
>     on updates to individual stock?
> 
> 2.3 Query language
> 
>     An idea for specifying conditions is using a string as a pseudo-query
>     language, which can be easily parsed by a python eval(), such as:
> 
>         "obj.a == foo and obj.b <= bar and obj.c in [ baz, noogie ]"
> 
>     This has the advantage of having this format is familiarity with the
>     actualy language. Granted, we are generating a string for our query,
>     which is definitely not native, but it seems to be an acceptable
>     compromise. XXX: is there another solution, apart from ugly things
>     like [ "obj.a", "<", "43" ]?
> 
>     In this string, we use the special name "obj" to refer to the object
>     which we are querying against. All attributes must be qualified by
>     this name.
>     
>     Additionally, if we can hook into the Python parser for the
>     evaluation, type errors can be automatically caught.  Finally, the
>     possibility of doing str([ a, b, c]) seems interesting for
>     implementing the condition for the in operation. 
>     
>     However, it remains to be seen how easy this is to integrate with
>     indexing, described further on, specially for comparisons and
>     substring matches.
> 
> 3. Implementation
> 
>     The most simple way to implement this API is to provide parsing, and
>     when processing a query, to do sequencial, brute-force searches
>     using an API for the conditionals described above. However, for
>     large collections of objects, this will be insufficient for
>     performance reasons.
> 
>     In this case, an indexing mechanism is necessary. Indexes would
>     allow us to avoid sequential searching of all the objects,
>     providing a fast lookup means for each attribute specified. Indexes
>     require knowledge of instance meta-data, so either they are kept
>     dynamically adapted to the instances schema, or they are explicitly
>     created and destroyed. XXX: do we provide for schema changes?
> 
>     A concept for instance aggregation has to be provided too. This
>     aggregator would hold a collection or instance references and the
>     indexes, and would provide both query interface and index
>     maintenence functionality. This aggregator would also be a ZODB
>     persistent object.
> 
>     String Indexes can use the TextIndex module that is available from
>     Zope.com. XXX: only allows searches based on words, would have to
>     revert to sequential scan for substring, is there a solution?
> 
> 3.1 Index queries
> 
>     The simplest identity comparisons are easily performed with indexes;
>     however comparisons and substring/list operations may prove more
>     difficult. To address this, there should be a fallback to sequential
>     scanning when addressing an query unsupported by the index, as in
>     relational database query processing.
> 
>     This will allow development of this indexing mechanism to evolve
>     gradually; as it becomes more mature, less queries will result in
>     sequential scans.
> 
> 4. API example
> 
>         class Catalog(Persistent):
>             XXX: Entity?
>             def __init__(self, XXX):
>                 XXX: define what kind of instance it stores
>                 XXX: instance meta-data acquisition?
>                 XXX: index auto-creation?
>             def dump(self):
>                 XXX: return list of instances
>             def query(self, q):
>                 XXX: process query
>             def init_index(self, instance_attr):
>                 XXX: specify only indexes you want to avoid bloat?
> 
> 4.1 Query regex
> 
> The expression we are using for parsing at the moment is:
> 
>     "\s*(\".*?\"|'.*?'|\S+)\s+(==|\!=|<=|>=|<|>)\s+(\".*?\"|'.*?'|\S+)"
> 
> 4.2 Query samples 
> 
>     "obj.a < 10"
>     "obj.b > 10 and obj.a < 10"
>     'obj.a == "foobar"'
>     'obj.a in %s' % str([1,2,3]) # ugly
>     'obj.x and obj.y == 1 and obj.z <= 3'
> 
> 5. Development steps
> 
>     - Stabilize requirements for conditionals, summaries and joins
>     - Think up Catalog API
>     - Design query language
>     - Implement basic, brute-force queries
>     - Implement simple indexing
>     - Link queries into indexes
>     - Implement complete indexes for more data types
>     - Make test application
>     - Extend indexing to conditionals
>     - Extend indexing to summaries?
> 
> 
> Take care,
> --
> Christian Reis, Senior Engineer, Async Open Source, Brazil.
> http://async.com.br/~kiko/ | [+55 16] 272 3330 | NMFL
> 
> 
> _______________________________________________
> For more information about ZODB, see the ZODB Wiki:
> http://www.zope.org/Wikis/ZODB/
> 
> ZODB-Dev mailing list  -  ZODB-Dev@zope.org
> http://lists.zope.org/mailman/listinfo/zodb-dev