[Zope3-dev] Florent's O-R blog entry

Wed Aug 24 12:02:05 EDT 2005

Gary Poster wrote:
> 
> On Aug 24, 2005, at 6:27 AM, Martijn Faassen wrote:
[snip]
>> Now as to where I see areas where features are lacking in the Zope 3
>> catalog:
>>
>> Underfeatured query API
>> -----------------------
[snip query API discussion]

> Another way of looking at this--or simply an additional feature on top
> of a query language--might be to make the IFBTree results easier to
> manipulate.  The code in the zc sandbox for the extent
> (http://svn.zope.org/Sandbox/zc/catalog/extentcatalog.py) is a sketch
> of what I mean--following some of the set API, for instance.  The
> reason for my interest in this is that we have very little code that
> uses the catalog to return objects--just IFBTree data structures.  Just
> working with the IFBTree data structures gives you a lot more
> flexibility for integration of catalog results with other data structures.

My code works mostly with the IFBTree objects as well, though I'll have 
to check out your code to see what you mean exactly.
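
To make the idea of set-style manipulation concrete, here is a minimal
sketch. It assumes plain dicts (docid -> score) as a stand-in for IFBTree
results; the helper names are hypothetical and are not taken from
extentcatalog.py.

```python
# Plain dicts (docid -> score) stand in for IFBTree results in this sketch.

def intersect(a, b):
    """Keep docids present in both results; scores are added."""
    return {docid: a[docid] + b[docid] for docid in a.keys() & b.keys()}


def unite(a, b):
    """Keep docids present in either result; shared scores are added."""
    result = dict(a)
    for docid, score in b.items():
        result[docid] = result.get(docid, 0.0) + score
    return result


def subtract(a, b):
    """Keep docids of a that do not occur in b."""
    return {docid: score for docid, score in a.items() if docid not in b}


published = {1: 1.0, 2: 1.0, 3: 1.0}
by_author = {2: 0.5, 3: 0.25, 4: 1.0}
hidden = {3: 1.0}

# Combine results without ever loading the objects behind the docids.
visible = subtract(intersect(published, by_author), hidden)
```

The point is that results stay as docid mappings until the very end, so
they can be combined freely with other data structures before any object
is loaded.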

> Casey Duncan had explored some very interesting ideas in his pypes
> project (http://cvs.zope.org/Packages/pypes/) for a query language, by
> the way, but his ambition is still largely unrealized, even though much
> of his query language work could be ported to Zope 3 without a huge
> amount of trouble.

I need to take a look at this; I've seen the checkins but never quite 
got what this was about. :)

> Arguably, query optimization would be the feature that would make a
> given syntax win.

Or at least a given AST; a syntax is not strictly necessary.
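
As an illustration of a query expressed purely as a tree of objects, here
is a minimal sketch. The node classes and the evaluate() function are
hypothetical, and the indexes are assumed to be plain dicts mapping a
value to a set of docids; this is not how any existing catalog code is
implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Equals:
    index: str
    value: object


@dataclass(frozen=True)
class InSet:
    index: str
    values: tuple


@dataclass(frozen=True)
class And:
    left: object
    right: object


def evaluate(node, indexes):
    """Return the set of docids matching the query tree."""
    if isinstance(node, Equals):
        return set(indexes[node.index].get(node.value, ()))
    if isinstance(node, InSet):
        result = set()
        for value in node.values:
            result |= indexes[node.index].get(value, set())
        return result
    if isinstance(node, And):
        return evaluate(node.left, indexes) & evaluate(node.right, indexes)
    raise TypeError("unknown query node: %r" % (node,))


# Assumed index layout: index name -> value -> set of docids.
indexes = {
    'workflow_state': {'NEW': {1, 4}, 'PUBLISHED': {2, 3}},
    'author': {'faassen': {1, 2}, 'poster': {3, 4}},
}
query = And(InSet('workflow_state', ('NEW', 'PUBLISHED')),
            Equals('author', 'faassen'))
matches = evaluate(query, indexes)
```

Because the query is just data, an optimizer could in principle rewrite
the tree before it is evaluated, whatever syntax (if any) produced it.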

>> Fast, easy batching/sorting
>> ---------------------------
>>
>> I don't know how to do easy, efficient batching/sorting with the
>> catalog. I'd like to be able to query *just* a batch of objects,
>> sorted, for user interface purposes. There doesn't seem to be a
>> straightforward way to do this yet, and this is a very common use
>> case. The batching implementation sitting out there in
>> zope.bugtracker.batching is nice, but doesn't deal with the catalog.
>>
>> I think this should be fixable with a bit more infrastructure though.
>> Getting the right batch is just a query on an index, and the result
>> can be sorted afterwards, though there are tricky issues getting the
>> right batch *size*.
> 
> Since we usually work with IFBTree data structures until the very end,
> we get most of the benefits of batching.  Once you are done with your
> processing, you can simply wrap the result in a
> zope.app.catalog.catalog.ResultSet (or similar) and be good to go.
> This is why I think making the story for working directly with the
> BTree data structures easier might be a good way to go.

But my batching depends on sorting. I.e. I want to batch through a 
sorted list of results.

> Sorting is hard to do efficiently, and easy to *think* you are making
> an optimization.  We are currently doing it "naively" (to the degree
> that using the very efficient Python sort is naive), and Jim refers to
> research that indicates that a good non-naive approach is not clear.  I
> can certainly imagine various approaches.  Carefully arranging your
> merges can actually result in a pre-sorted set.  We're not being that
> careful.

I use a very naive approach now too, but that means I have to do the 
realization of all the objects into a ResultSet *before* the sort can 
happen. Waking up all those objects just to sort them for each batch 
feels wrong. I'm not being careful enough with my query operations 
either to have a pre-sorted set.
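
One way to avoid waking up objects might look like the sketch below. It
assumes the sort key for each docid is available from some forward
mapping (docid -> value), here a plain dict; sorted_batch() is a
hypothetical helper, and whether the catalog indexes can supply such a
mapping cheaply is an open question.

```python
import heapq


def sorted_batch(docids, sort_values, start, size):
    """Return one batch of docids ordered by sort_values[docid].

    Only docids and index values are touched; no objects are loaded.
    """
    # Only the first start + size entries need to be put in order.
    head = heapq.nsmallest(
        start + size, docids,
        key=lambda docid: (sort_values[docid], docid))
    return head[start:start + size]


# Assumed forward mapping from docid to the value being sorted on.
sort_values = {10: 'delta', 11: 'alpha', 12: 'charlie',
               13: 'bravo', 14: 'echo'}
result_docids = [10, 11, 12, 13, 14]

first = sorted_batch(result_docids, sort_values, 0, 2)
second = sorted_batch(result_docids, sort_values, 2, 2)
```

Only the docids in the returned batch would then need to be turned into
objects. This is still a naive sort over the whole result, just one that
works on index values instead of realized objects.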

>> Missing powerful query concepts
>> -------------------------------
>>
>> Certain powerful query concepts like joins, available in a relational
>> setting, are missing. I've already run into a scenario where I wanted
>> to do something like this: given a bunch of version objects with field
>> 'id', where multiple objects can have the same 'id' to indicate
>> they're versions of the same object, I want all objects where field
>> 'workflow_state' is 'PUBLISHED' unless there is another object with
>> the same id that has workflow_state 'NEW', in which case I want that
>> one.
>>
>> I think joins would be a way to solve it, though I haven't figured
>> out the details, nor how to implement them efficiently on top of the
>> catalog. This kind of thing is where a relational database makes life
>> a lot simpler.
> 
> I guess that's taste.  I'd be happier with Python.

It's performance too, not just taste. I can solve this in Python, but it is:

* more, harder to read code.

* much much slower than it potentially could be.

I.e. now I have code that looks approximately like this:

def newestVersions():
    """Return newest versions of a particular object.

    If there is a version that's PUBLISHED and a version that's NEW,
    return NEW version only in results.

    Multiple versions of the same object are identified with an id
    they share.
    """
    # Each index is addressed as (catalog name, index name).
    state_index = 'document_catalog', 'workflow_state'
    id_index = 'document_catalog', 'workflow_id'
    query = InSet(state_index, [NEW, PUBLISHED])
    q = zapi.getUtility(IExtendedQuery)
    for version in q.searchResults(query):
        s = version.getState()
        if s == PUBLISHED:
            # Inner query: is there a NEW version sharing this id?
            workflow_id = version.getId()
            query2 = And(InSet(state_index, [NEW]),
                         Equals(id_index, workflow_id))
            if q.searchResults(query2):
                continue  # skip this result, as there's a new version
        yield version

The second, inner query is not very pleasant to do and I'd prefer to
avoid it by having a way to do a join on the workflow_id index.
Readability-wise, I'd prefer to be able to write a single query instead
of a complicated loop.

I realize that one can always say: "You should've designed your 
application differently to avoid this issue", but I think the general 
pattern where you want to ask:

   give me all objects with field A having state 1 unless there's
   another object somehow related to it through field B, that has
   field A state 2

is something that will appear in applications and that should have a
reasonably succinct query representation with a fast answer. Relational
databases offer this power, but the Zope Catalog right now doesn't seem to.
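
For comparison with the loop above, here is a minimal sketch of the same
"PUBLISHED unless a NEW version shares the id" rule done entirely on
docid sets. It assumes two hypothetical structures that stand in for the
catalog indexes: a state index (state -> set of docids) and a forward
mapping (docid -> workflow id). It is not a description of what the
catalog offers today.

```python
def newest_version_docids(state_index, workflow_id_of):
    """Return docids of NEW versions, plus PUBLISHED versions that
    have no NEW sibling sharing their workflow id."""
    new = state_index.get('NEW', set())
    published = state_index.get('PUBLISHED', set())
    # The "join": collect the workflow ids that have a NEW version.
    ids_with_new = {workflow_id_of[docid] for docid in new}
    kept = {docid for docid in published
            if workflow_id_of[docid] not in ids_with_new}
    return new | kept


# Assumed index contents, for illustration only.
state_index = {'NEW': {3}, 'PUBLISHED': {1, 2}, 'ARCHIVED': {4}}
workflow_id_of = {1: 'doc-a', 2: 'doc-b', 3: 'doc-a', 4: 'doc-b'}

result = newest_version_docids(state_index, workflow_id_of)
```

This replaces the per-result inner query with one pass over each set and
needs no objects to be loaded, which is roughly what a join on the
workflow_id index would buy.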

[snip]

>> Ease of deployment
[snip]

> While I agree with your general point, Ruby on Rails might call that
> assertion into question a bit.

True, and so do Django and others. I wonder whether they have as much
of a commons of shared components as Zope 2 does; I'm not familiar
enough with the projects to judge this.

[snip]
> Good points.
> 
> I'll add another.
> 
> Component system
> 
> Because of the Zope 3 component system, if we can use the current
> catalog interface, or invent another, to develop both a ZODB/BTree-based
> implementation and an RDBMS-based implementation, it's possible
> that users who wanted to choose the RDBMS strengths would be able to do
> so without dividing the user community.

Yes, this is taking the transparency a step further. Basically what
you'd be doing is building an object-relational abstraction based on the
Zope catalog. :)

[snip]
>> While the transparency has many benefits
>> mentioned before, the more straightforward mapping has the benefits
>> of simplicity, may map to relational databases more easily, and may
>> expose powerful relational features more straightforwardly.
> 
> It's true.  I hope that an entire platform doesn't force the decision
> on its potential users, though.

Agreed.

Regards,

Martijn