[Zope3-dev] catalog 'all documents' abstraction

Tue Aug 30 14:25:37 EDT 2005

On Aug 30, 2005, at 1:57 PM, Martijn Faassen wrote:

> Jim Fulton wrote:
>
>> Martijn Faassen wrote:
>>
> [snip]
>
>
>>> ). I also think however that it's the wrong place to ask for this
>>> information, as this doesn't work with the extentcatalog.
>>>
>> Well, it depends on what you meant by "indexed" above.  Different
>> indexes index different objects.  The extent catalog tried to define
>> an extent for which it makes sense to apply different (not) operations.
>>
>
> And is the idea that multiple extent catalogs can share an extent?

They could.  We haven't needed that yet.

> (By the way you say 'the extent catalog tried', does this mean  
> something else is being considered now?)

Not to my knowledge, and I just asked Jim, and he said there was no  
special significance to the past tense.  :-)

>>> The catalog itself seems like the wrong place to ask as well  
>>> however, as things would get hairy in the case of a query over  
>>> multiple catalogs -- which catalog would be asked for all ids  
>>> that are indexed?
>>>
>> In general, I'd like the catalog to remain fairly small and free of
>> logic.  I wanted to say this in the other thread you started on
>> cataloging, but didn't get to it.  Ideally, a catalog wouldn't have
>> any query logic at all.  People should be able to innovate on query
>> algorithms without affecting catalogs.
>>
>
> This is already clear; I've been trying to do so in the project I'm  
> working on, though I'm more focusing on a sensible query language  
> (well, of Python objects) than performance algorithms.
>
> At the same time I believe Zope 3 *does* need query systems built  
> in eventually. While it's fine to allow people to design their own  
> query languages and algorithms, not everybody is able to do this,  
> and those who are able to don't always want to. Even if they did, I  
> don't want us to end up with 5 different query systems either.
>
> So, while I agree that a query language in the core should not  
> exclude someone from building something better, I do believe a  
> catalog query language package is needed in the core.
>
> (To be absolutely clear: I also think the RDF avenues being  
> explored are very interesting, and I don't want to imply that this  
> is not an interesting direction, but I do think we need something  
> for the plain catalog too)

This all makes sense to me, btw (as is probably clear from my RDF
messages).  Query language arguments have been persuasive to me.
That said, I still don't find the lack of a query language to be an
impediment.  It's a nice-to-have for me and arguably an
essential-to-have to support a larger audience.

>>> Hm, perhaps this isn't ideal either, as this would get hairy in  
>>> case of a query that spans multiple catalogs -- which catalog  
>>> will be asked in that case for a list of all documents?
>>>
>> I think in the particular case of "not", you have to have an implicit
>> or explicit set that you are subtracting something from.  The "right"
>> set is application specific.  In any case, I think the query logic
>> should be in separate query components.
>
> I agree that the catalog should remain nice and simple.
>
> That said, catalogs right now already have an implicit concept of
> 'everything indexed', which for instance is already used for
> re-indexing; it's just not made available to someone who wants to
> build something on it.
>
> The extent catalog makes this more explicit by defining an extent,  
> so perhaps this is the way to go. The extent could be a query  
> parameter to help the query engine figure out what to do in case of  
> 'not'. For simple use cases, the extent can be constructed from the  
> intid utility, perhaps.
>
> It would be helpful if someone could explain the motivations behind  
> the extent catalog, by the way -- this information seems to be  
> missing in zc.catalog. Am I at all on the right track with my  
> thinking on it?

It should be pointed out initially that the son-of-queued-catalog
code doesn't have anything to do with extents.  I think Jim wants
that factored out when we have time so that it can be a mix-and-match
capability.  I think you are asking about extents themselves, though.

We had three use cases that led us to extents.

First, we wanted several catalogs that each indexed only certain
things.  This could have been done by subscribers, so this wasn't
terribly compelling by itself.

Second, we wanted to transparently support queries that merged  
results across catalog-like data structures.  The catalog defined the  
items we wanted to search through, while some of the other data  
structures kept track of a larger set of objects (subsuming the set  
that the catalog cared about).  Sometimes, users could perform a  
query that didn't actually use any of the catalog's data structures,  
but that should be filtered by the set of the catalog's objects--its  
extent.
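
A minimal sketch of that second use case, assuming plain Python sets of
integer ids as stand-ins for the BTrees sets a real catalog would hold.
The function name and the sample ids are hypothetical and are not part
of zc.catalog:

```python
# Hypothetical stand-ins: plain sets of integer ids play the role of the
# BTrees set structures a real catalog and its neighbours would use.

def filter_by_extent(extent, external_results):
    """Keep only the ids from another data structure that the catalog tracks."""
    return external_results & extent


catalog_extent = {1, 2, 3, 5}   # ids the catalog cares about
wider_results = {2, 3, 4, 8}    # ids returned by a structure tracking more objects

# The query never touched a catalog index, yet the catalog's extent
# still decides which ids are allowed into the final result.
visible = filter_by_extent(catalog_extent, wider_results)
```

The point of the sketch is only that the extent is usable on its own, as
a filter, even when none of the catalog's indexes take part in the query.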

Third, we wanted to let our indexes' data be usable for NOT queries.
In order to do that, we needed an IFBTree structure that describes
the complete set for a given catalog, so that a contained index can
simply (and reasonably efficiently) subtract the query result from
the full set.  The indexes in zc.catalog also use extents for some
other similar tricks.
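
The subtraction in the third use case can be sketched roughly as
follows.  This is an illustration under stated assumptions, not the
zc.catalog implementation: `TinyValueIndex` and its method names are
hypothetical, and plain Python sets stand in for the IFBTree structure.

```python
class TinyValueIndex:
    """Hypothetical index mapping a value to the set of ids that have it."""

    def __init__(self, extent):
        # The extent is owned by the catalog and shared with its indexes.
        self.extent = extent
        self._ids_by_value = {}

    def index_doc(self, docid, value):
        self._ids_by_value.setdefault(value, set()).add(docid)

    def apply(self, value):
        """Ids whose indexed value equals ``value``."""
        return set(self._ids_by_value.get(value, ()))

    def apply_not(self, value):
        """Ids in the extent that do NOT match: full set minus the result."""
        return self.extent - self.apply(value)


extent = {1, 2, 3, 4}            # the complete set for this catalog
index = TinyValueIndex(extent)
index.index_doc(1, "draft")
index.index_doc(2, "published")
index.index_doc(3, "draft")
# id 4 is in the extent but has no value in this index

not_draft = index.apply_not("draft")
```

Note that id 4 appears in `not_draft` even though this index never saw
it.  Without the extent, the index could only answer NOT from the ids it
happens to hold itself.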

An extent that accepts all objects would effectively be the data
structure you want, as I understand it.  It is actually (at least
typically for us) different than the intid mapping because there are
several classes of things that have intids that are not cataloged.
If more than one catalog indexes the same objects, I'd first wonder
why the indexes were not all in the same catalog; I'd second say that
yes, they probably could share a filter-less extent.
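
To make the filtered versus filter-less distinction concrete, here is a
small sketch.  Everything in it is hypothetical (`TinyExtent`, the
`accepts` parameter, the `Document` and `Session` classes); it does not
reproduce the actual zc.catalog classes, and a plain set stands in for
the BTrees structure.

```python
class TinyExtent:
    """Hypothetical extent: a set of ids plus an optional filter."""

    def __init__(self, accepts=None):
        self.accepts = accepts   # callable(obj) -> bool, or None for no filter
        self.ids = set()

    def add(self, docid, obj):
        """Record ``docid`` if the filter (when present) accepts ``obj``."""
        if self.accepts is None or self.accepts(obj):
            self.ids.add(docid)
            return True
        return False


class Document:
    """Hypothetical content class that should be cataloged."""


class Session:
    """Hypothetical class that has an id but should not be cataloged."""


documents_only = TinyExtent(accepts=lambda obj: isinstance(obj, Document))
everything = TinyExtent()        # filter-less: accepts all objects

objects = {1: Document(), 2: Session(), 3: Document()}
for docid, obj in objects.items():
    documents_only.add(docid, obj)
    everything.add(docid, obj)
```

Here `everything.ids` plays the role of the "all documents" set, while
`documents_only.ids` shows why an extent usually differs from the full
set of objects that have ids.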

If we want any of zc.catalog in the Zope core, each component will  
certainly need a proposal, by the way: we're offering this as code  
that has helped us out and that we think might help others, either  
directly or as ideas, so we are not duplicating effort.  We're not  
proclaiming it to necessarily be our next core step.

Gary