[Zope3-dev] catalog 'all documents' abstraction

Gary Poster gary at zope.com
Wed Aug 31 10:50:21 EDT 2005


On Aug 31, 2005, at 5:41 AM, Martijn Faassen wrote:

> Gary Poster wrote:
>
>> On Aug 30, 2005, at 1:57 PM, Martijn Faassen wrote:
>>
> [snip]
>
>
>>> It would be helpful if someone could explain the motivations behind
>>>  the extent catalog, by the way -- this information seems to be  
>>> missing in zc.catalog. Am I at all on the right track with my  
>>> thinking on it?
>>>
>
>
>> It should be pointed out initially that the son-of-queued-catalog
>> code doesn't have anything to do with extents.  I think Jim wants
>> that factored out when we have time so that can be a mix-and-match  
>> capability.  I think you are asking about extents themselves, though.
>>
> Okay, I didn't realize yet glancing at this that this is *also*
> son-of-queued catalog. Interesting. I'll glance at it some more. :)

As Jim said, characterizing it as "son of queued catalog" is perhaps  
hyperbole.  Maybe "a queued catalog that is easier to set up and  
helpful but not as effective as the Zope 2 queued catalog" or simply  
"nephew of queued catalog" would have been better.  :-)

>> We had three use cases that led us to extents.
>> First, we wanted several catalogs that only indexed certain different
>>  things.  This could have been done by subscribers, so this wasn't  
>> terribly compelling by itself.
>>
>
> Okay, this is clear. It's not that clear to me how to efficiently make a
> subscriber only handle one object type (I've been using the "is this a
> IFoo? If not, return" pattern at the start of subscribers), but that's
> another discussion.

Stephan replied to this.

Interestingly, because catalogs forward their indexing requests to  
the contained indexes, they act a bit like an event channel, even  
though the event itself is no longer part of the communication.   
Therefore the extent catalog's approach is still reasonably  
efficient, without an intermediate filtering subscriber, as long as  
you don't need the filtering subscriber to ping any other components  
too.
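
A rough sketch of that "filter once, then forward" idea, in case it helps.
This is a toy, not the zc.catalog code: ToyIndex, ToyExtentCatalog, and the
"accepts" predicate are hypothetical names, and plain dicts stand in for
real objects and indexes.

```python
class ToyIndex:
    """Hypothetical index: remembers one field value per docid."""

    def __init__(self, field):
        self.field = field
        self.values = {}

    def index_doc(self, docid, obj):
        self.values[docid] = obj.get(self.field)

    def unindex_doc(self, docid):
        self.values.pop(docid, None)


class ToyExtentCatalog:
    """Hypothetical catalog: filters once, then forwards to its indexes."""

    def __init__(self, accepts):
        self.accepts = accepts  # predicate deciding what belongs here
        self.extent = set()     # docids this catalog is responsible for
        self.indexes = {}

    def index_doc(self, docid, obj):
        if not self.accepts(obj):
            # Not ours (or no longer ours): make sure nothing stays indexed.
            self.unindex_doc(docid)
            return
        self.extent.add(docid)
        # Forward the request, much like an event channel would.
        for index in self.indexes.values():
            index.index_doc(docid, obj)

    def unindex_doc(self, docid):
        self.extent.discard(docid)
        for index in self.indexes.values():
            index.unindex_doc(docid)


catalog = ToyExtentCatalog(lambda obj: obj.get("kind") == "document")
catalog.indexes["title"] = ToyIndex("title")
catalog.indexes["creator"] = ToyIndex("creator")

catalog.index_doc(1, {"kind": "document", "title": "cats", "creator": "gary"})
catalog.index_doc(2, {"kind": "folder", "title": "stuff", "creator": "martijn"})
```

The folder never reaches the indexes, and no separate filtering subscriber
was involved.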

>> Second, we wanted to transparently support queries that merged
>> results across catalog-like data structures.  The catalog defined the
>> items we wanted to search through, while some of the other data
>> structures kept track of a larger set of objects (subsuming the set
>> that the catalog cared about).  Sometimes, users could perform a
>> query that didn't actually use any of the catalog's data structures,
>> but that should be filtered by the set of the catalog's objects--its
>> extent.
>
> I'm not sure I comprehend the motivations behind this one. Could you
> elaborate?

Sure.

What I'm about to describe wasn't our exact use case, but is a  
reasonable example, I hope.

Imagine you have two components: one that keeps track of how often  
an object is viewed, and a catalog.  The view tracker might use  
intids, but because of ConflictError problems with write-on-read, you  
probably wouldn't want to store the data in the ZODB; moreover, it is  
not an index.  It's a separate component.

Further, imagine that the view tracker keeps track of more objects  
than your catalog does (maybe you have multiple catalogs, maybe the  
view tracker has responsibilities for non-content objects).

Now you are building a search form for your content objects.  You  
don't want your user to be aware of the separation of  
responsibilities in your design: you want to let the user say things  
like "show me all the content objects Martijn created that have been  
viewed more than 100 times" or something like that.  That's a catalog  
query intersected with a view tracker query: no extent needed.

You also want to let the user say "show me all the content objects  
that have been viewed more than 100 times".  The content objects set  
is defined by the catalog: it's the set you are searching through.   
But the view tracker doesn't have any concept of that set--it has its  
own larger responsibilities.  Enter extent, stage right.  The  
catalog's extent can be intersected with the view tracker results,  
and the user gets the expected results.

End example.
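
Here is a minimal sketch of the two queries in that example.  Everything in
it is made up for illustration: the intids, the view counts, and the
viewed_more_than helper are hypothetical, and plain Python sets stand in for
the intid sets a real catalog and view tracker would hand back.

```python
# Hypothetical view tracker data: intid -> number of views.  It knows
# about more objects (4 and 5) than the catalog does.
view_counts = {1: 250, 2: 40, 3: 500, 4: 120, 5: 101}

# Hypothetical catalog data, as sets of intids.
catalog_extent = {1, 2, 3}    # every object the catalog indexes
created_by_martijn = {2, 3}   # result of a catalog query


def viewed_more_than(n):
    """Return the intids the view tracker has seen viewed more than n times."""
    return {docid for docid, count in view_counts.items() if count > n}


popular = viewed_more_than(100)

# "Content Martijn created, viewed more than 100 times": the catalog
# query already limits the set, so no extent is needed.
martijn_popular = created_by_martijn & popular

# "Content viewed more than 100 times": there is no catalog query, so the
# extent is what limits the view tracker results to content objects.
all_popular_content = catalog_extent & popular
```

Without the extent, the second query would also return intids 4 and 5, which
the catalog knows nothing about.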

The view tracker is of a class of components that can generate  
results that might need to be transparently merged with cataloged  
objects, sometimes when there is no catalog query.  Extents are a  
solution to the part of the story when you have no catalog query.

>> Third, we wanted to let our indexes' data be usable for NOT queries.
>> In order to do that, we needed an IFBTree structure that describes
>> the complete set for a given catalog, so that a contained index can
>> simply (and reasonably efficiently) subtract the query result from
>> the full set.  The indexes in zc.catalog also use extents for some
>> other similar tricks.
>>
>
> This one's also clear.
>
>
>> An extent that accepts all objects would effectively be the data  
>> structure you want, as I understand it.
>
> I'm not sure -- 'not' is indeed context dependent, so which extent is in
> use to determine the results of a 'not' operation depends on the query.
> I think it's okay to ask the users to explicitly specify the extent when
> they're doing the query, as long as there's an easy way to construct it
> for the simple cases.

"not" only really needs an extent if it is the only clause in the  
query.  We don't optimize stuff this way now, but in theory, if you  
say "give me content with 'cats' in the title that Gary didn't  
create", the steps are

- set1 = content with 'cats'
- set2 = content Gary created
- return set1 - set2

You only need an extent if you want "give me content objects that  
were not created by the engineers".  Then, the catalog's extent is  
just what you want.

- set1 = content created by the engineers
- return extent - set1

The naive approach to the first example that does use an extent but  
works harder is

- set1 = content with 'cats'
- set2 = content Gary created
- set3 = extent - set2
- return intersection of set1 and set3
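
The three step lists above can be sketched with plain Python sets.  The
docids are invented for illustration, and the sets stand in for the IFBTree
structures the real indexes would return.

```python
# Hypothetical docid sets.
extent = {1, 2, 3, 4, 5, 6}        # everything the catalog indexes
with_cats = {1, 2, 3}              # content with 'cats' in the title
created_by_gary = {2, 5}
created_by_engineers = {2, 4, 5}

# "cats" that Gary didn't create: a plain difference, no extent involved.
cats_not_gary = with_cats - created_by_gary

# "not created by the engineers" on its own: the extent is the full set
# that the query result is subtracted from.
not_engineers = extent - created_by_engineers

# The naive approach to the first query: negate against the extent
# first, then intersect.  Same answer, one more set operation.
naive_cats_not_gary = with_cats & (extent - created_by_gary)
```

The naive version only matches the direct difference because every index
result is a subset of the extent, which is what you would expect of indexes
living in that catalog.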

>> It is actually (at least typically for us) different than the intid
>> mapping because there are several classes of things that have intids
>> that are not cataloged.
>>
>
>
>> If more than one catalog all index the same
>> objects, I'd first wonder  why the indexes were not all in the same
>> catalog;
>>
>
> Good question. I think one example of such a scenario is if you  
> wrote a codebase, and I extended this codebase with some adapters  
> which carry around information I also want indexed. I may decide  
> not to introduce new indexes into your catalog but instead produce  
> my own catalog to have the concerns separate from each other. In  
> this case I'd want to do queries over multiple catalogs which index  
> the same objects.

OK, understood.  Seems like you would only want to do that if you had  
a pretty compelling reason to separate the concerns, but I'll grant  
that such reasons probably exist. :-)

>> I'd second say that  yes, they probably could share a
>> filter-less extent.
>>
>
> Why filter-less? I mean, wouldn't you want to filter on object type?

Default catalogs are filter-less--that is, any filtering happens  
before you get to them, in a subscriber.

All I'm trying to say is that, if all your catalogs are guaranteed to  
handle the same set of objects, then sure, you could share extents.

>> If we want any of zc.catalog in the Zope core, each component will  
>> certainly need a proposal, by the way: we're offering this as code
>> that has helped us out and that we think might help others, either
>> directly or as ideas, so we are not duplicating effort.  We're not
>> proclaiming it to necessarily be our next core step.
>
> Understood. I'm giving you feedback. :)

Much appreciated. :-)

> We (Infrae) are going to put some code online eventually that we  
> produced in the project we're working on, for the same reasons.

Sounds awesome.

Gary

