[Zope3-dev] Sprintathon: searching?

Steve Alexander steve@cat-box.net
Mon, 04 Nov 2002 23:57:01 +0200


> I can be on IRC though (albeit with the usual timezone shift).

I'm sure that will be very helpful if we work on porting ZCTextIndex.


>>>I might help in two ways:
>>>
>>>- I can try to write up use cases / stories / tasks.
>>
>>This would be really useful. I'd really like to see a collection of 
>>use-cases that describe concrete examples of the requirements people 
>>have in Zope 2.
> 
> I'm not sure I can help you with that, but I'm an experienced searcher
> (:-), so I can provide typical search scenarios.  Note that
> ZCTextIndex only does full text search -- I personally find other
> kinds of searches much less useful so I tend to focus on full text.

The other kinds of search are typically more useful for application 
designers, or when people are classifying things hierarchically or 
according to keywords.

But, for websites, and regular content, a good text index is essential.


> There's a bit of background and description in
> Products/ZCTextIndex/README.txt (in Zope2/lib/python).

I'll read that tomorrow.


> There are really two parts to the code in the Products/ZCTextIndex
> directory: the ZMI interface and the indexing engine itself.  The ZMI
> interface, which also defines how ZCTextIndex hooks into ZCatalog,
> needs to be rewritten from scratch using Zope3 APIs.  The indexing
> engine itself should require very minor tweaks; it's just a Python
> application that uses some persistent classes, and doesn't have much
> Zope-specific code in it.  I expect that the only changes required are
> changes in the Persistent API.

Perhaps some thought needs to be applied to how to ensure bugfixes to 
the core of the code get applied to both Z2 and Z3 versions.


> There's a little bit of C code, so a setup.py probably needs to be
> made.  There may be a few things that aren't worth keeping
> (e.g. CosineIndex -- I believe OkapiIndex

Is that some kind of dwarf zebra?

> is always better).  The
> RiceCode.py file is what Jim would call a "decoy" -- it's not used.

Why is it there? I guess I'll find out when I read it.


> An open problem that's not really solved completely in the Zope2 case
> (I believe) is how to extract the indexable text from an object once
> it is decided that it should be indexed. 

In Zope3, I guess there will be an interface, ISearchableText, that 
defines a getSearchableText() method, and adapters from most content 
types to that interface. (Change spelling according to taste.)
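
A minimal sketch of that idea, in plain Python so it stands alone. Everything here except the ISearchableText and getSearchableText() names is hypothetical: the Document class, the adapter class and the tiny registry are illustrations, not Zope3 APIs.

```python
# Hypothetical sketch only; this is not the Zope3 component architecture.

class ISearchableText:
    """Objects that can supply text to a full-text index."""

    def getSearchableText(self):
        raise NotImplementedError


class Document:
    """Stand-in content type (assumed for illustration)."""

    def __init__(self, title, body):
        self.title = title
        self.body = body


class DocumentSearchableText(ISearchableText):
    """Adapter from Document to ISearchableText."""

    def __init__(self, context):
        self.context = context

    def getSearchableText(self):
        return "%s %s" % (self.context.title, self.context.body)


# Toy registry: content class -> adapter factory.
_adapters = {}


def provideAdapter(content_class, factory):
    _adapters[content_class] = factory


def getSearchableTextAdapter(obj):
    """Return an adapter for obj, or None if its type has none registered."""
    for klass in type(obj).__mro__:
        factory = _adapters.get(klass)
        if factory is not None:
            return factory(obj)
    return None


provideAdapter(Document, DocumentSearchableText)
```

The index then only ever talks to getSearchableText(), and each content type decides for itself what text it exposes.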


> There's a flexible
> tokenization pipeline, but this is configured once per index instance
> rather than per document type.  I imagine adapters should work nicely
> here.

Yep. OK, some component engineering to look at :)
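
To make the per-document-type idea concrete, here is a rough sketch of a pipeline chosen by document type. The step functions, the stop-word list and the "code" document type are all assumptions made up for illustration; they are not the pipeline elements that ship with ZCTextIndex.

```python
import re

# Hypothetical pipeline steps: each takes a list of strings and
# returns a new list of strings.


def split(texts):
    """Break each input string into word tokens."""
    tokens = []
    for text in texts:
        tokens.extend(re.findall(r"\w+", text))
    return tokens


def lowercase(tokens):
    return [token.lower() for token in tokens]


STOP_WORDS = {"the", "a", "of"}  # illustrative only


def remove_stop_words(tokens):
    return [token for token in tokens if token not in STOP_WORDS]


def run_pipeline(pipeline, text):
    """Feed text through each step in order."""
    tokens = [text]
    for step in pipeline:
        tokens = step(tokens)
    return tokens


DEFAULT_PIPELINE = [split, lowercase, remove_stop_words]

# Per-document-type overrides (assumed names): source code keeps its
# case and its stop words.
PIPELINES = {"code": [split]}


def tokenize(doc_type, text):
    pipeline = PIPELINES.get(doc_type, DEFAULT_PIPELINE)
    return run_pipeline(pipeline, text)
```

With adapters, the dictionary lookup would presumably become an adapter lookup on the content object, but the shape of the pipeline stays the same.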


> This may also hook into the "batching" catalog -- that's a Zope2 thing
> about which I know nothing except that it exists (I don't even know
> its name).

It is in CVS at http://cvs.zope.org/Products/QueueCatalog/


> This moves catalog updates out of the current transaction,
> batching them up in a queue that is inspected every so often by a
> background thread (I believe).  But perhaps that should be a separate
> service in Zope3.

It will be some special EventChannel, I think.

That would be a good task for the sprint too, if no one tackles it 
before then.
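
As a starting point for discussion, the batching idea might look roughly like this. It is only a sketch: the class and method names are invented here, it is not how QueueCatalog is implemented, and whatever calls process() (a background thread, an event subscriber) is left out.

```python
from collections import OrderedDict


class ToyIndex:
    """Stand-in for a real text index; maps uid -> text."""

    def __init__(self):
        self.docs = {}

    def index_doc(self, uid, text):
        self.docs[uid] = text

    def unindex_doc(self, uid):
        self.docs.pop(uid, None)


class QueuedIndexer:
    """Record catalog requests now, apply them to the index later."""

    def __init__(self, index):
        self.index = index
        # uid -> (operation, text); a later request for the same uid
        # replaces the earlier one.
        self.pending = OrderedDict()

    def _enqueue(self, uid, operation, text):
        self.pending.pop(uid, None)
        self.pending[uid] = (operation, text)

    def catalog(self, uid, text):
        self._enqueue(uid, "index", text)

    def uncatalog(self, uid):
        self._enqueue(uid, "unindex", None)

    def process(self, limit=100):
        """Apply up to `limit` queued requests; return how many ran."""
        done = 0
        while self.pending and done < limit:
            uid, (operation, text) = self.pending.popitem(last=False)
            if operation == "index":
                self.index.index_doc(uid, text)
            else:
                self.index.unindex_doc(uid)
            done += 1
        return done
```

Collapsing repeated requests for the same object means an object edited several times between runs is only indexed once.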

--
Steve Alexander