[Zope3-dev] Sprintathon: searching?

Mon, 04 Nov 2002 16:35:37 -0500

> > Are there plans to work on searching at the Rotterdam Sprintathon?
> 
> Yes. Are you planning to be there?

Alas, not.  I would have loved to be in Holland for Sinterklaas, but
Zope's travel budget is limited.  I can be on IRC though (albeit with
the usual timezone shift).

> > I might help in two ways:
> > 
> > - I can try to write up use cases / stories / tasks.
> 
> This would be really useful. I'd really like to see a collection of 
> use-cases that describe concrete examples of the requirements people 
> have in Zope 2.

I'm not sure I can help you with that, but I'm an experienced searcher
(:-), so I can provide typical search scenarios.  Note that
ZCTextIndex only does full text search -- I personally find other
kinds of searches much less useful so I tend to focus on full text.

> > - I can help with porting ZCTextIndex (which I consider the current
> >   best practice) to Zope 3.
> 
> I think this would be a good task for some people at the sprintathon.
> 
> What kind of preparatory work do you think would help? Are there any 
> documents that describe the scope of ZCTextIndex, and how it is put 
> together? Do you think it will port "in one piece", or would it benefit 
> from "componentizing" internally?

There's a bit of background and description in
Products/ZCTextIndex/README.txt (in Zope2/lib/python).

There are really two parts to the code in the Products/ZCTextIndex
directory: the ZMI interface and the indexing engine itself.  The ZMI
interface, which also defines how ZCTextIndex hooks into ZCatalog,
needs to be rewritten from scratch using Zope3 APIs.  The indexing
engine itself should require very minor tweaks; it's just a Python
application that uses some persistent classes, and doesn't have much
Zope-specific code in it.  I expect that the only changes required are
those that follow from changes in the Persistent API.

There's a little bit of C code, so a setup.py probably needs to be
made.  There may be a few things that aren't worth keeping
(e.g. CosineIndex -- I believe OkapiIndex is always better).  The
RiceCode.py file is what Jim would call a "decoy" -- it's not used.

An open problem that's not really solved completely in the Zope2 case
(I believe) is how to extract the indexable text from an object once
it is decided that it should be indexed.  There's a flexible
tokenization pipeline, but this is configured once per index instance
rather than per document type.  I imagine adapters should work nicely
here.
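
To make the adapter idea a bit more concrete, here is a rough sketch in
plain Python.  Every name in it (register_text_adapter, searchable_text,
the Document and Image classes) is invented for illustration; none of it
is Zope3 or ZCTextIndex API.  It only shows the shape of the idea: the
text extractor is looked up by document type instead of being fixed per
index instance.

```python
# Hypothetical sketch: per-type adapters that extract indexable text.
# None of these names come from Zope3 or ZCTextIndex.

class Document:
    def __init__(self, title, body):
        self.title = title
        self.body = body


class Image:
    def __init__(self, caption):
        self.caption = caption


# Maps a class to a callable that returns the text to index.
_text_adapters = {}


def register_text_adapter(cls, adapter):
    """Register a text extractor for instances of cls."""
    _text_adapters[cls] = adapter


def searchable_text(obj):
    """Return the indexable text for obj, or None if it has no adapter."""
    # Walk the MRO so a subclass falls back to its base class's adapter.
    for cls in type(obj).__mro__:
        adapter = _text_adapters.get(cls)
        if adapter is not None:
            return adapter(obj)
    return None


register_text_adapter(Document, lambda d: d.title + "\n" + d.body)
register_text_adapter(Image, lambda i: i.caption)
```

The index would then call searchable_text(obj) and feed the result into
its tokenization pipeline, skipping objects for which it returns None.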

This may also hook into the "batching" catalog -- that's a Zope2 thing
about which I know nothing except that it exists (I don't even know
its name).  This moves catalog updates out of the current transaction,
batching them up in a queue that is inspected every so often by a
background thread (I believe).  But perhaps that should be a separate
service in Zope3.
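
Since I don't know the actual Zope2 code, the following is only my guess
at the general pattern, not a description of that product: updates are
put on a queue instead of being applied immediately, and a background
thread drains the queue every so often.  The class name BatchingIndexer
and its methods are made up for this sketch, and it ignores transactions
and persistence entirely.

```python
# Hypothetical sketch of queued, batched index updates.
# This is not the Zope2 "batching" catalog; all names are invented.
import queue
import threading


class BatchingIndexer:
    def __init__(self, index_func, interval=1.0):
        self._index_func = index_func    # called as index_func(doc_id, text)
        self._interval = interval        # seconds between queue inspections
        self._queue = queue.Queue()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        """Start the background thread that drains the queue."""
        self._thread.start()

    def enqueue(self, doc_id, text):
        """Record an update without touching the index yet."""
        self._queue.put((doc_id, text))

    def flush(self):
        """Apply all queued updates; return how many were applied."""
        batch = []
        while True:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        for doc_id, text in batch:
            self._index_func(doc_id, text)
        return len(batch)

    def _run(self):
        # Event.wait returns False on timeout, True once stop() is called.
        while not self._stop.wait(self._interval):
            self.flush()

    def stop(self):
        """Stop the background thread (if running) and apply what is left."""
        self._stop.set()
        if self._thread.is_alive():
            self._thread.join()
        self.flush()
```

A caller would do indexer.enqueue(doc_id, text) where it would otherwise
index directly, and call indexer.stop() at shutdown so nothing queued is
lost.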

--Guido van Rossum (home page: http://www.python.org/~guido/)