[Zope3-dev] Sprintathon: searching?

Guido van Rossum guido@python.org
Tue, 05 Nov 2002 11:45:41 -0500


> > I'm not sure I can help you with that, but I'm an experienced searcher
> > (:-), so I can provide typical search scenarios.  Note that
> > ZCTextIndex only does full text search -- I personally find other
> > kinds of searches much less useful so I tend to focus on full text.
> 
> The other kinds of search are typically more useful for application 
> designers,

Martijn Faassen replied to this too:

> Hm, I'm not sure what this means. Other kinds of searches (like a 
> catalog FieldIndex) are definitely important in much application logic.
> The use case here is often that the user is not building a search
> query directly but that the application is doing a query for the user,
> similar to the way SQL queries often aren't directly entered by the user
> but are built into the particular application. 

Good point; I'd forgotten about that.  Searches for meta_type=="News
Item" and so on are useful.  This is a totally separate function about
which I don't know much.

> For the end user I do agree full text searches tend to be the most
> useful. Anyway, all this doesn't mean you should focus on other kinds
> of indexes, I just wanted to make this use case explicit. The current
> ZCatalog suffers from trying to accommodate a human searcher and
> programmatic searches both; it's been extended over time from the first
> to the second, and it's not pretty. Better would be to have
> a core API for programmatic searches with perhaps an adapter on top of it
> to help with the common case of human full text searches (if the latter
> turns out to be necessary at all).

That's actually a good point -- perhaps the programmatic non-full-text
searches need to be accommodated by a different architecture than the
end-user full-text search.  (About the only qualification to the
full-text search that might occasionally be handy is a restriction on
date ranges, I'd expect.)
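
A minimal sketch of that split, in plain Python rather than any Zope
API.  Every name here (Doc, FieldIndex, TextSearch, the since/until
parameters) is hypothetical and only illustrates the idea: one
exact-match index for queries the application builds, and a separate
full-text search whose only extra qualifier is a date range.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Doc:
    """Hypothetical content object; not a Zope class."""
    doc_id: int
    meta_type: str
    modified: date
    text: str


class FieldIndex:
    """Exact-match index on one attribute, for queries built by application code."""

    def __init__(self, attr):
        self.attr = attr
        self._ids_by_value = {}

    def index(self, doc):
        value = getattr(doc, self.attr)
        self._ids_by_value.setdefault(value, set()).add(doc.doc_id)

    def search(self, value):
        return set(self._ids_by_value.get(value, ()))


class TextSearch:
    """Naive end-user full-text search with an optional date-range restriction."""

    def __init__(self):
        self._docs = {}

    def index(self, doc):
        self._docs[doc.doc_id] = doc

    def search(self, query, since=None, until=None):
        wanted = query.lower().split()
        hits = set()
        for doc in self._docs.values():
            words = set(doc.text.lower().split())
            if not all(w in words for w in wanted):
                continue  # every query word must occur in the document
            if since is not None and doc.modified < since:
                continue
            if until is not None and doc.modified > until:
                continue
            hits.add(doc.doc_id)
        return hits


docs = [
    Doc(1, "News Item", date(2002, 11, 1), "Zope3 sprint announced"),
    Doc(2, "Document", date(2002, 10, 1), "Sprint planning notes"),
    Doc(3, "News Item", date(2002, 9, 15), "ZCTextIndex released"),
]
by_type = FieldIndex("meta_type")
full_text = TextSearch()
for d in docs:
    by_type.index(d)
    full_text.index(d)

news_ids = by_type.search("News Item")          # programmatic query
recent_ids = full_text.search("sprint", since=date(2002, 10, 15))
```

The two classes share no code on purpose: the application-facing index
and the end-user search can then evolve separately.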

[back to SteveA; the > > quotes are my original post]
> or when people are classifying things hierarchically or 
> according to keywords.

> But, for websites, and regular content, a good text index is essential.
> 
> > There's a bit of background and description in
> > Products/ZCTextIndex/README.txt (in Zope2/lib/python).
> 
> I'll read that tomorrow.
> 
> 
> > There are really two parts to the code in the Products/ZCTextIndex
> > directory: the ZMI interface and the indexing engine itself.  The ZMI
> > interface, which also defines how ZCTextIndex hooks into ZCatalog,
> > needs to be rewritten from scratch using Zope3 APIs.  The indexing
> > engine itself should require very minor tweaks; it's just a Python
> > application that uses some persistent classes, and doesn't have much
> > Zope-specific code in it.  I expect that the only changes required are
> > changes in the Persistent API.
> 
> Perhaps some thought needs to be applied to how to ensure bugfixes to 
> the core of the code get applied to both Z2 and Z3 versions.

That's a general problem with Zope3 these days.  I don't have a good
suggestion about what to do about this; simply sharing the CVS modules
(like we do successfully between Zope2 and ZODB3) will be
counterproductive, since so much is really different with Zope3, and
any checkin would have to be tested in both contexts.

> > There's a little bit of C code, so a setup.py probably needs to be
> > made.  There may be a few things that aren't worth keeping
> > (e.g. CosineIndex -- I believe OkapiIndex
> 
> Is that some kind of dwarf zebra?
> 
> > is always better).  The
> > RiceCode.py file is what Jim would call a "decoy" -- it's not used.
> 
> Why is it there? I guess I'll find out when I read it.

Jeremy had plans for it.  I don't know the current status of those
plans.

> > An open problem that's not really solved completely in the Zope2
> > case (I believe) is how to extract the indexable text from an
> > object once it is decided that it should be indexed.
> 
> In Zope3, I guess there will be an interface, ISearchableText, that
> defines a getSearchableText() method, and adapters from most content
> types to that interface. (Change spelling according to taste.)
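
A rough sketch of what such an interface and its adapters might look
like.  This uses only the standard library, not Zope3's component
architecture; the content types, adapter classes and the registry
dict are invented for illustration.  Only the names ISearchableText
and getSearchableText() come from the message above.

```python
from abc import ABC, abstractmethod


class ISearchableText(ABC):
    """Interface proposed in the thread; spelling as suggested there."""

    @abstractmethod
    def getSearchableText(self):
        """Return the text that should go into the full-text index."""


# Hypothetical content types, invented for this sketch.
class NewsItem:
    def __init__(self, headline, body):
        self.headline = headline
        self.body = body


class Image:
    def __init__(self, title, data):
        self.title = title
        self.data = data


class NewsItemSearchableText(ISearchableText):
    """Adapter from NewsItem to ISearchableText."""

    def __init__(self, context):
        self.context = context

    def getSearchableText(self):
        return f"{self.context.headline} {self.context.body}"


class ImageSearchableText(ISearchableText):
    """Adapter from Image to ISearchableText."""

    def __init__(self, context):
        self.context = context

    def getSearchableText(self):
        # Only the title is text; the binary data is skipped.
        return self.context.title


# Toy adapter registry standing in for a real adapter lookup.
_ADAPTERS = {NewsItem: NewsItemSearchableText, Image: ImageSearchableText}


def searchable_text(obj):
    """Return obj's indexable text, or None if no adapter is registered."""
    adapter_class = _ADAPTERS.get(type(obj))
    if adapter_class is None:
        return None
    return adapter_class(obj).getSearchableText()
```

With this shape the indexer only ever calls searchable_text(obj); how
the text is extracted stays in the per-type adapter.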

Janko Hauser replied to this:

> Uh, just from the current usage of Z2 indexing, I think it is not
> enough to use a standard method to get the full text indexed content
> of a document type. Think one wants to have two different indices,
> one for anonymous searches and another for management, where
> different versions of a document or document content parts are
> indexed. We also put keywords in such a full index, so we get all
> relevant documents with one search.

Can you elaborate on the use case of a separate full text index for
management?  If it's a matter of indexing a different set of objects,
that can be solved with a single interface, since the set of objects
indexed is orthogonal to the interface used to extract the text.  If
you actually mean that for management you want to index different
words, I'd like to understand your use case for that first.  (Later,
SteveA responded to this too; I've nothing to add there.)
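
To illustrate the "different set of objects, same interface" case: a
small sketch in which two indexes share one text-extraction function
and differ only in which objects they accept.  The dict-based
documents, the "state" key and all function names are assumptions
made up for this example.

```python
class TextIndex:
    """Toy inverted index; `accepts` decides which objects get indexed."""

    def __init__(self, extract, accepts):
        self._extract = extract
        self._accepts = accepts
        self._postings = {}

    def index(self, doc_id, obj):
        if not self._accepts(obj):
            return  # object is outside this index's set
        for word in self._extract(obj).lower().split():
            self._postings.setdefault(word, set()).add(doc_id)

    def search(self, word):
        return set(self._postings.get(word.lower(), ()))


def extract_text(obj):
    """Single extraction hook shared by both indexes."""
    return obj["text"]


def is_published(obj):
    return obj["state"] == "published"


def accept_all(obj):
    return True


# Same extraction, different object sets.
anonymous_index = TextIndex(extract_text, is_published)
management_index = TextIndex(extract_text, accept_all)

documents = {
    1: {"state": "published", "text": "Sprint results"},
    2: {"state": "draft", "text": "Sprint agenda"},
}
for doc_id, obj in documents.items():
    anonymous_index.index(doc_id, obj)
    management_index.index(doc_id, obj)
```

Indexing different words per audience would instead need a second
extraction function, which is the case the question above asks about.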

> I will add this, if the searching use cases are online, but wanted
> to mention it here.

Perhaps someone should start a search use cases wiki page?  I'll
gladly add to it.  I don't know what the customary procedure is.  So
far I've only thought about *writing* use cases, not about how to make
them available. :-)

--Guido van Rossum (home page: http://www.python.org/~guido/)