[Zope] How do I index documents containing accents with ZCatalog?

Michel Pelletier michel@digicool.com
Thu, 6 Jan 2000 14:18:04 -0500


> -----Original Message-----
> From: Farzad Farid [mailto:farzy@via.ecp.fr]
> 
> Does this mean that the ZCatalog search engine is still missing some 
> of the functionalities you can find in a search engine like htdig?

htdig does not allow you to search for partial terms efficiently.  In
fact, there are only two 'partial' search types in htdig, and each must
be chosen ahead of time, before you index your data:

substring 
  Matches all words containing the query as a substring. 
  Since this requires checking every word in the database, 
  this can slow down searches considerably. 

prefix 
  Matches all words beginning with the query strings. 
  Uses the option prefix_match_character to decide whether 
  a query requires prefix matching. For example "abc*" 
  would perform prefix matching on "abc" since * is 
  the default prefix_match_character. 

Note that substring matching is inefficient (it must check every word
in the vocabulary) and prefix matching is only partly useful.  See
http://www.htdig.org/attrs.html#search_algorithm
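
To make the cost difference concrete, here is a toy sketch in modern
Python (not htdig or ZCatalog code; the word list is made up) of the
two strategies against a sorted vocabulary:

  import bisect

  vocabulary = ['apple', 'bob', 'bobbin', 'bobby', 'carrot']

  def substring_search(vocab, term):
      # Must look at every word -- cost grows with the size of
      # the whole vocabulary.
      return [w for w in vocab if term in w]

  def prefix_search(vocab, prefix):
      # A sorted vocabulary allows a binary search to the first
      # candidate, then reading off neighbors -- much cheaper.
      i = bisect.bisect_left(vocab, prefix)
      matches = []
      while i < len(vocab) and vocab[i].startswith(prefix):
          matches.append(vocab[i])
          i += 1
      return matches

  print(substring_search(vocabulary, 'ob'))   # scans all five words
  print(prefix_search(vocabulary, 'bob'))     # touches only the 'bob...' run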

The partial searching I am working on will support the following query
forms:

  bob*
  *bob
  bob?
  ?bob

And more, but these 4 are guaranteed in the customer contract.
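
As a rough illustration of what those four forms mean, here is a plain
Python sketch using the standard fnmatch module (not the actual
ZCatalog code; the word list is made up):

  import fnmatch

  words = ['bob', 'bobby', 'bobs', 'abob', 'shishkabob']

  for pattern in ['bob*', '*bob', 'bob?', '?bob']:
      # * matches any run of characters, ? matches exactly one.
      matches = [w for w in words if fnmatch.fnmatchcase(w, pattern)]
      print(pattern, matches)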

Only the first form is supported by htdig.  I'm not knocking htdig;
it's an excellent indexer.  But it is just a straight indexer, and it
makes few concessions to partial searching (i.e. no implementation of
an n-gram type algorithm or similar).  It does, however, support
soundex and metaphone, which are non-wildcard 'partial searches';
those would not be hard to add to Zope once I'm done making the
changes for partial searching support.

> Another locale-related important feature is the ability to do searches
> ignoring the difference between accented and non accented
> letters. Suppose a document contains the word "édition", by typing
> "edition" I should be able to find the document, and vice versa. I
> tried and this feature does not seem to work right now.

If you set the locale to, say, French, then those two words are
different.  If you want the Catalog to do 'accent folding' then you
will need to subclass your own kind of lexicon and create your own
method to do that.  The problem here is that accents and the like are
all very language dependent; there is no way for software to know that
'e' with an accent and 'e' without are symbolically based on the same
letter.  To the computer they are just two completely separate
characters.  'Case folding' is simple because of the straight, linear,
one-to-one relationship between upper case and lower case letters;
there is no such relationship between accented and unaccented
characters.  Their relationship is purely human symbolic and possibly
one-to-many (e with no accent, e with the acute accent, e with the
grave accent, etc.).  Someone much more familiar with the vast
cornucopia of languages than I am would need to go through and build
these relationships by hand.
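
To give a feel for what such a hand-built relationship looks like,
here is a minimal sketch in modern Python; the FOLD table is a made-up
French-flavored subset, and fold_accents is a hypothetical helper, not
an actual lexicon API:

  # A hand-built, language-specific folding table (illustration only).
  FOLD = {'é': 'e', 'è': 'e', 'ê': 'e', 'ë': 'e',
          'à': 'a', 'â': 'a', 'î': 'i', 'ï': 'i',
          'ô': 'o', 'ù': 'u', 'û': 'u', 'ç': 'c'}

  def fold_accents(word):
      # Map each accented character to its unaccented base letter
      # before the word goes into the lexicon.
      return ''.join(FOLD.get(ch, ch) for ch in word)

  print(fold_accents('édition'))   # prints 'edition'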

> And is the ZCatalog implementation scalable? What happens if I try to
> index and search on a site containing hundreds of thousands of
> documents? 

No problem.  The ZCatalog uses very fast, efficient BTree data
structures written in C.  They scale easily into the hundreds of
thousands of documents, if your hardware can handle it.
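
For a feel of these structures from Python, here is a sketch using the
BTrees package that ships with Zope; note that the OOBTree module
spelling below matches later ZODB releases, so treat the exact import
as an assumption:

  from BTrees.OOBTree import OOBTree   # C-implemented object/object BTree

  index = OOBTree()
  index['edition'] = [1, 5, 9]   # e.g. word -> document ids
  index['editor'] = [2]

  # Lookups and ordered range scans stay cheap as the tree grows, and
  # only the tree buckets actually touched are loaded from the database.
  print(index['edition'])
  print(list(index.keys('edition', 'editor')))   # inclusive key range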

> Are the programs optimized not to use hundreds of Megs of
> memory? I've had a bad experience with swish++, which didn't scale
> when indexing 500000 text documents; it tried to use as much memory
> as it could allocate...
> On the other hand, htdig is well optimized from this point of view.

Memory consumption is a constant tradeoff against speed and features.
If you want partial searching, you will consume more memory (your
index vocabulary will grow larger).  ZCatalog is optimized all over
the place for memory consumption, but you will still see significant
memory usage; that's just the way indexers work if you want them to be
fast.

If you want to keep memory usage down, avoid mass indexing.  Do not
use the 'Find objects to Catalog' tab on the ZCatalog; create your own
types of objects that incrementally index themselves when they are
created, edited, or deleted.  Mass indexing balloons memory because
the entire index is, pretty much, loaded into memory all at once.
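
The incremental pattern looks roughly like this; catalog_object and
uncatalog_object are ZCatalog's single-object indexing methods, while
'catalog' and 'doc' are assumed names for your ZCatalog and the object
being edited:

  # When an object is created or edited, index just that one object,
  # keyed by its path, rather than re-indexing the whole site.
  uid = '/'.join(doc.getPhysicalPath())
  catalog.catalog_object(doc, uid)

  # When the object is deleted, remove just its entries.
  catalog.uncatalog_object(uid)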

The ZCatalog also uses subtransactions, which means that it can index
document corpuses much larger than available memory.  Of course, as
parts of the index get deactivated to the database and reactivated
from the database, indexing gets much slower.  Any indexing program
that is indexing a half-million documents will need hundreds of megs
of *something* to build and store all of those references.  ZCatalog
is nice in that it will use virtual memory if possible, and database
storage if necessary.
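
In code, a long indexing run with periodic subtransaction commits
looks roughly like this (a sketch: the loop, batch size, and names are
made up, and commit(1) is my recollection of the ZODB subtransaction
spelling, so treat it as an assumption):

  count = 0
  for doc in documents:
      catalog.catalog_object(doc, '/'.join(doc.getPhysicalPath()))
      count = count + 1
      if count % 100 == 0:
          # Subtransaction commit: finished index state can be written
          # out to the database and dropped from memory mid-run.
          get_transaction().commit(1)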

It's true that htdig is 'optimized' for this, but it's also true that
htdig is not a constantly running process like Zope is.  Between
queries, htdig keeps all of its index data and meta-data on disk.
Obviously, it is not consuming any memory at those times.  Of course,
you also pay quite a performance penalty with htdig every time you use
it, because a process must be spawned, data must be read from disk and
loaded into memory, etc.  Zope does not pay this penalty because it is
a long-running process.

-Michel