PositionIndex (was: Re: [Zope-dev] ZCatalog phrase indexing revisited)

Rik Hoekstra rik.hoekstra@inghist.nl
Mon, 18 Jun 2001 12:33:25 +0200


> 
> > Once you're satisfied with the implementation, would you be willing to
> > submit the module to the Collector?
> 
> Do you think you (or someone else for that matter) could have a look at
> [1] the method that returns the position in the document - positionInDoc()
> - to see how it could be made to run much faster?  Maybe it is how it is
> used...  It is too slow to be very useful when indexing large amounts of
> data.
> 
> Anyway, I suck at making Python fast (or at using it the right way,
> whichever I've fallen prey to this time ;-), and any hints would be greatly
> appreciated.
> 
> I've been indexing and searching a lot this weekend, and aside from that
> problem with indexing speed it seems OK, and I have no issues submitting it
> to the Collector.
> 
Doing something similar (in fact, what I needed was citations of word usage), I
took a two-step approach, with the idea that most of the actual returning of
results would then be done on a much smaller subset of documents than if you
had to index all documents with word indexes and positions.

I use a normal text index for querying. Then, for each document the query
returns, I start processing that document. This requires parsing the query in a
slightly different way (throw out the NOTs). The two-step approach has the
advantage that you can postpone processing the actual documents until you
return the results for those specific documents.
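In outline, the two steps could look something like the sketch below. This is
only a sketch: the catalog object, the 'text' index name and the brain calls
are assumptions standing in for a ZCatalog, and splitit() is the helper shown
further down.

def two_step_search(catalog, query, words):
        # Step 1: an ordinary text-index query.
        # Step 2: scan only the documents the index actually returned.
        results = []
        for brain in catalog.searchResults(text=query):
                doc = str(brain.getObject())   # fetch the full text per hit
                positions = {}
                for word in words:   # positive terms only (NOTs thrown out)
                        positions[word] = splitit(doc, word)
                results.append((brain, positions))
        return results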

Using your positionInDoc will require a _lot_ of processing (why does it use
string.split, by the way, and not Splitter? And why split on " " and not on
string.whitespace?). I have used string.find for finding word positions, which
is probably faster than looping over a list of words. BTW, I'd rather use
Splitter, but the word positions it returned appeared not to be reliable (a
bug, or something I didn't understand; anyhow, string.find works for me and is
fast):

import string

def splitit(txt, word):
        # Return the character positions of all occurrences of word in txt.
        positions = []
        start = 0
        while 1:
                res = string.find(txt, word, start)
                if res == -1:
                        break
                start = res + 1
                positions.append(res)
        return positions
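
For example (note that string.find also matches inside longer words, so 'the'
in 'there' would count as a hit too):

>>> splitit('the cat sat on the mat', 'the')
[0, 15]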


<sidenote>Using re would perhaps also be an option, but allowing regular
expressions will complicate searching a lot, so I use the globbing lexicon for
expanding and then do the matching on the expanded items (if necessary - not
if using [wordpart]*).</sidenote>
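
For what it's worth, the expand-then-match step could look something like
this; fnmatch merely stands in for whatever the globbing lexicon does
internally, and lexicon_words is a hypothetical list of the known vocabulary:

import fnmatch

def positions_for_pattern(txt, pattern, lexicon_words):
        # Expand a globbed pattern such as 'wordpart*' against the known
        # vocabulary, then locate each expanded word with splitit().
        hits = {}
        for word in lexicon_words:
                if fnmatch.fnmatch(word, pattern):
                        found = splitit(txt, word)
                        if found:
                                hits[word] = found
        return hits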

Advantages of using this approach:
- it's faster.
- it splits the query processing into different subparts, which also
contributes to speeding things up.
- it's also more flexible, as you can divide searching and parsing over
different web requests, and even make them dependent on the number of results.
For example: why return text fragments from all documents if your users will
not be able to see all the results anyway? Or why return all fragments
containing word combinations from one single document when returning a few
occurrences from different documents is more useful for your users? Note that
this will mainly affect returning text fragments, which may or may not be
useful (see the fragment sketch below).
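
A sketch of that capping idea (fragments() is a hypothetical helper, not part
of the index); it returns a little context around the first few positions
found by splitit() and simply ignores the rest:

def fragments(txt, positions, width=30, maxfrags=3):
        # Return at most maxfrags snippets of context around the first
        # found positions, rather than one snippet per occurrence.
        out = []
        for pos in positions[:maxfrags]:
                out.append('...' + txt[max(0, pos - width):pos + width] + '...')
        return out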

There are also a couple of disadvantages (as I see them, but there may be
more):
- it only works with exact character positions, not word numbers in a text.
The within-two-words approach may be remedied by using string.split on
substrings, however, if really needed (see the sketch after this list).
Depending on your purposes, an even rougher approach is to assume some default
length for words (this is a bit faster). These are not very elegant solutions,
though.
- because this approach is not so tightly coupled to the (Z)Catalog,
integration strategies are less obvious (at least for me).
- the positionIndex might be used for further processing as is; in my approach
this is less obvious.
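
The string.split remedy could be sketched like this (words_between is a
hypothetical helper): convert two character positions into a word distance by
splitting the substring between them.

import string

def words_between(txt, pos1, pos2):
        # Rough word distance between two character positions; the slice
        # still contains the word at the lower position, hence the -1.
        lo, hi = min(pos1, pos2), max(pos1, pos2)
        if lo == hi:
                return 0
        return len(string.split(txt[lo:hi])) - 1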


another 2 cents

Rik