PositionIndex (was: Re: [Zope-dev] ZCatalog phrase indexing revisited)

Chris McDonough chrism@digicool.com
Sun, 17 Jun 2001 15:57:20 -0400


On Sun, 17 Jun 2001 21:05:47 +0200 (CEST)
 Erik Enge <erik@thingamy.net> wrote:
> On Fri, 15 Jun 2001, Chris McDonough wrote:
> 
> > Once you're satisfied with the implementation, would you be willing
> > to submit the module to the collector?
> 
> Do you think you (or someone else for that matter) could have a look at
> [1] the method that returns the position in the document - positionInDoc()
> - to see how that could be made to run much faster?  Maybe it is how it is
> used...  It is too slow to be very useful when indexing large amounts of
> data.

Erik,

It looks like you call proximityInsert for each item
returned from the splitter on the doc source.  Instead of
finding the position by splitting the source again inside
proximityInsert (which redoes the whole split for every
word), you can keep a simple counter while you iterate over
the splitter's output in index_object.  The splitter returns
all the words in order, dupes included, so the counter is
the word's position.  As you iterate, you update the
position entry for that word/documentId pair within
proximityInsert.  You never need to split the document
source by hand; just rely on the splitter to bust up the doc
and manipulate the position list in place.  This is not the
most efficient way, but it's more efficient than your
current way.

Therefore, the bit in index_object becomes:

i = 0
for word in splitter(source):
    self.proximityInsert(word, documentId, i)
    i = i + 1

The proximityInsert method becomes:

def proximityInsert(self, word, documentId, i):
    """Insert proximity information about this wid (word id)
    in the index's proximity bucket."""
    wid = self.getWid(word)
    prox = self._proximity
    if not prox.has_key(wid):
        prox[wid] = IOBTree()
    docs = prox[wid]
    # The wid may already exist for other documents but not this
    # one, so don't assume docs[documentId] is there.
    positions = docs.get(documentId, [])
    if i in positions:
        return
    positions.append(i)
    # Store the list back so the IOBTree sees the change.
    docs[documentId] = positions
    self._p_changed = 1

... and the positionInDoc method goes away.
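
For illustration only, here is a self-contained sketch of the same
counter idea that runs outside Zope.  It is not the PositionIndex code:
plain dicts stand in for the IOBTree buckets, str.split stands in for
the splitter, and words are used directly instead of wids (all
assumptions).  phrase_positions is a hypothetical helper showing one
way the stored positions could answer a phrase query.

```python
# Stand-in for the index: plain dicts replace the IOBTree buckets and
# str.split replaces the Zope splitter (both are assumptions).

def index_positions(proximity, document_id, source):
    """Record the position of every word of `source` in `proximity`.

    `proximity` maps word -> {document_id: [positions]}.
    """
    # enumerate() plays the role of the manual counter `i`.
    for i, word in enumerate(source.lower().split()):
        docs = proximity.setdefault(word, {})
        positions = docs.setdefault(document_id, [])
        if i not in positions:  # keeps re-indexing the same doc harmless
            positions.append(i)


def phrase_positions(proximity, document_id, phrase):
    """Return the start positions where the words of `phrase` occur
    consecutively in the given document (hypothetical helper)."""
    words = phrase.lower().split()
    if not words:
        return []
    first = proximity.get(words[0], {}).get(document_id, [])
    hits = []
    for start in first:
        # Word k of the phrase must sit at position start + k.
        if all(start + k in proximity.get(w, {}).get(document_id, [])
               for k, w in enumerate(words)):
            hits.append(start)
    return hits


proximity = {}
index_positions(proximity, 1, "the quick fox saw the lazy fox")
print(proximity["fox"][1])                              # [2, 6]
print(phrase_positions(proximity, 1, "the lazy fox"))   # [4]
```

The document is split exactly once, and each word costs one dictionary
lookup plus a membership check on its own position list.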

I didn't scan too hard for what else in the source this
would break.

> Anyway, I suck at making Python fast (or using it the right way,
> whichever I've fallen prey to this time ;-), and any hints would be
> greatly appreciated.
> 
> I've been indexing and searching a lot this weekend, and bar that problem
> with the indexing speed it seems ok and I have no issues submitting it to
> the Collector.

Cool...

> 
> [1] <URL:http://nittin.net/erik/software/PositionIndex/PositionIndex.py>
>