[Zope3-dev] Sprintathon diary

Fri, 6 Dec 2002 00:04:03 -0500

On Thursday 05 December 2002 03:54 pm, Guido van Rossum wrote:
> Here are my notes from Tuesday through Thursday (focusing mostly on
> Tue-Wed).  Enjoy!
>
[snip]
> I still believe that stop word removal
> is a bad idea, so there's no point in optimizing it, and for now the
> Python version of the Okapi inner loop is fine.  Adding machinery for
> building C extensions doesn't seem worth it now.

I think indexing stop words is fine. However, at query time they are a
liability. Google uses the strategy of indexing stop words but removing them
from the query *unless* the query contains only stop words.

This would require a slightly different lexicon pipeline for indexing than
for querying, or perhaps just a different stopper implementation that can be
selectively turned on and off when processing text in the lexicon.
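
To make that concrete, here is a minimal sketch of the "selectively turned
on and off" stopper idea. Everything in it (STOP_WORDS, QueryStopper,
query_terms) is a hypothetical name for illustration, not the actual lexicon
pipeline API, and the stop word list is just an example:

```python
# Hypothetical names throughout; this is not the real lexicon pipeline API.
STOP_WORDS = frozenset({"a", "an", "and", "of", "or", "the", "to"})


class QueryStopper:
    """Pipeline element that drops stop words only while enabled."""

    def __init__(self, stop_words=STOP_WORDS, enabled=True):
        self.stop_words = stop_words
        self.enabled = enabled

    def process(self, terms):
        if not self.enabled:
            # Indexing mode: keep every term, stop words included.
            return list(terms)
        return [t for t in terms if t not in self.stop_words]


def query_terms(terms, stopper):
    """Drop stop words from a query unless nothing else would be left."""
    kept = stopper.process(terms)
    # A query made up only of stop words is passed through untouched.
    return kept if kept else list(terms)
```

Indexing would run the pipeline with the stopper disabled, and querying
would go through query_terms(), so a query such as ["to", "the"] still
reaches the index.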
> The TextIndex() class from mhindex.py was promoted, with small
> modifications, to a "wrapper" class in the main TextIndex package.
> The modifications were to support batching more directly, by giving
> query() a start and count argument, and scaling the results to a float
> in the range [0.0, 1.1].  This can't be done as a subclass, because
> the actual indexing class may be configurable: OkapiIndex or
> CosineIndex.  This let me clean up the public API quite a bit.

I'm worried that this might make the index a bit too smart and high-level,
but I haven't looked at the code yet either. How would this work if you were
intersecting a text search with a set from another non-text index (like a
date range) or sorting using another index? Or maybe you're not thinking
about that yet...

> [Tuesday after dinner]
>
> One unit test caused problems: it was testing locale-awareness by
> setting a specific locale that apparently doesn't work everywhere, and
> there was a complaint on the Zope3-dev list after dinner.  I guess
> Zope 3 has a much more varied user base than Zope 2; nobody ever
> complained about this for Zope 2, even though it's the same test
> there.  I had fixed this for Mac OSX with a platform check, but a
> better fix is to simply skip the test (silently) when the setlocale()
> call fails.

I just added this test and locale support the other day (I tried it on
Linux, FreeBSD and Windows), so the lack of complaints isn't entirely
surprising to me ;^)

I thought about skipping the test, but I wasn't sure of the best way to do
this, so I figured if it failed somewhere, somebody would point it out and I
would worry about it then.
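
Something along these lines is probably what skipping would look like. This
is only a sketch: locale_available() and LocaleSplitterTest are hypothetical
names, the locale name is just an example, and it uses unittest's
skipTest(), which reports the skip rather than skipping silently:

```python
import locale
import unittest


def locale_available(name):
    """Return True if setlocale() accepts `name`; the old locale is restored."""
    saved = locale.setlocale(locale.LC_ALL)
    try:
        locale.setlocale(locale.LC_ALL, name)
    except locale.Error:
        return False
    finally:
        locale.setlocale(locale.LC_ALL, saved)
    return True


class LocaleSplitterTest(unittest.TestCase):
    # Example locale only; pick whatever the real test needs.
    LOCALE_NAME = "de_DE.ISO8859-1"

    def setUp(self):
        if not locale_available(self.LOCALE_NAME):
            self.skipTest("locale %r not available" % self.LOCALE_NAME)
        saved = locale.setlocale(locale.LC_ALL)
        # Restore the previous locale even if the test fails.
        self.addCleanup(locale.setlocale, locale.LC_ALL, saved)
        locale.setlocale(locale.LC_ALL, self.LOCALE_NAME)

    def test_locale_is_active(self):
        # Placeholder; the real test would exercise the splitter here.
        self.assertTrue(locale.setlocale(locale.LC_CTYPE))
```

On a platform without that locale the test shows up as skipped instead of
failing.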
> Christian Zagrodnick pointed out a bug in the NBest calculation
> during the writing of the unit tests (which I initially papered over
> by changing the test): when two documents have exactly the same score
> (as they did in the first version of the unit test) a batch of size 2
> returns them in reverse order relative to the input, while a batch of
> size 1 returns the first one.  That breaks my batching algorithm.
> Fortunately I found out that this can be fixed by using bisect_left
> instead of bisect_right!  I had been speculating that it could be
> fixed by changing one comparison from e.g. < to <= or vice versa, and
> that's pretty much what this does.  Now docids with equal score are
> always returned in the original sequence order, independent of batch
> size.

That's good to know since I just put NBest into the Zope 2 ZCatalog sort
algorithm...
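
For my own benefit, here is a simplified sketch of how I understand the
tie-breaking issue. It is not the actual NBest code; n_best() and its locate
parameter are made up for illustration, and it assumes scores are kept in
ascending order and reversed at the end:

```python
from bisect import bisect_left, bisect_right


def n_best(pairs, n, locate=bisect_left):
    """Return the n highest-scoring (docid, score) pairs, best first.

    Scores are kept in ascending order and the result is reversed at the
    end, so where an equal score gets inserted decides the order of ties.
    """
    if n <= 0:
        return []
    scores, docids = [], []
    for docid, score in pairs:
        if len(scores) >= n and score <= scores[0]:
            continue  # not better than the current worst; later ties lose
        i = locate(scores, score)
        scores.insert(i, score)
        docids.insert(i, docid)
        if len(scores) > n:
            # Over capacity: drop the current worst entry.
            del scores[0], docids[0]
    return list(zip(reversed(docids), reversed(scores)))
```

With the pairs [("a", 1.0), ("b", 1.0), ("c", 0.5)], bisect_left gives a, b
for a batch of 2 and a for a batch of 1, while bisect_right gives b, a for a
batch of 2, which is the inconsistency described above.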
> Lesson learned: add more unit tests.  We'll learn that over and over.
>
> To make sure it really worked, I banged on mhindex.py until I could
> index my inbox and query against it.  This was pretty straightforward
> -- the only thing that took time was to figure out where
> get_transaction() was.  It must now be imported from Transaction.
>
> [Wednesday morning]
>
> Overnight, I had figured out that there was still a bug in the Unicode
> support of the splitter pipeline, which could be fixed by using (?u)
> instead of (?L) -- i.e. re.UNICODE instead of re.LOCALE.  Christian Z
> helped me produce a nice Unicode unit test using \N{...}.

We should backport this to the Zope 2(.7) ZCTextIndex.
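
The gist of the change, as I understand it, is in this sketch. The names
WORD_RE, ASCII_WORD_RE and split() are hypothetical, not the real splitter,
and re.ASCII is only standing in here to show what a non-Unicode \w does to
accented words:

```python
import re

# (?u) is re.UNICODE: \w matches letters and digits from any script.
# In Python 3 this is already the default for str patterns; the flag is
# spelled out only to mirror the fix described above.
WORD_RE = re.compile(r"(?u)\w+")

# Stand-in for a splitter that does not know about Unicode letters.
ASCII_WORD_RE = re.compile(r"\w+", re.ASCII)


def split(text, pattern=WORD_RE):
    """Split text into a list of words using the given pattern."""
    return pattern.findall(text)


# \N{...} names the characters, so the source file itself stays ASCII.
SAMPLE = (
    "Z\N{LATIN SMALL LETTER U WITH DIAERESIS}rich "
    "caf\N{LATIN SMALL LETTER E WITH ACUTE}"
)
```

split(SAMPLE) keeps both words whole, whereas split(SAMPLE, ASCII_WORD_RE)
breaks them apart at the accented letters.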

[snip amazing detail, puff...puff...]

Thanks for the great report. I've been knee-deep back home in ZCatalog and
it's nice to see the next generation shaping up. OTOH nobody can tell me that
ZCatalog is complicated (in code or UI) compared to this (well, you can but I
won't believe you ;^)...

-Casey