[Zope3-dev] KeywordIndex

Mon Jul 18 11:42:36 EDT 2005

On Jul 18, 2005, at 11:14 AM, Jeff Shell wrote:

> I'm working on a simple application which is the first time I get to
> use the catalog in Zope 3. I'm writing against Zope 3.1b1. I was
> dismayed not to see KeywordIndex in the main catalog set, but then I
> found it in zope.index.keyword. But it seems to be a bit behind.

Hi.  Yes, we needed it too.  Here's another thing we want to open  
source.  Look at the attached .txt file; if you want it then tell me  
and I'll make it available in a sandbox.  We'll move it over into the  
Zope repo (probably with a new name, or rearranged on the appropriate  
locations (zope.index and zope.app.catalog, etc.) RSN.

Downsides:

- Note that some functionality requires that you use an extent  
catalog, another goodie in the package.

- We have some refactoring of this that we want to do.  We'll have  
legacy issues ourselves, then.

Additional upside:

-  This package also includes a replacement for the field index  
(called a value index) and customizations of the value and set  
indexes specific to timezone-aware datetimes, as well as a few other  
things.

Gary

-------------- next part --------------
The setindex is an index similar to, but more general than a traditional
keyword index.  The values indexed are expected to be iterables; the index
allows searches for documents that contain any of a set of values; all of a set
of values; or between a set of values.

Additionally, the index supports an interface that allows examination of the
indexed values.

It is as policy-free as possible, and is intended to be the engine for indexes
with more policy, as well as being useful itself.

On creation, the index has no wordCount, no documentCount, and is, as
expected, fairly empty.

    >>> from zc.catalog.index import SetIndex
    >>> index = SetIndex()
    >>> index.documentCount()
    0
    >>> index.wordCount()
    0
    >>> index.maxValue() # doctest: +ELLIPSIS
    Traceback (most recent call last):
    ...
    ValueError:...
    >>> index.minValue() # doctest: +ELLIPSIS
    Traceback (most recent call last):
    ...
    ValueError:...
    >>> list(index.values())
    []
    >>> len(index.apply({'any_of': (5,)}))
    0

The index supports indexing any value.  All values within a given index must
sort consistently across Python versions.  In our example, we hope that strings
and integers will sort consistently; this may not be a reasonable hope.

    >>> data = {1: ['a', 1],
    ...         2: ['b', 'a', 3, 4, 7],
    ...         3: [1],
    ...         4: [1, 4, 'c'],
    ...         5: [7],
    ...         6: [5, 6, 7],
    ...         7: ['c'],
    ...         8: [1, 6],
    ...         9: ['a', 'c', 2, 3, 4, 6,],
    ... }
    >>> for k, v in data.items():
    ...     index.index_doc(k, v)
    ...

After indexing, the statistics and values match the newly entered content. 

    >>> list(index.values())
    [1, 2, 3, 4, 5, 6, 7, 'a', 'b', 'c']
    >>> index.documentCount()
    9
    >>> index.wordCount()
    10
    >>> index.maxValue()
    'c'
    >>> index.minValue()
    1
    >>> list(index.ids())
    [1, 2, 3, 4, 5, 6, 7, 8, 9]

The index supports five types of query.  The first is 'any_of'.  It
takes an iterable of values, and returns an iterable of document ids that
contain any of the values.  The results are weighted.

    >>> list(index.apply({'any_of':('b', 1, 5)}))
    [1, 2, 3, 4, 6, 8]
    >>> list(index.apply({'any_of': ('b', 1, 5)}))
    [1, 2, 3, 4, 6, 8]
    >>> list(index.apply({'any_of':(42,)}))
    []
    >>> index.apply({'any_of': ('a', 3, 7)})
    BTrees._IFBTree.IFBucket([(1, 1.0), (2, 3.0), (5, 1.0), (6, 1.0), (9, 2.0)])

Another query is 'qny', If the key is None, all indexed document ids with any
values are returned.  If the key is an extent, the intersection of the extent
and all document ids with any values is returned.

    >>> list(index.apply({'any': None}))
    [1, 2, 3, 4, 5, 6, 7, 8, 9]

    >>> from zc.catalog.extentcatalog import FilterExtent
    >>> extent = FilterExtent(lambda extent, uid, obj: True)
    >>> for i in range(15):
    ...     extent.add(i, i)
    ...
    >>> list(index.apply({'any': extent}))
    [1, 2, 3, 4, 5, 6, 7, 8, 9]
    >>> limited_extent = FilterExtent(lambda extent, uid, obj: True)
    >>> for i in range(5):
    ...     limited_extent.add(i, i)
    ...
    >>> list(index.apply({'any': limited_extent}))
    [1, 2, 3, 4]

The 'contains_all' argument also takes an iterable of values, but returns an
iterable of document ids that contains all of the values.  The results are not
weighted.

    >>> list(index.apply({'all_of': ('a',)}))
    [1, 2, 9]
    >>> list(index.apply({'all_of': (3, 4)}))
    [2, 9]

The 'between' argument takes from 1 to four values.  The first is the 
minimum, and defaults to None, indicating no minimum; the second is the 
maximum, and defaults to None, indicating no maximum; the next is a boolean for
whether the minimum value should be excluded, and defaults to False; and the
last is a boolean for whether the maximum value should be excluded, and also
defaults to False.  The results are weighted.

    >>> list(index.apply({'between': (1, 7)}))
    [1, 2, 3, 4, 5, 6, 8, 9]
    >>> list(index.apply({'between': ('b', None)}))
    [2, 4, 7, 9]
    >>> list(index.apply({'between': ('b',)}))
    [2, 4, 7, 9]
    >>> list(index.apply({'between': (1, 7, True, True)}))
    [2, 4, 6, 8, 9]
    >>> index.apply({'between': (2, 6)})
    BTrees._IFBTree.IFBucket([(2, 2.0), (4, 1.0), (6, 2.0), (8, 1.0), (9, 4.0)])

The 'none' argument takes an extent and returns the ids in the extent
that are not indexed; it is intended to be used to return docids that have
no (or empty) values.

    >>> list(index.apply({'none': extent}))
    [0, 10, 11, 12, 13, 14]

Trying to use more than one of these at a time generates an error.

    >>> index.apply({'all_of': (5,), 'any_of': (3,)})
    ... # doctest: +ELLIPSIS
    Traceback (most recent call last):
    ...
    ValueError:...

Using none of them simply returns None.

    >>> index.apply({}) # returns None

Invalid query names cause ValueErrors.

    >>> index.apply({'foo':()})
    ... # doctest: +ELLIPSIS
    Traceback (most recent call last):
    ...
    ValueError:...

When you unindex a document, the searches and statistics should be updated.

    >>> index.unindex_doc(6)
    >>> len(index.apply({'any_of': (5,)}))
    0
    >>> index.documentCount()
    8
    >>> index.wordCount()
    9
    >>> list(index.values())
    [1, 2, 3, 4, 6, 7, 'a', 'b', 'c']
    >>> list(index.ids())
    [1, 2, 3, 4, 5, 7, 8, 9]

Reindexing a document that has new additional values also is reflected in 
subsequent searches and statistic checks.

    >>> data[8].extend([5, 'c'])
    >>> index.index_doc(8, data[8])
    >>> index.documentCount()
    8
    >>> index.wordCount()
    10
    >>> list(index.apply({'any_of': (5,)}))
    [8]
    >>> list(index.apply({'any_of': ('c',)}))
    [4, 7, 8, 9]

The same is true for reindexing a document with both additions and removals.

    >>> 2 in set(index.apply({'any_of': (7,)}))
    True
    >>> 2 in set(index.apply({'any_of': (2,)}))
    False
    >>> data[2].pop()
    7
    >>> data[2].append(2)
    >>> index.index_doc(2, data[2])
    >>> 2 in set(index.apply({'any_of': (7,)}))
    False
    >>> 2 in set(index.apply({'any_of': (2,)}))
    True

Reindexing a document that no longer has any values causes it to be removed
from the statistics.

    >>> del data[2][:]
    >>> index.index_doc(2, data[2])
    >>> index.documentCount()
    7
    >>> index.wordCount()
    9
    >>> list(index.ids())
    [1, 3, 4, 5, 7, 8, 9]

This affects both ways of determining the ids that are and are not in the index
(that do and do not have values).

    >>> list(index.apply({'any': None}))
    [1, 3, 4, 5, 7, 8, 9]
    >>> list(index.apply({'none': extent}))
    [0, 2, 6, 10, 11, 12, 13, 14]

The values method can be used to examine the indexed values for a given 
document id.

    >>> set(index.values(doc_id=8)) == set([1, 5, 6, 'c'])
    True

And the containsValue method provides a way of determining membership in the
values.

    >>> index.containsValue(5)
    True
    >>> index.containsValue(20)
    False