[ZCM] [ZC] 1815/28 Comment "ZCTextIndex: index_object() can no longer handle attributes returning lists of strings"

Tue Jul 12 06:22:40 EDT 2005

Issue #1815 Update (Comment) "ZCTextIndex: index_object() can no longer handle attributes returning lists of strings"
 Status Pending, Catalog/bug+solution medium
To followup, visit:
  http://www.zope.org/Collectors/Zope/1815

==============================================================
= Comment - Entry #28 by sacco on Jul 12, 2005 6:22 am

...and also applies cleanly to 2.7.7-final.  What's happening guys?
________________________________________
= Comment - Entry #27 by yuppie on Jul 4, 2005 2:53 pm

I'll merge r30995 into the Zope-2_8-branch. (My last change broke the patch and I did have to resolve the conflict on the trunk anyway.)
________________________________________
= Comment - Entry #26 by ajung on Jul 4, 2005 2:00 pm

Applied patch to the SVN trunk but it does work witih the 2.8 branch....please submit a fixed version for the 2.8 branch.

________________________________________
= Comment - Entry #25 by sacco on Jul 4, 2005 12:23 pm

...and to 2.7.7-b1
________________________________________
= Comment - Entry #24 by sacco on Jul 4, 2005 12:00 pm

... forgot to mention, the patch applies cleanly to the current release: Zope-2.8.0-final

________________________________________
= Comment - Entry #23 by sacco on Jul 4, 2005 11:35 am

Uploaded:  "ZCTextIndex.patch"
 - http://www.zope.org/Collectors/Zope/1815/ZCTextIndex.patch/view

________________________________________
= Comment - Entry #22 by sacco on Jul 4, 2005 11:33 am

ok.
The next Followup will be a patch intended to address the issues raised here.

The changes are relatively minor --- to retain something of the 'spirit' of the previous version, I've opted for the single call mentioned in Entry #15, but the patched version includes a comment indicating how the other alternative could be chosen (do feel free to remove the comment if it looks too much like clutter!).

I've also added to the docstring of  index_doc() , as suggested by tim_one .

As Andreas notes in Entry #17, the unit tests are somewhat mixed together here, so I have limited the patch to using a slightly modified copy of the test for multiple attributes to provide basic verification of the desired functionality.

________________________________________
= Comment - Entry #21 by sacco on Jul 1, 2005 5:16 am

Any opinions/replies to Enty #15 , or shall I just post a cleaned up diff of what I'm using here?

________________________________________
= Comment - Entry #20 by tim_one on Jun 29, 2005 11:29 am

OK, so there is no formal interface (apart from the one ZCTextIndex introduced) -- your base case is other implementations of indexing.  That's fine, now that I understand what you're talking about.

I already explained ZCTextIndex's intents in comment #8, so won't repeat that again (but will repeat, with greater urgency now that I know there is no other formal interface, that whatever gets decided here should get spelled out explicitly in index_doc's interface docstring).
________________________________________
= Comment - Entry #19 by ajung on Jun 29, 2005 11:05 am

TextIndex, TextIndexNG and the TextIndex of ManageableIndexes all expect a string or a unicode string but nothing else. This is basicially what I understand of a text index (and obviously the authors of these indexes).  Since you haven't written any unittests
for sequence of strings it is not obvisious that accepting a sequence of string was intentional. 
________________________________________
= Comment - Entry #18 by tim_one on Jun 29, 2005 10:59 am

Andreas, when you say "The extension of accepting sequences of strings ...", it's an extension with respect to WHAT?  I apologize if I'm being dense, but if there isn't a formal interface defining the indexing API, I have no idea what your base case is.  What is it, specifically, that ZCTextIndex extended?

You must know a lot more about the history here than I do, and I suspect that's why I have no idea what you're talking about.  I did some work on ZCTextIndex, but that's where my history started -- I had no awareness at the time that ZCTextIndex was extending anything.  It was a from-scratch project when I worked on it.
________________________________________
= Comment - Entry #17 by ajung on Jun 29, 2005 10:38 am

The extension of accepting sequences of strings should have been tested and manifested as intended by unittests. Since my changes did not break existing unittests I could not recognize any problem - especially ZCTextIndex has a lot of unittests testing all and everything.  So if you are free to fix it but you are far away from having convinced me. To bring it to the point: indexes provding the same functionality should behave in the same way - especially when you want to replace one by another. So I still consider this behaviour stupid.
________________________________________
= Comment - Entry #16 by tim_one on Jun 29, 2005 10:26 am

[Andreas]
> I am pretty much against the "freedom" to add
> additional behaviour for indexes.

Andreas, where can I find the interface for indexes?  Or isn't there one?  If there isn't, then we're just talking about code's observed behavior, and AFAICT the only methods with name `index_doc` in the Zope distribution come from ZCTextIndex, and all of them acted the same way.

As noted before, ZCTextIndex does define formal interfaces, but its `index_doc` definition is silent about the point in question (although related methods' interface definitions are clear).

Unless you have some other formal interface in mind, I don't know what the "base case" might be when you talk about "adding" additional behavior.  That is, adding with respect to what?  If there is a formal interface, then the answer to that would be clear.
________________________________________
= Comment - Entry #15 by sacco on Jun 29, 2005 9:57 am

> = Comment - Entry #13 by ajung on Jun 29, 2005 9:29 am
> 
> I won't hinder you to fix it (I won't fix it) and re-introduce 
> the desired behaviour. I still consider the desired behaviour a 
> flaw. For me the current implementation is current since the 
> unittests pass.  If you fix it yourself then don't forget a
> unittest for the *new* behaviour.

OK.  Two questions:

1)  Would you consider it better to make a single call to  index_doc()  
with the combined values of all the attributes to be indexed, 
or to make a separate call for each attribute.
(i.e. what is your position on what I described as Issue 2 A) in Comment entry # 12 vs. saving some function calls?)?

I tend towards the former (i.e. saving the function calls), but don't have a strong preference.

2) Do we prefer unified or context diffs?  Do I post the diff here?
(I'm new here).

Thanks,
tim

________________________________________
= Comment - Entry #14 by sacco on Jun 29, 2005 9:38 am

Issue 2:
Changed semantics of the function index_object()

Aspect B:

The values of the attributes to be indexed are now combined into a single string: to me this _does_ appear to be a serious probem.

The function of a splitter is to divide its input into terms to be indexed, and this splitting is performed on the basis of some inbuilt semantic knowledge of the type of text under consideration.

In the simplest case this 'knowledge' is just that the splitter assumes that the terms to be indexed are words divided by white space (and so just crudely splits the strings on white space), but it should be easy to construct better ways of splitting on a more sophisticated basis.
For example a better splitter might look for compound terms (possibly from a pre-compiled lexicon), and so be able to generate an index entry for the term "white space" separate from lone occurrences of the words "white" and "space".

( Conversely, if we wish to combine text in this context we really require a specific Joiner adjoint to the specific splitter. (I won't explain this further as it's not a crtical part of the argument here, but if the comment isn't clear, consider why the existing patch joins with a space ' '.join() and not with an 
empty string ''.join() , or whether joining on a space would be appropriate when indexing a repository of python code... )

The problem, however, is not just that a crude ' '.join() may not respect the semantic basis on which the (unknown) splitter operates, but that it also changes the semantics of the  index_text()  function in a way which can cause the generation of garbage with any but the crudest splitters.

For example, the  fulltext  attribute of an article on simian literacy in the Zope zoo might return a list containing the title and text of an article thus:

["Training the Monkey", 
"Patching holes in the fence against marauding wildebeeste is Arnold Schwazenegger's day-job, 
but in the evenings he relaxes by teaching the works of Fletcher and Marlowe to baboons.  
Even his best friends think he's potty."]

If these two strings are (as they currently are) joined before being passed to  index_doc() , and even the most inteligent term splitter would now have no way to avoid generating a totally spurious reference to a non-existent occurence of the term 
"monkey patching".

(For those who are really not happy with the idea of an attribute returning a list of strings,  assume these are two attributes, 'title' and 'abstract', each returning one string: the situation is now even worse, as the spurious reference could be to either of "monkey patching", or "potty training", depending on the (possibly 
rather random and certainly implementation dependent) order in which the attributes are considered.)

________________________________________
= Comment - Entry #13 by ajung on Jun 29, 2005 9:29 am

I won't hinder you to fix it (I won't fix it) and re-introduce the desired behaviour. I still consider the desired behaviour a flaw. For me the current implementation is current since the unittests pass.  If you fix it yourself then don't forget a unittest for the *new* behaviour.

________________________________________
= Comment - Entry #12 by sacco on Jun 29, 2005 9:29 am

Issue 2: 
The last patch changes the semantics of the function  index_object()

There are two aspects to this:

A)

As I hinted in my first message on the subject:
("...unless there was some other reason why the original _index_object was used to make each call to  index_doc()  separately"),
the values of all attributes to be indexed are now combined and passed together in a single call to index_doc() .

Does this matter?  I think probably not, but it is certainly possible to imagine a processing pipeline which considers each term in the context  (now different) of the combined input which it has been passed, so there is an issue here to consider and to document.

________________________________________
= Comment - Entry #11 by sacco on Jun 29, 2005 9:26 am

There are two other issues here:

1)  Efficiency:

The argument to  index_doc()  is to be processed by a Splitter (which, as tim_one has noted, will accept a list of strings), the function of which is to divide its input into a list of terms to be indexed.

What is the sense of taking a list of strings (in which some high-level divisions are already present) and (wasting effort) joining them together, thus forcing the splitter to re-find (among others) the divisions that were already present in the original list?

________________________________________
= Comment - Entry #10 by sacco on Jun 29, 2005 9:15 am

index_doc() _does_ accept a list of strings, and, as far as I can tell, always has.

This means that joining the arguments passed to it into a single string (and this join is responsible for half the problems breaking other products) is unnecessary, and that the ' '.join() cannot achieve anything useful.  (Flattening, if necessary, the list of values into a single level list would be more useful.)

On the other hand, this is not a case of "freedom" to add additional behaviour, but rather a case of removing a reasonable behaviour which has already been in use since at least a couple of years ago.

index_object()  has a docstring which descibes it as simply a 
"wrapper to handle indexing of multiple attributes":
it's function is therefore just to collect the values of the attributes to be indexed and to pass them to  index_doc()

Indeed, the former version of  index_object()  was clearly intended (but for a bug) to do exactly this, and was completely agnostic about the types of the attributes.  (This is not necessarily the approach I would choose, but does seem very "pythonic".)
I'm certainly not suggesting that index_object() shouldn't be allowed to do anything intelligent with the attribute values (or even check that they are in an acceptable form for index_doc() ).

It is surely, however, not desirable for a simple wrapper function to disallow values which are perfectly acceptable (and indeed normal) for the function it wraps, especially when this breaks compatibility with reasonably mature Products which depend on using such arguments .

________________________________________
= Comment - Entry #9 by ajung on Jun 29, 2005 3:00 am

I am pretty much against the "freedom" to add additional behaviour for indexes. Text indexes should index text. Any text index should behave in the same way especially when you want to exchange an text index implementation with another one. Since we have at least four different text index implementations in the Zope world, the implementation should behave in the same way *and* applications should *not* depend on some behaviour extension.

________________________________________
= Comment - Entry #8 by tim_one on Jun 28, 2005 12:29 pm

These things are certainly true:

1. mhindex.py does pass a list of strings to index_doc().

2. Guido wrote mhindex.py, and indeed used ZCTextIndex to index and search his own email (mh was his email system at the time).

3. Guido intended that index_doc() accept a string or a list of strings.  The code in mhindex.py illustrates why:  sometimes a list of strings is the _natural_ outcome from parsing a document, and it's much more expensive (in both time and space) to force the user to concatenate all the pieces into a single giant string.  So this is at least as much for efficiency as for convenience.

OTOH, it's also true that ZCTestIndex's IIndex interface doesn't really say anything about the expected type of index_doc's `text` argument.  The interface should be clarified no matter how this is resolved.

Note that ZCTextIndex's ILexicon interface is very clear that the `text` argument in sourceToWordIds(text) is "either a string or a list of strings".  Since the only callers of sourceToWordIds() are methods like index_doc(), it's most reasonable to conclude from that too that index_doc's `text` argument was intended to be the same.

So I think it would be best to restore that ZCTextIndex's index_doc() accepts a list of strings, and that the ZCTextIndex's IIndex interface be made clearer in any case.
________________________________________
= Comment - Entry #7 by sacco on Jun 28, 2005 11:58 am

> I could not see at the first glance where lists of strings are 
> used in tests. Since all tests pass I assume that the intended 
> behaviour of the API is working and tested.
> 

The functional test  mhindex.py  which I referred to previously is in Products/ZCTextIndex/tests and its action is described in  README.txt as to "index and query a set of MH folders".

In the function  updatefolder()  of the class  Indexer :

            text = self.getmessagetext(m, f.name)
            if not text:
                self.unindexpath(path)
                continue
            print "indexing", docid, path
            self.index.index_text(docid, text)

where  'index.text(docid, text)'  calls  index_doc()  with the output of the following function (which is clearly a list):

    def getmessagetext(self, m, name=None):
        L = []
        if name:
            L.append("_folder " + name) # To restrict search to a folder
            self.getheaders(m, L)
        try:
            self.getmsgparts(m, L, 0)
        except KeyboardInterrupt:
            raise
        except:
            print "(getmsgparts failed:)"
            reportexc()
        return L

    def getmsgparts(self, m, L, level):
        ctype = m.gettype()
        if level or ctype != "text/plain":
            print ". "*level + str(ctype)
        if ctype == "text/plain":
            L.append(m.getbodytext())
        elif ctype in ("multipart/alternative", "multipart/mixed"):
            for part in m.getbodyparts():
                self.getmsgparts(part, L, level+1)
        elif ctype == "message/rfc822":
            f = StringIO(m.getbodytext())
            m = mhlib.Message("<folder>", 0, f)
            self.getheaders(m, L)
            self.getmsgparts(m, L, level+1)
 .
 .
 .
etc.

Indeed, when this function is tested on an MH folder using a version of  index_doc()  which has been modified as follows it fails in the expected way:

    def index_doc(self, docid, text):
# Fail if argument is a list
        if isinstance(text, list):
            raise TypeError
#
        if self._docwords.has_key(docid):
            return self._reindex_doc(docid, text)
        wids = self._lexicon.sourceToWordIds(text)
 .
 .
 .

I hope this is a suffuciently clear and immediate explanation.
Can we assume from this that  index_doc()  is intended to accept _either_ a list of strings _or_ a single string?

________________________________________
= Edit - Entry #6 by ajung on Jun 24, 2005 12:44 pm

 Changes: submitter email, edited transcript, importance (critical => medium)
________________________________________
= Comment - Entry #5 by ajung on Jun 24, 2005 12:43 pm

On the subject of the tests: 

>There is a file mhindex.py in the ZCTextIndex/tests directory whose use is >presumably to test an index by indexing MH mail folders. 
>
>This test calls index_doc() with a list of strings, and would certainly fail unless >index_doc() accepts a list of strings. 

I could not see at the first glance where lists of strings are used in tests. Since all tests pass I assume that the intended behaviour of the API is working and tested.

>I have also now checked all the 2.7 and 2.8 versions of Zope that I can find, and >every implementation of index_doc() _does_ actually accept a list of strings. 

You have to look at *text index* implementation. If you look carefully at the very old TextIndex code then you will see no code that supports lists of strings. This old code just calls source=str(source) which is unlikely an indication that you can pass anything. Once again: any text index expects text to be indexed - either by Python string or Python unicode string.

So for me this issue is a clear reject...any other opinions?

________________________________________
= Comment - Entry #4 by sacco on Jun 22, 2005 6:38 am

On the subject of the tests:

There is a file  mhindex.py  in the ZCTextIndex/tests  directory whose use is presumably to test an index by indexing MH mail folders.

This test calls  index_doc()  with a list of strings, and would certainly fail unless  index_doc()  accepts a list of strings.

I have also now checked all the 2.7 and 2.8 versions of Zope that I can find, and every implementation of  index_doc()  _does_ actually accept a list of strings.

________________________________________
= Comment - Entry #3 by sacco on Jun 21, 2005 1:32 pm

The query arises because:

1) Prior to this everything appears to work with a list of strings rather than a single string, indeed other products (including SilvaDocument) rely on this behaviour.

2) There is nothing else that I can see that would prevent the use of a list of strings (although, as I pointed out, I'm certainly no expert on these modules).

3) In   ZCTextIndex , the "text" argument given to index_doc() with which to index the document is passed through the following helper function (in Lexicon.py):

def _text2list(text):
    # Helper: splitter input may be a string or a list of strings
    try:
        text + ""
    except:
        return text
    else:
        return [text]

Thus, the subsequent stages of the indexing process actually require a list of strings, and any single string submitted will in fact be wrapped in a list.  
It certainly suggests at least an intention to permit a list of strings.

Whether this is an 'unofficial' extension of the expected behaviour I don't know, as I haven't been able to find any documentation describing what is expected.

> = Comment - Entry #2 by ajung on Jun 21, 2005 12:26 pm
> 
> """
> which breaks if any  attr  has returned a list of strings, behaviour
>   which seems to be allowed for by the index_doc() function called.
> """
> 
> Any text index in Zope excepts a string a unicode string for indexing but
>   not a list of something. Any other behaviour is likely unintentional.
>   Since there are no unittests for the requested behaviour I do except that
>   your assumption is wrong. The change runs fine with the unittests so the
>   index works as expected for me. So any other interpretation than
>   expecting a string und unicode string does not make sense to me at all

________________________________________
= Comment - Entry #2 by ajung on Jun 21, 2005 12:26 pm

"""
which breaks if any  attr  has returned a list of strings, behaviour which seems to be allowed for by the index_doc() function called.
"""

Any text index in Zope excepts a string a unicode string for indexing but not a list of something. Any other behaviour is likely unintentional. Since there are no unittests for the requested behaviour I do except that your assumption is wrong. The change runs fine with the unittests so the index works as expected for me. So any other interpretation than expecting a string und unicode string does not make sense to me at all
________________________________________
= Request - Entry #1 by sacco on Jun 21, 2005 12:02 pm

ZCTextIndex/ZCTextIndex.py was updated to fix handling of multiple attributes (Collector #1784), checked in by Andreas Jung 2005-05-17.

Various changes were made to index_object, although it looks as if changing one line in _index_object() would have been enough:

-        text = getattr(obj, self._fieldname, None)
+        text = getattr(obj, attr, None)

However, the new version ends of index_object() with:

    if all_texts:
        return self.index.index_doc(documentId, ' '.join(all_texts))

which breaks if any  attr  has returned a list of strings, behaviour which seems to be allowed for by the index_doc() function called.

I suspect that what was intended was something more like the following (but I'm just typing this into the browser here, so it should be checked):

    if all_texts:
        all_text = list()
        for txt in all_texts:
            if isinstance(txt, (list, tuple, )):
                all_text.extend(txt)
            elif isinstance(txt, basestring):
                all_text.append(txt)

        return self.index.index_doc(documentId, all_text)

...unless there was some other reason why the original _index_object was used to make each call to index_doc() separately (I don't know the modules well enough to have an opinion here).

==============================================================