[Zope] attribute used to index PDFs?

Fri Feb 24 10:37:27 EST 2006

Hmm?  I must have missed where it was suggested in this old thread to
enter this "issue" into the bug tracker.  At any rate, what I
eventually concluded was that this really isn't an issue, just a
misconception I had about what TXNG3 actually provides as native
indexing support (given the appropriately installed converters).
Unless the user is running Plone or something else that provides a
TXNG hook into the File's data, the user still needs to write the
appropriate adapter so that the indexer can pull the raw data from the
object, convert it, and index it.
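
To illustrate why the adapter matters, here is a minimal Python sketch
of the behaviour described in this thread.  Every class and function
name in it (ContentCollector, PlainFile, FileAdapter, needs_conversion)
is a hypothetical stand-in, not the real TXNG3 API.  It only shows
that conversion can be triggered when the mimetype travels with the
data, which is what an adapter or a txng_get() hook would provide.

```python
# Stand-in sketch only: these classes mimic the behaviour described in
# this thread and are NOT the real TextIndexNG3 API.

class ContentCollector:
    """Hypothetical stand-in for the collector the indexer fills."""

    def __init__(self):
        self.fields = {}

    def add_binary(self, field, data, mimetype):
        # Binary content carries a mimetype; that is what lets a
        # converter (e.g. for PDF or Word) be selected later.
        self.fields[field] = {"data": data, "mimetype": mimetype}

    def add_text(self, field, text):
        # Plain attribute values are stored without a mimetype.
        self.fields[field] = {"data": text}


class PlainFile:
    """Hypothetical stand-in for a File object: raw bytes, no hook."""

    def __init__(self, data, content_type):
        self.data = data
        self.content_type = content_type


class FileAdapter:
    """Hypothetical adapter handing raw data plus mimetype to the indexer."""

    def __init__(self, context):
        self.context = context

    def collect(self, field):
        collector = ContentCollector()
        collector.add_binary(field, self.context.data,
                             self.context.content_type)
        return collector


def needs_conversion(collector, field):
    # Mirrors the has_key('mimetype') test quoted later in this thread.
    return "mimetype" in collector.fields[field]


pdf = PlainFile(b"%PDF-1.4 ...", "application/pdf")

# Without an adapter: the "data" attribute is indexed as-is.
bare = ContentCollector()
bare.add_text("data", pdf.data)

# With an adapter: the mimetype is known, so a converter could run.
adapted = FileAdapter(pdf).collect("data")

print(needs_conversion(bare, "data"))
print(needs_conversion(adapted, "data"))
```

In the bare case the raw bytes end up in the index untouched, which
matches the binary junk seen in the vocabulary.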

This was a bit of a change from what I was used to with TXNG2, which
does know how to pull the data from File objects.  Since I didn't have
enough time to research what was involved in writing an adapter, I
fell back to using TXNG2.  It worked well and accomplished what I
needed.

Garth

On 2/24/06, Andreas Jung <lists at andreas-jung.com> wrote:
>
>
> --On 12. December 2005 14:54:09 -0500 "Garth B." <garthb at gmail.com> wrote:
>
> > On closer inspection, the Word docs aren't actually being indexed
> > appropriately either.  When I browse the vocabulary for these indexed
> > Word docs, I see some textual content that is also visible when
> > cat'ing the document to stdout.  The vocab includes other strings
> > that certainly are not content.  I guess they're string
> > representations of binary content.
> >
> > These are other things that I noticed, maybe they won't amount to
> > anything:
> >
> > - When I watch the processes with top during indexing, I don't see
> > wvWare or pdftotext appear.  Maybe they won't.
> >
> > - I also inserted a couple of LOG.warn's in src/textindexng/content.py
> > around line 130 (  if d.has_key('mimetype'):  ), and this test always
> > fails, thereby skipping conversion.
> >
> > - Digging further in this file, "mimetype" is only defined when
> > extract_content() in content.py calls "icc.addBinary(...)".  This only
> > happens when the indexed object provides a txng_get() hook (or I
> > suppose if an adapter exists).  That whole block (around lines 81 -
> > 93) never gets hit with my PDFs or Word docs during indexing.  When I
> > index a large number of PDFs I will get a number of TypeErrors raised
> > around line 110 when extract_content() notices that the data isn't a
> > [unicode] string.
> >
> > Is the standard Zope File object supposed to expose a txng_get hook?
> >
> > On 12/12/05, Garth B. <garthb at gmail.com> wrote:
> >> Hi Andreas,
> >>
> >> Neither PrincipiaSearchSource nor SearchableText does anything for
> >> these File-type objects.  I guess nothing for SearchableText is
> >> expected since these are not CMF or Plone-derived objects.  The only
> >> way I've managed to get *anything* indexed for these File-type objects
> >> is by specifying the "data" attribute.
> >>
> >> A couple of related postings that I've found through a bit of Googling
> >> have also noted having to use "data" when indexing these kinds of
> >> files, for example:
> >> http://mail.zope.org/pipermail/zope/2003-August/139702.html
> >>
> >> So, I should be able to use PrincipiaSearchSource?  I've only used
> >> that for text-oriented objects like Page Templates.  I'll keep digging
> >> around, but I welcome any suggestions for what the problem could be or
> >> how I can debug this further.
>
> Maybe you should bring this to TXNG bugtracker (as suggested!).
>
> -aj