[Zope] attribute used to index PDFs?

Garth B. garthb at gmail.com
Mon Dec 12 11:33:13 EST 2005


TextIndexNG 3.1.1
Zope 2.8.0
Python 2.3.5

What attribute should be specified when indexing PDFs?  I've been
using "data".  Word docs are indexed properly, but the PDFs aren't. 
The PDFs are still found with the rest of the files, but the indexed
content is not what I expected.

To try to narrow things down, I set up a separate test Catalog with only
two PDFs.  The number of distinct values for indexing these PDFs is
around 6600 (which seems a little high for two PDFs with a combined
total of 3 pages).  In the Catalog tab of my test ZCatalog, the PDFs
are listed as type "Unknown".  The content type of these PDFs is set
to "application/pdf".

(In my other ZCatalog, the PDFs and Word docs are listed as type "File")
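
For anyone wanting to reproduce the check, here is a minimal sketch of
testing the two things the index has to go on: the declared content
type and the bytes themselves.  It assumes the content-type string and
the raw file data have already been pulled out of the File object; the
helper names are hypothetical and not part of Zope or TextIndexNG.

```python
def normalized_content_type(content_type):
    """Drop parameters, surrounding whitespace and stray quotes; lower-case."""
    main = content_type.split(";", 1)[0]
    return main.strip().strip("'\"").strip().lower()


def looks_like_pdf(data):
    """True if the bytes begin with the PDF header marker."""
    return data[:5] == b"%PDF-"


def check_file(content_type, data):
    """Return a list of problems found (empty if nothing looks wrong)."""
    problems = []
    if content_type != "application/pdf":
        problems.append("content type is %r, not 'application/pdf'" % content_type)
    if not looks_like_pdf(data):
        problems.append("data does not start with the PDF header marker")
    return problems
```

An exact string comparison is used in check_file on purpose, on the
assumption that a converter lookup keyed on the MIME type would be
just as strict; normalized_content_type is only there to show what the
value was probably meant to be.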

This is an excerpt from the vocabulary for "f" in my test Catalog's index:
-------------------------
f
f+æq
f0
f2ök
f5ô
f6
f7ëfü
fa
false
fb8aad1ed82a2cc33e9feb68a3f323
fbt
fc
fd
fdo
fe
fea
feâà
ff
fg
fgiëü
fh
fib
filter
filters
firstchar
fió
fl
flags
flatedecode
fm
fmx
fnaèh
font
fontbbox
fontdescriptor
fontfamily
fontfile2
fontname
fontstretch
fontweight
footlight
format
-------------------------
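Several of these terms (flatedecode, fontdescriptor, fontbbox,
fontfile2, firstchar) are key names from the PDF file format itself
rather than words from the documents.  A rough sketch for measuring
that over a whole vocabulary, assuming the terms have been exported to
a plain Python list; PDF_KEYWORDS is a hand-picked, incomplete set and
pdf_internal_terms is a hypothetical helper:

```python
# Hand-picked names that appear in PDF object dictionaries and file
# structure.  Illustrative only; far from a complete list.
PDF_KEYWORDS = frozenset([
    "filter", "flatedecode", "firstchar", "lastchar", "flags",
    "font", "fontbbox", "fontdescriptor", "fontfamily", "fontfile2",
    "fontname", "fontstretch", "fontweight",
    "endobj", "endstream", "xref",
])


def pdf_internal_terms(vocabulary):
    """Return the indexed terms that are PDF structure names, sorted."""
    return sorted(set(t.lower() for t in vocabulary) & PDF_KEYWORDS)


sample = ["fa", "false", "filter", "flatedecode", "fontbbox",
          "footlight", "format"]
print(pdf_internal_terms(sample))  # ['filter', 'flatedecode', 'fontbbox']
```
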
It looks as though the converter isn't doing its job, or the index
isn't recognizing the files as PDFs.  I have manually run pdftotext at
the command line with each of the PDFs to see if pdftotext is having
trouble and it appears to output the textual content properly.  The
TextIndexNG Converters tab does recognize it.  Do I have a
misconfiguration somewhere?
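
The manual pdftotext check could be scripted roughly as follows.  This
is a sketch run outside Zope, assuming pdftotext is on the PATH and a
Python new enough to have subprocess.check_output (so not the 2.3.5
listed above).  The function names are hypothetical, and the regex
tokenizer is a simple stand-in, not TextIndexNG's own splitter.

```python
import re
import subprocess


def pdftotext_output(pdf_path):
    """Run pdftotext on a file; '-' as the output name writes to stdout."""
    raw = subprocess.check_output(["pdftotext", pdf_path, "-"])
    # The output encoding depends on the pdftotext build, so undecodable
    # bytes are replaced rather than raising an error.
    return raw.decode("utf-8", "replace")


def words(text):
    """Lower-cased set of word tokens in the text."""
    return set(re.findall(r"\w+", text.lower()))


def unexpected_terms(index_terms, extracted_text):
    """Index terms that never occur in the converter's text output."""
    return sorted(set(t.lower() for t in index_terms) - words(extracted_text))
```

If unexpected_terms returns most of the vocabulary for a PDF, the index
was probably fed something other than the pdftotext output.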

Thanks!

