[Zope] indexing pdf files

Kapil Thangavelu kthangavelu@earthlink.net
Thu, 31 Aug 2000 17:51:24 -0700

Terry Kerr wrote:
> Hi,
> I need to be able to index the text within pdf files.  I assume I will
> somehow use PrincipiaSearchSource, but I need to know how to get the
> text out of the pdf when it is uploaded to the ZODB.  Has anyone done
> this before?  Are there any packages around that I can use that run in
> python or at least on a linux box that I can pipe to and from?
> terry

for going from xml to pdf there are a multitude of ways in python:

XSLT - check out the xml zone at ibm.com/developer; they have an article
in the education library on transforming xml to pdf.

platypus packages from

they might give you some help in going the other way..

as for implementation... 

looking at a pdf in a text viewer, it appears to be made up of formatting
text and encoded display strings.

you could write a subclass of File that reads its content upon upload,
strips the formatting strings, decodes the display strings, and stores
the result as a property to be indexed.