[Zope] ZCTextIndex - prefix wildcards not supported?

Small Business Services toolkit at magma.ca
Mon Jun 21 09:49:00 EDT 2004


Hi Casey,

I am trying to implement your suggestion of accessing the '_docwords'
structure in an attempt to eliminate duplicate storage of data in the
ZCatalog.

I have created a test external method to retrieve the _docwords entry for a
specific object in an existing ZCatalog:

def jtmp(self):
   res = self.Catalog({'id' : '1086793690.85'})
   for item in res:
      rid = item.data_record_id_
   return self.Catalog.getIndex('all_searchable_text').getEntryForObject(rid)


Executing this external method gives me a zope error:

Traceback (innermost last):
  Module ZPublisher.Publish, line 98, in publish
  Module ZPublisher.mapply, line 88, in mapply
  Module ZPublisher.Publish, line 39, in call_object
  Module Products.ExternalMethod.ExternalMethod, line 224, in __call__
   - __traceback_info__: ((<Folder instance at a063d58>,), {}, None)
  Module /apps/zope/Extensions/jtmp.py, line 13, in jtmp
AttributeError: getIndex

I am confused (being a relative Python newbie) because 'getIndex' and
'getEntryForObject' are functions defined within the Catalog class, so
shouldn't they be available?!
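
One possible explanation, sketched with stand-in classes rather than Zope
itself: the object reached as self.Catalog may be a wrapper that keeps the
real Catalog instance in an attribute and does not re-export getIndex. The
class names InnerCatalog and CatalogWrapper are hypothetical, and the
assumption that the wrapper stores the inner catalog in `_catalog` should be
checked against the installed Zope source.

```python
class InnerCatalog:
    """Stand-in for the inner catalog class that owns the indexes."""

    def __init__(self):
        self._indexes = {}

    def add_index(self, name, index):
        self._indexes[name] = index

    def getIndex(self, name):
        return self._indexes[name]


class CatalogWrapper:
    """Stand-in for the published wrapper object.

    Assumption: it holds the inner catalog in `_catalog` and does not
    define getIndex itself.
    """

    def __init__(self):
        self._catalog = InnerCatalog()


wrapper = CatalogWrapper()
wrapper._catalog.add_index("all_searchable_text", {"42": ["zope", "catalog"]})

# Calling getIndex on the wrapper fails, as in the traceback above.
try:
    wrapper.getIndex("all_searchable_text")
    failed = False
except AttributeError:
    failed = True

# Going through the inner catalog works.
index = wrapper._catalog.getIndex("all_searchable_text")
```

If that assumption holds, the call in jtmp would need to go through the inner
catalog object. Note also that rid is never assigned when the query returns
no results.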

Is there a better way to go about this?

Thanks,

Jonathan


----- Original Message -----
From: "Casey Duncan" <casey at zope.com>
To: "Small Business Services" <toolkit at magma.ca>
Sent: November 21, 2003 4:28 PM
Subject: Re: [Zope] ZCTextIndex - prefix wildcards not supported?


> On Fri, 21 Nov 2003 14:08:08 -0500
> "Small Business Services" <toolkit at magma.ca> wrote:
>
> > The Zope Cache size is set at 10,000
> >
> > There are 1,985,183 objects in the 'database'
>
> Hmm, that's less than I would have thought.
>
> > Specifications for our update linux box:
> >
> >    Zope 2.6.1
> >    1 ghz PIII
> >    1.25 Gb RAM (pc133)
> >    3 disks (IBM ultrastar, scsi, ultra2mode - 10,000 rpm, 4.5ms access)
> >
> > We are running the disks striped on a single controller, which gives us
> > amazing read/write capacity.  We rarely run at full capacity on the
> > disks.  We set the cache at the highest point possible (any higher and
> > the machine swaps itself to death).
>
> I think you could definitely use more RAM. But that is a given pretty
> much. How big is the Data.fs file when you're through indexing? How does
> that compare to the size of the document corpus itself?
>
> Also I think you may want to try Zope 2.6.2. I made some changes to
> ZCTextIndex in that version that could help performance. I would be
> interested to hear if they help.
>
> [snip]
> > We eventually came up with our current solution: at index time we
> > compress the full-text and store it as binary data in the metadata table
> > (getting this to work was a challenge in itself).  We then decompress and
> > scan this data to locate the relevant 2-3 lines at retrieval time (it is
> > far faster to decompress & scan metadata than to access the objects
> > directly).
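
A minimal sketch of the compress-at-index-time, scan-at-retrieval-time
approach described above, using the standard zlib module. The function names
pack_text and matching_lines are hypothetical, and the plain substring scan is
an assumption about how the relevant lines are located.

```python
import zlib


def pack_text(text):
    """Compress the full text into a binary blob suitable for a metadata column."""
    return zlib.compress(text.encode("utf-8"))


def matching_lines(blob, term, limit=3):
    """Decompress the blob and return up to `limit` lines containing `term`."""
    text = zlib.decompress(blob).decode("utf-8")
    needle = term.lower()
    hits = [line for line in text.splitlines() if needle in line.lower()]
    return hits[:limit]


document = (
    "Zope is an application server.\n"
    "ZCatalog indexes content.\n"
    "ZCTextIndex ranks text.\n"
)
blob = pack_text(document)
snippet = matching_lines(blob, "index")
```

Only the blob is touched at retrieval time, so the original object does not
have to be loaded to build the snippet.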
>
> Using metadata tends to wake up far fewer objects, which can be a win.
> Interestingly, ZCTextIndex actually stores a similar compressed word list
> internally. The actual index object stored in ZCTextIndex has a _docwords
> BTree which stores a compressed wordlist for each document. This is used
> for unindexing and phrase matching. Look at the search_phrase method in
> BaseIndex.py for more info.
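
To illustrate the idea of a per-document compressed word list that supports
phrase matching, here is a rough sketch. It is not the ZCTextIndex
implementation: the byte encoding, the dict standing in for the _docwords
BTree, and the names encode_wids, decode_wids, and contains_phrase are all
assumptions made for the example.

```python
def encode_wids(wids):
    """Pack word ids into bytes; the first byte of each id has the high bit set."""
    out = bytearray()
    for wid in wids:
        chunks = []
        while True:
            chunks.append(wid & 0x7F)
            wid >>= 7
            if not wid:
                break
        chunks.reverse()
        out.append(chunks[0] | 0x80)  # start-of-id marker
        out.extend(chunks[1:])        # continuation bytes, high bit clear
    return bytes(out)


def decode_wids(data):
    """Recover the list of word ids from the packed bytes."""
    wids = []
    for byte in data:
        if byte & 0x80:
            wids.append(byte & 0x7F)
        else:
            wids[-1] = (wids[-1] << 7) | byte
    return wids


def contains_phrase(docwords, phrase_wids):
    """Check whether the packed document contains the ids as a contiguous run."""
    code = encode_wids(phrase_wids)
    start = docwords.find(code)
    while start >= 0:
        end = start + len(code)
        # Accept the match only if it ends on an id boundary.
        if end == len(docwords) or docwords[end] & 0x80:
            return True
        start = docwords.find(code, start + 1)
    return False


# A plain dict stands in for the per-document mapping.
docwords = {7: encode_wids([1, 300, 2])}
```

With this layout, decode_wids(docwords[7]) gives back the word ids for
document 7, and a phrase check is a byte-string search rather than a walk over
the original object.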
>
> If you could use _docwords, you might be able to get rid of that redundant
> data structure and the time it takes to build and store it. Retrieval time
> should be on par with metadata.
>
> > Retrieval speeds for end users are excellent.  We have only been running
> > into difficulties lately because of the size of the database.  The
> > update process now runs 24 hours per day for about 30 days (automating an
> > update process that runs for 30 days was another exciting challenge!).
> > The fact that zope can handle this volume of processing is a testament to
> > its reliability and robustness!
>
> I'm concerned that it takes that long to index. 30 days is like a
> millennium of processor time. I'm curious how big your transactions are
> during index processing.
>
> I'm glad to see the retrieval speeds are good. What roughly is the average
> document size?
>
> > We have been working with Zope for about 3 years and think that it is a
> > FANTASTIC product!  We keep coming up with new things to use it for, it's
> > great!
> >
> > Thanks in advance for any ideas you may have - we are open to any and
> > all suggestions!
>
> Sounds like you have a very interesting application. I'd be very
> interested to hear about it and possibly try to help make it faster if I
> can.
>
> -Casey




