[ZODB-Dev] Re: Using Catalog and BTrees

Andrew Dalke dalke@acm.org
Sun, 27 May 2001 13:11:16 -0600


Hello,

  I'm not on this list but we're planning to upgrade so
I've been catching up on the state of ZODB.  FYI, we're
using Zope 2.2.2 but only the ZODB parts of it, and I
want to switch to using StandaloneZODB.

  I see there was a recent discussion questioning if
anyone uses Catalog outside of Zope.  We do, and needed
to make a few changes to it.  I thought people here might
be interested in both the application and the outline of
those changes.

  I developed a simple chemical information system for
one of my clients using ZODB.  (Simple because it doesn't
handle chemistry specific requests, like similarity or
substructure searches.)  It does allow property searches,
so I can ask for a compound with a given name, or one
whose molecular weight is within a given range.

  Property searching was done using the Catalog class,
despite the relative lack of documentation on how to do
it.

  I created my own Database object which holds a ZODB.DB.
Under that are things like the compound data, the catalog,
the list of indexers, global parameters, etc.  The
create_database function works something like:

def create_database(filename, indexers):
  db = ZODB.DB(FileStorage.FileStorage(filename))
  connection = db.open()
  root = connection.root()
  ...
  cat = root["catalog"] = Catalog()
  cat.aq_parent = root

  indexes = cat.indexes
  for name, extractor in indexers.items():
    indexes[name] = Index.Index(name, extractor)
  cat.indexes = indexes

  get_transaction().commit()
  db.close()
  return Database(filename)

A couple of comments here:
  - Michel Pelletier recently pointed out using 'cat.aq_parent = root'
      is incorrect and a better way is 'cat = cat.__of__(root)'
  - I plan to use BSDDB instead of FileStorage because I don't
      need transactions and it looks like we may be hitting
      some memory limitations because of the number and size
      of objects stored.  Still checking this out.


The Index code is a modified version of UnIndex.py which
works with compounds, not just simple data types.  It allows
indexing of different properties, range searches, and indexing
on lists as well as scalar data types.

Catalog allowed range searches through what I thought was
a hack - as I recall, a separate field in the query held the
command to do the range search.  I created a Range object
instead, so range searches can be done with calls like
  db.search(nAtoms = 5, weight = Range(20.0, 60.0))
  db.search(nAtoms = 5, weight = Range(min=20.0))
  db.search(nAtoms = 5, weight = Range(max=60.0))
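
For illustration, a Range object along these lines could be as small
as the sketch below.  It is not the class from this system: the
inclusive bounds and the "contains" method name are assumptions, and
it is written for current Python.

```python
class Range:
    """Hypothetical stand-in for the Range object used in the searches above."""

    def __init__(self, min=None, max=None):
        # Either bound may be omitted, giving an open-ended range.
        if min is not None and max is not None and min > max:
            raise ValueError("min must not exceed max")
        self.min = min
        self.max = max

    def contains(self, value):
        # Assumption: both bounds are inclusive.
        if self.min is not None and value < self.min:
            return False
        if self.max is not None and value > self.max:
            return False
        return True

    def __repr__(self):
        return "Range(min=%r, max=%r)" % (self.min, self.max)
```

An index could then test each candidate key with
Range(20.0, 60.0).contains(key), or use the two bounds to walk only
the matching part of a sorted BTree.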

Allowing lists was done by modifying the index_object to
insert multiple values for a given field, and storing the
information needed for unindex_object to work.
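
The idea can be sketched with plain dictionaries standing in for the
BTrees that UnIndex actually uses.  The class name ListIndex, the
"lookup" method, and the simplified two-argument index_object are all
assumptions made for this example, not the modified UnIndex code.

```python
class ListIndex:
    """Toy index that stores several values per object (illustrative only)."""

    def __init__(self):
        self._index = {}    # value -> set of object ids holding that value
        self._unindex = {}  # object id -> values recorded for that object

    def index_object(self, doc_id, values):
        # Re-indexing replaces whatever was stored for this object before.
        if doc_id in self._unindex:
            self.unindex_object(doc_id)
        values = list(values)
        for value in values:
            self._index.setdefault(value, set()).add(doc_id)
        # Remember what was inserted so unindex_object can undo it later.
        self._unindex[doc_id] = values

    def unindex_object(self, doc_id):
        for value in self._unindex.pop(doc_id, ()):
            ids = self._index.get(value)
            if ids is None:
                continue  # already cleaned up (e.g. a duplicate value)
            ids.discard(doc_id)
            if not ids:
                del self._index[value]

    def lookup(self, value):
        return set(self._index.get(value, ()))
```

The reverse mapping is what makes removal cheap: without it,
unindex_object would have to scan every value in the forward index.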

Support for compounds was done by defining an adapter protocol
which gets the requested data from the search object and
returns a list.  The protocol is called an Extractor and
it defines __call__ to return the list of extracted data,
datafields() to return the list of properties used, and
normalize(s), which normalizes the input query strings before
looking things up in the _index.

So to search the "molecular_weight" property of a compound
with a search named "weight", I need to tell the database
how to extract the data, like this:

  db.addIndex("weight", Index.ByScalarKey("molecular_weight"))

ByScalarKey is an Extractor which
  - gets the key "molecular_weight" from each compound, as in
     compound["molecular_weight"]
  - assumes it's a scalar value, so turns the result into a
     list for indexing
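
A minimal sketch of an Extractor with this shape follows.  Only the
three method names come from the description above; returning an
empty list for a missing key and an identity normalize() are
assumptions made for the example.

```python
class ByScalarKey:
    """Illustrative Extractor: looks up one key and wraps it in a list."""

    def __init__(self, key):
        self.key = key

    def __call__(self, obj):
        # Assumption: an object lacking the key contributes nothing
        # to the index rather than raising an error.
        try:
            value = obj[self.key]
        except KeyError:
            return []
        # The value is scalar, so wrap it in a list for indexing.
        return [value]

    def datafields(self):
        # The properties this extractor reads from each object.
        return [self.key]

    def normalize(self, s):
        # Assumption: query values need no conversion in this sketch.
        return s
```

An extractor for list-valued properties would differ only in
__call__, returning the stored list itself instead of wrapping a
single value.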

As mentioned, searching is done in the Zope-like way of passing
in a set of kwargs, like

   cmpds = db.search(nAtoms = 5, weight = Range(max=60.0))

The implementation is

  def search(self, **params):
      catalog = self.root["catalog"]
      # ... some assertion tests to make sure the search keys are valid ...
      return MoleculeList(self.root, catalog, catalog.searchResults(params))

and the MoleculeList converts the brains_list result of the search
into a list-like interface which returns actual compounds, rather like

class MoleculeList:
    def __init__(self, root, catalog, brains_list):
       ... save the parameters to instance variables ...
    def __len__(self):
       return len(self.brains_list)
    def __getitem__(self, i):
       return self.root[self.catalog.paths[
                   self.brains_list[i].data_record_id_]]
    def __getslice__(self, i, j):
       return MoleculeList(self.root, self.catalog, self.brains_list[i:j])
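
Note that __getslice__ exists only in Python 2; in Python 3 slices
arrive in __getitem__ as slice objects.  A sketch of the same wrapper
for current Python follows.  To keep it self-contained it takes a
plain list of record ids and a "paths" mapping instead of real catalog
brains, which is a simplification and not how the class above is
built.

```python
class MoleculeList:
    """List-like view that turns search hits into stored objects (sketch)."""

    def __init__(self, root, paths, record_ids):
        self.root = root              # mapping: unique name -> compound
        self.paths = paths            # mapping: record id -> unique name
        self.record_ids = record_ids  # ids returned by the search

    def __len__(self):
        return len(self.record_ids)

    def __getitem__(self, i):
        if isinstance(i, slice):
            # Slicing yields another lazy view, not a list of compounds.
            return MoleculeList(self.root, self.paths, self.record_ids[i])
        # An out-of-range i raises IndexError here, which also ends iteration.
        return self.root[self.paths[self.record_ids[i]]]
```

Compounds are fetched from the database only when an element is
actually accessed, so a large result set costs little until it is
iterated.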

The "addIndex" method takes the name of the search field and the
Extractor, adds the new Index to the Catalog, and indexes all of
the compounds already in the database.

class Database:
     ...
    def addIndex(self, name, extractor):
        catalog = self.root["catalog"]
        indexes = catalog.indexes
        new_index = Index.Index(name, extractor)
        indexes[name] = new_index
        catalog.indexes = indexes

        new_index = new_index.__of__(catalog)
        for uid, i in catalog.uids.items():
            new_index.index_object(i, self.root[uid], None)

Finally, after a compound has been added to the database (which isn't
an atomic operation), or had one of its properties modified, it
is indexed by calling the Database's "index" method.

class Database:
     ...
    def index(self, mol):
        catalog = self.root["catalog"]
        name = ... get the unique name for the molecule ...
        if catalog.uids.has_key(name):
            catalog.uncatalogObject(name)
        catalog.catalogObject(mol, name)



The end result is I can do searches and easily print the results,
as with the following (various parts have been removed to show
the details of the loading and searching mechanism):

  db = Database.create_database("test.fs",
            indexers = {
               "weight": Index.ByScalarAttribute("molecular_weight"),
               "charge": Index.ByScalarAttribute("charge"),
               "nHalogens": Index.ByScalarAttribute("nHalogens"),
               "nRings": Index.ByScalarAttribute("nRings"),
            })

  # First line contains the headers "smiles", "molecular_weight",
  # "charge", "nHalogens", and "nRings", which the reader uses to
  # know the names for the fields in the following lines.
  for record in TabReader(open("input.data")):
      mol = db.add(record["smiles"])
      for k, v in record.items():
          if k == "smiles": continue  # can't modify primary key
          mol[k] = v
      db.index(mol)

  get_transaction().commit()

  for cmpd in db.search(weight = Range(max=1000.0),
                        charge = Range(min=-2, max=2),
                        nHalogens = [0, 2],  # 0 or 2 halogens
                        nRings = 2):
      print cmpd["name"], cmpd["smiles"]

And I must say, I really, really like this ability, despite
having to dig through the code and mailing lists to figure
out how Catalog works, and make some mods to get it to search
the fields I wanted.

Hope this description of our needs for ZODB proves enlightening :)
At the very least, it would be nice if some of the cataloging
code were changed so I wouldn't have to make deep modifications
to the UnIndex module.

Thanks Digital Creations and everyone else for ZODB!

                    Andrew
                    dalke@acm.org