[ZODB-Dev] standalone ZODB for large Numeric arrays

Mike Mueller mmueller@dgfz.de
Fri, 28 Feb 2003 21:20:44 +0100


Hi everybody,

I am new to this list. I just finished (well, interrupted
development until the next research project starts) a
rather large numerical model named MODGLUE.  It's mainly
written in Python using 3 already existing big numerical models
(2 in FORTRAN, one in C).  One of the FORTRAN models was
considerably modified by me and was turned into a Python
extension, the other two are just executables.  I do
input generation and output reading as well as restarting
with Python to steer them.  The whole thing works just
like one big model thanks to Python.  That's in short
what I did the last three years.

To store my data I use pickle, netCDF and lately ZODB.
ZODB works nicely.  Since I have large amounts of data
(several GBs) in the form of Numeric arrays I use netCDF,
a binary format especially designed for array data
storage and efficient access.  It is written in C and
has interfaces to C++, FORTRAN, Java, Perl and also Python
(included in Konrad Hinsen's Scientific).  It's platform
independent and optimised for large array-based data.

Instead of storing data in different formats I would
like to, eventually, move entirely to ZODB.  I would
also like to use the functionality and speed that
netCDF provides.  My ideal solution would be a netCDF
enabled ZODB that internally stores numerical arrays
in netCDF.  The ZODB API should be extended in such
a way that I can tell ZODB to store this object in netCDF.
Also a switch 'store_all_numerical_in_netCDF' that
automatically uses netCDF files to store an array
would be nice.

Since I am only a user of ZODB and don't know anything
about its internals I am curious if something along
these lines would be feasible:


import os

from Numeric import ArrayType
from ZODB import FileStorage, DB
from Scientific.IO.NetCDF import NetCDFFile  # netCDF interface from Konrad Hinsen


class NcStoragePathError(Exception):
    """Raised when a netCDF storage path is already set."""


class NcEnabledDB(DB):
    """
    ZODB that stores all Numeric arrays in netCDF files.
    Code is not tested nor believed (not even supposed) to work.
    This is just an expression of the idea using Python's what you see
    is what you get (what I imagine is what I want?),
    i.e. store_all_numerical_in_netCDF = 1
    """
    def __init__(self,
                 storage,
                 ncStoragePath=None,
                 newNcStoragePath=None):
        self.storage = storage
        DB.__init__(self, self.storage)
        # root mapping of the database, obtained through a connection
        self.root = self.open().root()
        self.ncStoragePath = ncStoragePath
        self.newNcStoragePath = newNcStoragePath
        self.setNcPath()

    def setNcPath(self):
        if self.ncStoragePath:
            if 'ncStoragePath' not in self.root:
                self.root['ncStoragePath'] = self.ncStoragePath
            elif not self.newNcStoragePath:
                raise NcStoragePathError(
                    'ncStoragePath already exists, set newNcStoragePath '
                    'true for new path')
            else:  # a new path was explicitly requested
                self.root['ncStoragePath'] = self.ncStoragePath
        else:  # default path
            if isinstance(self.storage, FileStorage.FileStorage):
                dirName = os.path.join(
                    os.path.dirname(self.storage.getName()), 'ncPath')
                self.root['ncStoragePath'] = dirName
            else:
                raise NotImplementedError(
                    '%s not supported' % self.storage.__class__.__name__)

    def __getitem__(self, key, slice=None):
        """Slice is not nice"""
        if isinstance(self.root[key], ArrayType):
            return self.getNcItem(key, slice)
        else:
            return self.root[key]

    def __setitem__(self, key, item):
        """Needs slice"""
        if isinstance(item, ArrayType):
            self.setNcItem(key, item)
        else:
            self.root[key] = item

    def __delitem__(self, key):
        """Needs slice"""
        if isinstance(self.root[key], ArrayType):
            self.delNcItem(key)
        else:
            del self.root[key]

    def getNcItem(self, key, slice=None):
        """Whole array or slice of array collected from netCDF file.
        Slice syntax needs to be refined.
        """
        path = os.path.join(self.root['ncStoragePath'], '%s.nc' % key)
        ncFile = NetCDFFile(path, 'r')
        if slice is not None:
            value = ncFile.variables[key][slice]  # slice syntax !!!
        else:
            value = ncFile.variables[key][:]
        ncFile.close()
        return value

    def setNcItem(self, key, item, slice=None):
        path = os.path.join(self.root['ncStoragePath'], '%s.nc' % key)
        ncFile = NetCDFFile(path, 'r+')
        if slice is not None:
            ncFile.variables[key][slice] = item[slice]  # slice syntax !!!
        else:
            ncFile.variables[key][:] = item[:]
        ncFile.close()


     # more code to follow...
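
The routing idea itself (arrays go to separate files, everything else
stays in the object database) can be tried out without ZODB or netCDF.
The following is only a minimal sketch under loud assumptions: a plain
dict stands in for the ZODB root, array.array stands in for Numeric
arrays, and raw binary files stand in for netCDF files. The names
RoutingStore, ArrayRef and get_slice are hypothetical and exist only
for this illustration.

```python
import os
import tempfile
from array import array


class ArrayRef:
    """Small marker kept in the root mapping instead of the array data."""

    def __init__(self, typecode, length):
        self.typecode = typecode
        self.length = length


class RoutingStore:
    """Dict-like store: arrays go to one binary file per key,
    all other objects go to the root mapping (stand-in for ZODB)."""

    def __init__(self, array_dir):
        self.array_dir = array_dir
        self.root = {}  # assumption: stand-in for the ZODB root

    def _path(self, key):
        return os.path.join(self.array_dir, '%s.bin' % key)

    def __setitem__(self, key, item):
        if key in self.root:
            del self[key]  # drop any old value and its file first
        if isinstance(item, array):
            with open(self._path(key), 'wb') as f:
                item.tofile(f)
            self.root[key] = ArrayRef(item.typecode, len(item))
        else:
            self.root[key] = item

    def __getitem__(self, key):
        entry = self.root[key]
        if isinstance(entry, ArrayRef):
            values = array(entry.typecode)
            with open(self._path(key), 'rb') as f:
                values.fromfile(f, entry.length)
            return values
        return entry

    def __delitem__(self, key):
        entry = self.root.pop(key)
        if isinstance(entry, ArrayRef):
            os.remove(self._path(key))

    def get_slice(self, key, start, stop):
        """Read only elements start..stop-1 from disk, not the whole array."""
        entry = self.root[key]
        if not isinstance(entry, ArrayRef):
            raise TypeError('%r is not stored as an array' % key)
        stop = min(stop, entry.length)
        values = array(entry.typecode)
        with open(self._path(key), 'rb') as f:
            f.seek(start * values.itemsize)
            values.fromfile(f, max(stop - start, 0))
        return values


store = RoutingStore(tempfile.mkdtemp())
store['conc'] = array('d', [1.0, 2.0, 3.0, 4.0])  # routed to a file
store['title'] = 'daily concentrations'           # stays in the mapping
print(store['conc'], store.get_slice('conc', 1, 3), store['title'])
```

Keeping only a small marker in the root mapping means that reading
back a key can decide where the data lives without touching the array
file, and a partial read never loads the whole array into memory.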


Eventually this should 'feel' like the real thing (i.e. ZODB)
but store the (huge) arrays in netCDF files. Things like
undo or transactions may not be possible (necessary?).
netCDF arrays may have a so-called unlimited dimension.
This is very often used for storing data at different time
steps. For example, my model produces daily data of
concentrations in a 2D array. The netCDF array is 3D.
The first dimension, being the unlimited dimension, counts
the time steps.

[[[111, 112, 113],             time step  1
  [121, 122, 123]],
 [[211, 212, 213],             time step  2
  [221, 222, 223]],
 [[311, 312, 313],             time step  3
  [321, 322, 323]]]

This feature is very important and needs to be supported
by the modified ZODB. More details may
become important down the road ....
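
What a growing first dimension means in practice can be sketched with
the standard library alone. This is not netCDF and does not use its
file format: it is a hypothetical RecordFile class that appends
fixed-shape 2D records to a flat binary file and reads a single time
step back by seeking to its offset. The class name, the file name and
the use of doubles are assumptions made for the illustration.

```python
import os
import tempfile
from array import array


class RecordFile:
    """Append-only file of fixed-shape 2D records, one per time step."""

    def __init__(self, path, n_rows, n_cols):
        self.path = path
        self.n_rows = n_rows
        self.n_cols = n_cols
        self.record_bytes = n_rows * n_cols * array('d').itemsize
        open(path, 'ab').close()  # create the file if it is missing

    def __len__(self):
        # The number of time steps is not fixed in advance; it follows
        # from the current file size.
        return os.path.getsize(self.path) // self.record_bytes

    def append(self, record):
        if (len(record) != self.n_rows
                or any(len(row) != self.n_cols for row in record)):
            raise ValueError('record must have shape (%d, %d)'
                             % (self.n_rows, self.n_cols))
        flat = array('d', [value for row in record for value in row])
        with open(self.path, 'ab') as f:
            flat.tofile(f)

    def __getitem__(self, step):
        if not 0 <= step < len(self):
            raise IndexError(step)
        flat = array('d')
        with open(self.path, 'rb') as f:
            f.seek(step * self.record_bytes)  # jump straight to the record
            flat.fromfile(f, self.n_rows * self.n_cols)
        return [list(flat[r * self.n_cols:(r + 1) * self.n_cols])
                for r in range(self.n_rows)]


path = os.path.join(tempfile.mkdtemp(), 'conc.bin')
series = RecordFile(path, n_rows=2, n_cols=3)
series.append([[111, 112, 113], [121, 122, 123]])  # time step 1 (index 0)
series.append([[211, 212, 213], [221, 222, 223]])  # time step 2 (index 1)
series.append([[311, 312, 313], [321, 322, 323]])  # time step 3 (index 2)
print(len(series), series[1])
```

Each new day only appends one record, and reading a single time step
touches only that record, which is the behaviour the modified ZODB
would have to keep.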


Is this a good idea? Is there a better way to
implement the whole thing?

Any suggestions, criticism, opinions are welcome.

Thanks for feedback.


Mike
---------------------------------------------------------
	Dipl.-Ing. Mike Müller, M.Sc.
	Dresdner Grundwasserforschungszentrum e.V.
	Meraner Str. 10
	D-01217 Dresden
Tel.:  	0351/4050675
Fax.:  	0351/4050679
e-mail: mmueller@dgfz.de
----------------------------------------------------------