[ZODB-Dev] Referential integrity
Mike C. Fletcher
mcfletch@rogers.com
Sat, 02 Mar 2002 01:14:34 -0500
Your problem is harder than mine...
How I do it (a room-booking application):
Booking
user property
User object
other properties
conflicts (regular attribute, calculated from the other attributes)
With all bookings stored in a single collection (referenced all over the
place, but required to be in the collection), and all users stored in a
single collection (same idea).
property interactions: When Booking properties change, the property
classes for each property trigger a check for conflicts (they use a
bisect-based index of the Bookings collection so that this only takes a
few milliseconds/attribute update) and a notification to all GUI objects
to update the visualisation.
user table watching: On deletion of a user, need to find all bookings
referencing that user and let the administrator figure out what to do
with them. So, using Patrick's dispatcher module, I watch the user
collection's 'delete' and 'deletes' messages (basically, the collection
class generates these events whenever a user, or multiple users are
deleted). The watcher just scans the Booking table, loading each
booking and making sure that the user property is still valid.
That, of course, is horribly inefficient, as _every single booking_ (and
there can be tens or hundreds of thousands of them) needs to be loaded
into memory just to do the single integrity check. I could, I suppose,
create an index mapping booking:user, but since deleting a user is a
_very_ infrequent event (should only happen 1 or 2 times a year in this
system when there is a need to clean up the user list to make it easier
to scan, and could even be disallowed entirely if I felt like being
hard-nosed about it), I don't care.
The thing about doing the referential integrity thing efficiently is
that you really need to be able to scan a field in a table, not a
collection of objects which require loading/storing before you can
process them. I could see doing it with heavily constrained object
collections such as you see in other ODBMS (for example, a collection of
homogenous objects where only properties are stored for a certain root
class, all instances have the same base set of properties (plus an
optional "extra data" field, or an extension table with sub-class data)
which can be quickly scanned)), or, more generally with an indexing
scheme that can efficiently build tables based on object properties.
As far as I know, there's no automatic way to create an index table for
the ZODB (you'd want it to be outside the ZODB so that it wouldn't be
versioned, could use static data typing, etceteras). Hyper-G (another
object database (big web content engine as well, but underneath, a
distributed database)) used to have an explicit "references" table that
could be scanned, and all references went through that table (rather
than directly to the object). That would be a pretty significant
restructuring of the ZODB, though, I think. [Conceptually, the
reference happens with the ZODB oid, but there's no table mapping
referer to referee with oids, creating that table and the support for it
being the mentioned restructuring]
All-in-all, lots of hacky ways to do what you want, but I can't think of
an elegant and clean one that could be done today. Here are some ideas
of what you could do with some fairly minor surgery:
1) References table -- objects wanting referential integrity checks must
go through a table, table should not use versioning, should have tightly
constrained semantics, maybe as simple as:
refererID:referenceID
So that your ref integrity scan would just flag all referer's as
"potentially invalid" and report them to the application, which would
figure out what to do with them.
Note: this can wind up being a _huge_ (and highly redundant, when you
consider that all of this information is stored in the ODBMS anyway) and
very volatile table, so you'd really want to keep it in an RDBM, I'd think.
To make this pretty, you want the machinery to support it going into the
pickler, so that the references are automatically maintained (i.e. on
store, you take the old references _from_ object, remove them, then add
the new references _from_ object. Similarly, the GC would scan the
table automatically on de-commisioning an object and report possible ref
integrity failures for all refs _to_ object.) Unfortunately, to create
any sophisticated checks, you'd want the referers to be loaded and
processed on each store of the referenced object (i.e. if your check is
"owner of Booking must be a user with more than $2000 on account", you
want to load each referencing Booking and run the check for it).
Blech... this is engineering teritory if you're going to avoid a mass of
spaghetti and soup.
2) Generic index scanner -- for any indexed property, where property
value == changed object, flag as invalid. This is a better approach if
you're going to have just about every attribute indexed anyway and don't
mind doing a bit of extra scanning to figure out what values are object
references. Doesn't support ref-integ for properties not indexed.
Properly worked, you could make the index storage also act as primary
storage for the data values I suppose, which would reduce the
redundancy/overhead of the indexing scheme (but make the object records
more complex)
3) Persistent Weakrefs -- for your particular app, a persistent weak
reference, with back references would seem to work (basically, keep the
ref:ref table as an attribute of your objects).
Basically in my application, the only thing that I need the checks to be
fast on are the "conflict" sub-tables. For those, I constructed my own
index so I can do time-range-based queries very cheaply. For everything
else I just watch the collections and do the brute-force load-and-scan
checks when an item is deleted (which is the only check I currently care
about).
Dumb designer will now stop blathering at crowd of comp-sci and comp-eng
people about things they already know :) . I'll go back to drawing now :) ,
Mike
Magnus Lyckå wrote:
> At 17:47 2002-03-01 -0500, Mike Fletcher wrote:
>
>> I actually had to deal with referential integrity problems in an app
>> just two weeks ago. Best approach I found was to wire in a watcher
>> that looks for ObjectDeleted events on the collection, and on finding
>> one, does the scan explicitly. Similarly, the property classes for
>> the objects refresh conflict sub-tables and check object consistency
>> as soon as an attribute is altered.
>
>
> Huh? How do you do that? I'm just thinking about a
> similar case in my app. (But here the problem is probably
> more a matter of finding all the places where I might have
> a link to this object. There's no magical "who references
> me" function in Python or ZODB, is there?)
...
> and "magically" be able to find a from b without
> traversing all my objects, well, then that would
> be very convenient...
...