[ZODB-Dev] Referential integrity

Sat, 02 Mar 2002 01:14:34 -0500

Your problem is harder than mine...

How I do it (a room-booking application):

Booking
	user property
		User object
	other properties
	conflicts (regular attribute, calculated from the other attributes)

With all bookings stored in a single collection (referenced all over the 
place, but required to be in the collection), and all users stored in a 
single collection (same idea).

property interactions:  When Booking properties change, the property 
classes for each property trigger a check for conflicts (they use a 
bisect-based index of the Bookings collection so that this only takes a 
few milliseconds/attribute update) and a notification to all GUI objects 
to update the visualisation.

user table watching: On deletion of a user, need to find all bookings 
referencing that user and let the administrator figure out what to do 
with them.  So, using Patrick's dispatcher module, I watch the user 
collection's 'delete' and 'deletes' messages (basically, the collection 
class generates these events whenever a user, or multiple users are 
deleted).  The watcher just scans the Booking table, loading each 
booking and making sure that the user property is still valid.

That, of course, is horribly inefficient, as _every single booking_ (and 
there can be tens or hundreds of thousands of them) needs to be loaded 
into memory just to do the single integrity check.  I could, I suppose, 
create an index mapping booking:user, but since deleting a user is a 
_very_ infrequent event (should only happen 1 or 2 times a year in this 
system when there is a need to clean up the user list to make it easier 
to scan, and could even be disallowed entirely if I felt like being 
hard-nosed about it), I don't care.

The thing about doing the referential integrity thing efficiently is 
that you really need to be able to scan a field in a table, not a 
collection of objects which require loading/storing before you can 
process them.  I could see doing it with heavily constrained object 
collections such as you see in other ODBMS (for example, a collection of 
homogenous objects where only properties are stored for a certain root 
class, all instances have the same base set of properties (plus an 
optional "extra data" field, or an extension table with sub-class data) 
which can be quickly scanned)), or, more generally with an indexing 
scheme that can efficiently build tables based on object properties.

As far as I know, there's no automatic way to create an index table for 
the ZODB (you'd want it to be outside the ZODB so that it wouldn't be 
versioned, could use static data typing, etceteras).  Hyper-G (another 
object database (big web content engine as well, but underneath, a 
distributed database)) used to have an explicit "references" table that 
could be scanned, and all references went through that table (rather 
than directly to the object).  That would be a pretty significant 
restructuring of the ZODB, though, I think.  [Conceptually, the 
reference happens with the ZODB oid, but there's no table mapping 
referer to referee with oids, creating that table and the support for it 
being the mentioned restructuring]

All-in-all, lots of hacky ways to do what you want, but I can't think of 
an elegant and clean one that could be done today.  Here are some ideas 
of what you could do with some fairly minor surgery:

	1) References table -- objects wanting referential integrity checks must 
go through a table, table should not use versioning, should have tightly 
constrained semantics, maybe as simple as:
		refererID:referenceID
So that your ref integrity scan would just flag all referer's as 
"potentially invalid" and report them to the application, which would 
figure out what to do with them.
Note: this can wind up being a _huge_ (and highly redundant, when you 
consider that all of this information is stored in the ODBMS anyway) and 
very volatile table, so you'd really want to keep it in an RDBM, I'd think.

To make this pretty, you want the machinery to support it going into the 
pickler, so that the references are automatically maintained (i.e. on 
store, you take the old references _from_ object, remove them, then add 
the new references _from_ object.  Similarly, the GC would scan the 
table automatically on de-commisioning an object and report possible ref 
integrity failures for all refs _to_ object.)  Unfortunately, to create 
any sophisticated checks, you'd want the referers to be loaded and 
processed on each store of the referenced object (i.e. if your check is 
"owner of Booking must be a user with more than $2000 on account", you 
want to load each referencing Booking and run the check for it). 
Blech... this is engineering teritory if you're going to avoid a mass of 
spaghetti and soup.

	2) Generic index scanner -- for any indexed property, where property 
value == changed object, flag as invalid.  This is a better approach if 
you're going to have just about every attribute indexed anyway and don't 
mind doing a bit of extra scanning to figure out what values are object 
references.  Doesn't support ref-integ for properties not indexed. 
Properly worked, you could make the index storage also act as primary 
storage for the data values I suppose, which would reduce the 
redundancy/overhead of the indexing scheme (but make the object records 
more complex)

	3) Persistent Weakrefs -- for your particular app, a persistent weak 
reference, with back references would seem to work (basically, keep the 
ref:ref table as an attribute of your objects).

Basically in my application, the only thing that I need the checks to be 
fast on are the "conflict" sub-tables.  For those, I constructed my own 
index so I can do time-range-based queries very cheaply.  For everything 
else I just watch the collections and do the brute-force load-and-scan 
checks when an item is deleted (which is the only check I currently care 
about).

Dumb designer will now stop blathering at crowd of comp-sci and comp-eng 
people about things they already know :) .  I'll go back to drawing now :) ,
Mike

Magnus Lyckå wrote:
> At 17:47 2002-03-01 -0500, Mike Fletcher wrote:
> 
>> I actually had to deal with referential integrity problems in an app 
>> just two weeks ago.  Best approach I found was to wire in a watcher 
>> that looks for ObjectDeleted events on the collection, and on finding 
>> one, does the scan explicitly.  Similarly, the property classes for 
>> the objects refresh conflict sub-tables and check object consistency 
>> as soon as an attribute is altered.
> 
> 
> Huh? How do you do that? I'm just thinking about a
> similar case in my app. (But here the problem is probably
> more a matter of finding all the places where I might have
> a link to this object. There's no magical "who references
> me" function in Python or ZODB, is there?)
...
> and "magically" be able to find a from b without
> traversing all my objects, well, then that would
> be very convenient...
...