[ZODB-Dev] How to check for setting the same values on persistent objects?

Fri May 6 09:50:17 EDT 2011

On Wed, May 4, 2011 at 5:53 AM, Hanno Schlichting <hanno at hannosch.eu> wrote:
> Hi.
>
> I tried to analyze the overhead of changing content in Plone a bit. It
> turns out we write back a lot of persistent objects to the database,
> even though the actual values of these objects haven't changed.
>
> Digging deeper I tried to understand what happens here:
>
> 1. persistent.__setattr__ will always set _p_changed to True and thus
> cause the object to be written back
> 2. Some BTree buckets define the "VALUE_SAME" macro. If the macro is
> available and the new value is the same as the old, the change is
> ignored
> 3. The VALUE_SAME macro is only defined for the int, long and float
> value variants but not the object based ones
> 4. All code in Products.ZCatalog does explicit comparisons of the old
> and new value and ignores non-value-changes. I haven't seen any other
> code doing this.
>
> I'm assuming doing a general check for "old == new" is not safe, as it
> might not be implemented correctly for all objects and doing the
> comparison might be expensive.
>
> But I'm still curious if we could do something about this. Some ideas:
>
> 1. Encourage everyone to do the old == new check in all application
> code before setting attributes on persistent objects.
>
> Pros: This works today, you know what type of values you are dealing
> with and can be certain when to apply this, you might be able to avoid
> some computation if you store multiple values based on the same input
> data
> Cons: It clutters all code

-1 as suggested, but it might be worth asking if there should be
changes to infrastructure that encourages lots of spurious attribute
updates.
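
For reference, the check in option 1 amounts to something like the
sketch below. FakePersistent is a hypothetical stand-in that only
mimics the "any attribute assignment marks the object changed"
behavior described in point 1 of the quoted analysis, so the example
runs without ZODB. set_if_changed is an illustrative name, not an
existing API.

```python
class FakePersistent:
    """Hypothetical stand-in: any normal attribute assignment sets _p_changed."""

    _p_changed = False

    def __setattr__(self, name, value):
        object.__setattr__(self, name, value)
        # Bookkeeping (_p_) and volatile (_v_) names do not count as changes.
        if not name.startswith(("_p_", "_v_")):
            object.__setattr__(self, "_p_changed", True)


_MISSING = object()


def set_if_changed(obj, name, value):
    """Assign only when the new value differs; return True if assigned."""
    old = getattr(obj, name, _MISSING)
    if old is not _MISSING and old == value:
        return False
    setattr(obj, name, value)
    return True


doc = FakePersistent()
doc.title = "Front page"
doc._p_changed = False  # pretend the transaction was committed

set_if_changed(doc, "title", "Front page")  # same value, no assignment
unchanged_flag = doc._p_changed

set_if_changed(doc, "title", "News")  # different value, object is marked
changed_flag = doc._p_changed
```

Every call site has to remember to use the helper, which is the
clutter the "Cons" line refers to.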

> 2. Create new persistent base classes which do the checking in their
> __setattr__ methods
>
> Pros: A lot less cluttering in the application code
> Cons: All applications would need to use the new base classes.
> Developers might not understand the difference between the variants
> and use the "checking" versions, even though they store data which
> isn't cheap to compare

-1.  This feels like adding a solution to some other solution. :)
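
To make option 2 concrete, here is a minimal sketch of such a
"checking" base class. CheckingMixin and FakePersistent are
hypothetical names; FakePersistent only imitates the "every
assignment sets _p_changed" behavior so that the example runs without
ZODB installed.

```python
class FakePersistent:
    """Hypothetical stand-in: any normal attribute assignment sets _p_changed."""

    _p_changed = False

    def __setattr__(self, name, value):
        object.__setattr__(self, name, value)
        if not name.startswith(("_p_", "_v_")):
            object.__setattr__(self, "_p_changed", True)


_MISSING = object()


class CheckingMixin:
    """Skip assignments whose new value compares equal to the stored one."""

    def __setattr__(self, name, value):
        old = self.__dict__.get(name, _MISSING)
        # Note: this relies on == being cheap and correctly implemented.
        if old is not _MISSING and old == value:
            return
        super().__setattr__(name, value)


class Page(CheckingMixin, FakePersistent):
    pass


page = Page()
page.title = "Front page"
page._p_changed = False  # pretend the transaction was committed

page.title = "Front page"  # equal value, assignment is skipped
flag_after_same = page._p_changed

page.title = "News"  # different value, goes through to the base class
flag_after_new = page._p_changed
```

The sketch trusts == for every value type, which is exactly the risk
named in the "Cons" above.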

>
> 2.a. Create new base classes and do type checking for built-in types
>
> Pros: Safer to use than always doing value comparisons
> Cons: Still separate base classes and overhead of doing type checks

ditto
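
A sketch of the 2.a variant, for comparison: values are only compared
when both have exactly the same built-in type from a small whitelist.
CHEAP_TYPES, is_cheap_same and TypeCheckingBase are hypothetical
names, and the base class is a stand-in that imitates _p_changed
rather than the real persistent machinery.

```python
# Immutable built-ins for which == is cheap and well defined (an assumption
# made for this sketch, not a list taken from ZODB).
CHEAP_TYPES = (int, float, bool, str, bytes, type(None))


def is_cheap_same(old, new):
    """True only for equal values of the exact same whitelisted type."""
    return type(old) is type(new) and type(old) in CHEAP_TYPES and old == new


class TypeCheckingBase:
    """Hypothetical base class: skips same-valued writes of simple types."""

    _p_changed = False

    def __setattr__(self, name, value):
        stored = self.__dict__
        if name in stored and is_cheap_same(stored[name], value):
            return
        object.__setattr__(self, name, value)
        if not name.startswith(("_p_", "_v_")):
            object.__setattr__(self, "_p_changed", True)


item = TypeCheckingBase()
item.count = 1
item.tags = ["a"]
item._p_changed = False  # pretend the transaction was committed

item.count = 1  # same int, skipped
flag_after_int = item._p_changed

item.tags = ["a"]  # equal list, but lists are not whitelisted, so written
flag_after_list = item._p_changed
```

Comparing exact types means 1 and 1.0 are treated as different, so a
write that would change the stored pickle is never suppressed.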

>
> 3. Compare object state at the level of the pickled binary data
>
> This would need to work at the level of the ZODB connection. When
> doing savepoints or commits, the registered objects flagged as
> _p_changed would be checked before being added to the modified list.
> In order to do this, we need to get the old value of the object,
> either by loading it again from the database or by keeping a cache of
> the non-modified state of all objects. The latter could be done in
> persistent.__setattr__, where we add the pristine state of an object
> into a separate cache before doing any changes to it. This probably
> should be a cache with an upper limit, so we avoid running out of
> memory for connections that change a lot of objects. The cache would
> only need to hold the binary data and not unpickle it.
>
> Pros: On the level of the binary data, the comparison is rather cheap
> and safe to do
> Cons: We either add more database reads or complex change tracking,
> the change tracking would require more memory for keeping a copy of
> the pristine object. Interactions with ghosted objects and the new
> cache could be fragile.

There are also possible subtle consistency issues.  If an application
assigns the same value to a variable and some other transaction
assigns a different value, should the 2 conflict? Arguably so.
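
The mechanics of option 3 can be sketched roughly as follows. This is
an illustration under simplifying assumptions, not how the ZODB
connection works: Tracked, pickled_state, really_changed and
mark_committed are hypothetical names, the pristine pickle is kept on
the object itself instead of in a bounded cache, and ghosts are
ignored entirely.

```python
import pickle


def pickled_state(obj):
    """Pickle the object's attributes, leaving out _p_ bookkeeping."""
    state = {k: v for k, v in vars(obj).items() if not k.startswith("_p_")}
    return pickle.dumps(state, protocol=2)


class Tracked:
    """Hypothetical stand-in that remembers its pickled pristine state."""

    _p_changed = False
    _p_pristine = None

    def __setattr__(self, name, value):
        if not name.startswith("_p_"):
            if self._p_pristine is None:
                # First modification since the last commit: snapshot the state.
                object.__setattr__(self, "_p_pristine", pickled_state(self))
            object.__setattr__(self, "_p_changed", True)
        object.__setattr__(self, name, value)


def really_changed(obj):
    """Commit-time check: flagged as changed and the pickle actually differs."""
    return obj._p_changed and pickled_state(obj) != obj._p_pristine


def mark_committed(obj):
    obj._p_changed = False
    obj._p_pristine = None


doc = Tracked()
doc.title = "Front page"
first_write = really_changed(doc)  # state went from empty to one attribute
mark_committed(doc)

doc.title = "Front page"  # flagged, but the pickled bytes are identical
same_write = really_changed(doc)

doc.title = "News"
new_write = really_changed(doc)
```

Byte equality of pickles is only an approximation of value equality,
since equal values can serialize differently. The sketch also says
nothing about the conflict question above: a suppressed write is
invisible to conflict detection.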

> 4. Compare the binary data on the server side
>
> Pros: We can get to the old state rather quickly and only need to deal
> with binary string data
> Cons: We make all write operations slower, by adding additional read
> overhead. Especially those which really do change data. This won't
> work on RelStorage. We only save disk space and cache invalidations,
> but still do the bulk of the work and send data over the network.
>
>
> I probably missed some approaches here. None of the approaches feels
> like a good solution to me. Doing it server side (4) is a bad idea in
> my book. Option 3 seems to be the most transparent and safe version,
> but is also the most complicated to write with all interactions to
> other caches. It's also not clear what additional responsibilities
> this would introduce for subclasses of persistent which overwrite
> various hooks.
>
> Maybe option one is the easiest here, but it would need some
> documentation about this being a best practice. Until now I didn't
> realize the implications of setting attributes to unchanged values.

I think the best approach is to revisit the application infrastructure that's
causing all these spurious updates.

Jim

-- 
Jim Fulton
http://www.linkedin.com/in/jimfulton