[ZODB-Dev] What's best to do when there is a failure in the second phase of 2-phase commit on a storage server

Tue Sep 30 18:30:32 EDT 2008

On Sep 30, 2008, at 1:38 PM, Dieter Maurer wrote:

> Jim Fulton wrote at 2008-9-19 13:45 -0400:
>> ...
>> 2. We (ZC) are moving to 64-bit OSs.  I've resisted this for a while
>> due to the extra memory overhead of 64-bit pointers in Python
>> programs, but I've finally (too late) come around to realizing that
>> the benefit far outweighs the cost.  (In this case, the process was
>> around 900MB in size.
>
> That is very strange.
> On our Linux systems (Debian etch), the processes can use 2.7 to 2.9 GB
> of memory before the OS refuses to allocate more.

Yeah. Strange.

>> It was probably trying to malloc a few hundred
>> MB.  The malloc failed despite the fact that there was more than 2GB
>> of available process address space and system memory.)
>>
>> 3. I plan to add code to FileStorage's _finish that will, if there's
>> an error:
>>
>>  a. Log a critical message.
>>
>>  b. Try to roll back the disk commit.

I decided not to do this. Too complicated.

>>
>>
>>  c. Close the file storage, causing subsequent reads and writes to
>> fail.
>
> Raise an easily recognizable exception.

I raise the original exception.
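
Roughly, the behavior discussed here (log a critical message, close the
storage, re-raise the original exception) could look like the sketch
below. ToyStorage and all of its methods are made up for illustration;
this is not FileStorage's actual code, and the in-memory dict merely
stands in for the real index.

```python
import logging

logger = logging.getLogger("toy.storage")


class ToyStorage:
    """Hypothetical stand-in for a storage; not the FileStorage code."""

    def __init__(self):
        self._index = {}      # stands in for the in-memory index
        self._closed = False

    def close(self):
        self._closed = True

    def load(self, oid):
        if self._closed:
            raise ValueError("storage is closed")
        return self._index[oid]

    def _update_index(self, pending):
        # The step assumed to be able to fail after data is on disk.
        self._index.update(pending)

    def _finish(self, pending):
        if self._closed:
            raise ValueError("storage is closed")
        try:
            self._update_index(pending)
        except Exception:
            # a. Log a critical message (with the traceback).
            logger.critical("failure in _finish; closing storage",
                            exc_info=True)
            # c. Close the storage so later reads and writes fail.
            self.close()
            # Re-raise the original exception unchanged.
            raise
```

A bare "raise" inside the except block keeps the original exception type
and traceback, so callers see exactly what went wrong.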

> In our error handling we look out for some nasty exceptions and enforce
> a restart in such cases. The exception above might be such a nasty
> exception.

The critical log entry should be easy enough to spot.
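
One possible way to spot such an entry programmatically, as a sketch
only: a handler from Python's standard logging module that remembers
whether a CRITICAL record has been seen. The class name CriticalWatcher,
the restart_requested attribute, and the logger name are all
hypothetical; what to do once the flag is set is left to the
application.

```python
import logging


class CriticalWatcher(logging.Handler):
    """Hypothetical handler that notes when a CRITICAL record is logged."""

    def __init__(self):
        # Only records at CRITICAL level or above reach emit().
        super().__init__(level=logging.CRITICAL)
        self.restart_requested = False

    def emit(self, record):
        self.restart_requested = True


watcher = CriticalWatcher()
log = logging.getLogger("toy.watch")   # assumed logger name
log.propagate = False                  # keep the example quiet
log.addHandler(watcher)

log.error("ordinary error")            # below CRITICAL, ignored
log.critical("failure in _finish")     # sets the flag
```

An application's error handling could check watcher.restart_requested
and enforce a restart when it is set.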

...

>> I considered some other ideas:
>>
>> - Try to get FileStorage to repair its metadata.  This is certainly
>> theoretically doable.  For example, it could rebuild its in-memory
>> index. At this point, that's the only thing in question. OTOH,
>> updating it is the only thing left to fail at this point.  If updating
>> it fails, it seems likely that rebuilding it will fail as well.
>>
>> - Have a storage server restart when a tpc_finish call fails.  This
>> would work fine for FileStorage, but might be the wrong thing to do
>> for another storage.  The server can't know.
>
> Why do you think that a failing "tpc_finish" is less critical
> for some other kind of storage?

It's not a question of criticality.  It's a question of whether a  
restart will fix the problem.  I happen to know that a file storage  
would be in a reasonable state after a restart.  I don't know this to  
be the case for some other storage.

Jim

--
Jim Fulton
Zope Corporation