[ZODB-Dev] What's best to do when there is a failure in the second phase of 2-phase commit on a storage server
Jim Fulton
jim at zope.com
Tue Sep 30 18:30:32 EDT 2008
On Sep 30, 2008, at 1:38 PM, Dieter Maurer wrote:
> Jim Fulton wrote at 2008-9-19 13:45 -0400:
>> ...
>> 2. We (ZC) are moving to 64-bit OSs. I've resisted this for a while
>> due to the extra memory overhead of 64-bit pointers in Python
>> programs, but I've finally (too late) come around to realizing that
>> the benefit far outweighs the cost. (In this case, the process was
>> around 900MB in size.
>
> That is very strange.
> On our Linux systems (Debian etch), the processes can use 2.7 to 2.9 GB
> of memory before the OS refuses to allocate more.
Yeah. Strange.
>> It was probably trying to malloc a few hundred
>> MB. The malloc failed despite the fact that there was more than 2GB
>> of available process address space and system memory.)
>>
>> 3. I plan to add code to FileStorage's _finish that will, if there's
>> an error:
>>
>> a. Log a critical message.
>>
>> b. Try to roll back the disk commit.
I decided not to do this. Too complicated.
>>
>>
>> c. Close the file storage, causing subsequent reads and writes to
>> fail.
>
> Raise an easily recognizable exception.
I raise the original exception.
> In our error handling we look out for some nasty exceptions and enforce
> a restart in such cases. The exception above might be such a nasty
> exception.
The critical log entry should be easy enough to spot.
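
A minimal sketch of the behavior described above (log a critical message, close the storage, re-raise the original exception). This is a toy class for illustration only, not FileStorage's actual _finish. The names ToyStorage and update_index are hypothetical, and the sketch assumes the data is already safely on disk, so the in-memory index update is the only step left that can fail.

```python
import logging

logger = logging.getLogger("toy.storage")


class ToyStorage:
    """Hypothetical stand-in for a file-backed storage; not ZODB's FileStorage."""

    def __init__(self, update_index):
        # update_index is an assumed callable that applies changes to the
        # in-memory index and may fail (for example with MemoryError).
        self._update_index = update_index
        self._index = {}
        self._closed = False

    def close(self):
        self._closed = True

    def load(self, oid):
        if self._closed:
            raise RuntimeError("storage is closed")
        return self._index[oid]

    def tpc_finish(self, changes):
        if self._closed:
            raise RuntimeError("storage is closed")
        try:
            # Second phase: the disk write is assumed done; only the
            # in-memory index remains to be updated.
            self._update_index(self._index, changes)
        except BaseException:
            # a. Log a critical message, including the traceback.
            logger.critical(
                "Failure in second phase of commit; closing storage",
                exc_info=True,
            )
            # c. Close the storage so later reads and writes fail.
            self.close()
            # Re-raise the original exception unchanged.
            raise
```

After a failed tpc_finish, every later load or tpc_finish call on this toy object raises, so the inconsistent in-memory state is never served. The caller sees the original exception, and the critical log entry marks the point where a restart is needed.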
...
>> I considered some other ideas:
>>
>> - Try to get FileStorage to repair its meta data. This is certainly
>> theoretically doable. For example, it could re-build its in-memory
>> index. At this point, that's the only thing in question. OTOH,
>> updating it is the only thing left to fail at this point. If updating
>> it fails, it seems likely that rebuilding it will fail as well.
>>
>> - Have a storage server restart when a tpc_finish call fails. This
>> would work fine for FileStorage, but might be the wrong thing to do
>> for another storage. The server can't know.
>
> Why do you think that a failing "tpc_finish" is less critical
> for some other kind of storage?
It's not a question of criticality. It's a question of whether a
restart will fix the problem. I happen to know that a file storage
would be in a reasonable state after a restart. I don't know this to
be the case for some other storage.
Jim
--
Jim Fulton
Zope Corporation