[ZODB-Dev] What's best to do when there is a failure in the second phase of 2-phase commit on a storage server

Tue Sep 30 13:38:24 EDT 2008

Jim Fulton wrote at 2008-9-19 13:45 -0400:
> ...
>2. We (ZC) are moving to 64-bit OSs.  I've resisted this for a while  
>due to the extra memory overhead of 64-bit pointers in Python  
>programs, but I've finally (too late) come around to realizing that  
>the benefit far outweighs the cost.  (In this case, the process was  
>around 900MB in size.

That is very strange.
On our Linux systems (Debian etch), processes can use 2.7 to 2.9 GB
of memory before the OS refuses to allocate more.

>It was probably trying to malloc a few hundred  
>MB.  The malloc failed despite the fact that there was more than 2GB  
>of available process address space and system memory.)
>
>3. I plan to add code to FileStorage's _finish that will, if there's  
>an error:
>
>   a. Log a critical message.
>
>   b. Try to roll back the disk commit.
>
>   c. Close the file storage, causing subsequent reads and writes to  
>fail.

In addition, raise an easily recognizable exception.

Our error handling watches for a few nasty exceptions and forces
a restart when it sees one. The exception above could be treated as one
of them.

If possible, the exception should carry full information about
the original exception (in the style of Java's nested exceptions,
which Tim emulated in some places in the ZODB code).
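
A minimal sketch of such an exception, in present-day Python. The names
CommitFinishError, finish and update_index are made up for illustration
and are not existing ZODB API; the point is only that the wrapper is easy
to catch by type and still exposes the original error and its traceback.

```python
import traceback


class CommitFinishError(Exception):
    """Hypothetical marker for a failure in the second commit phase."""

    def __init__(self, message, original):
        super().__init__(message)
        # Keep the original exception object for programmatic inspection.
        self.original = original
        # Keep the formatted traceback as text so it survives logging.
        self.original_traceback = "".join(
            traceback.format_exception(
                type(original), original, original.__traceback__
            )
        )

    def __str__(self):
        return "%s (caused by %s: %s)" % (
            self.args[0],
            type(self.original).__name__,
            self.original,
        )


def finish(update_index):
    """Stand-in for a storage's finish step.

    update_index is a hypothetical callable representing the last piece
    of work that can still fail after the data is on disk.
    """
    try:
        update_index()
    except Exception as exc:
        # "from exc" also records the original as __cause__.
        raise CommitFinishError("second phase of commit failed", exc) from exc
```

A caller can then test for CommitFinishError alone and still log
original_traceback to see what actually went wrong.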

>4. I plan to fix the client storage bug.
>
>I can see 3c being controversial. :) In particular, it means that your  
>application will be effectively down without human intervention.

That's why I would prefer an easily recognizable exception: it lets us
restart automatically.
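
Roughly, the handling I have in mind looks like the sketch below. It is
an assumption-laden illustration, not our production code:
CommitFinishError, FATAL_EXCEPTIONS, run_guarded and request_restart are
all hypothetical names, and how the restart is actually carried out
(for example by exiting and letting a supervisor start the process again)
is left to the callback.

```python
import logging

logger = logging.getLogger("restart-sketch")


class CommitFinishError(Exception):
    """Hypothetical stand-in for the easily recognizable exception."""


# Exceptions after which the process state is not trusted any more.
FATAL_EXCEPTIONS = (CommitFinishError, MemoryError)


def run_guarded(handler, request_restart):
    """Run handler(); on a fatal exception, ask for a restart and re-raise.

    request_restart is a hypothetical callback supplied by the application.
    Ordinary exceptions pass through without triggering a restart.
    """
    try:
        return handler()
    except FATAL_EXCEPTIONS:
        logger.critical("unrecoverable error, requesting restart",
                        exc_info=True)
        request_restart()
        raise
```

Because the check is a plain type test, no log parsing or message
matching is needed to decide whether to restart.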

>I considered some other ideas:
>
>- Try to get FileStorage to repair its metadata.  This is certainly  
>theoretically doable.  For example, it could rebuild its in-memory  
>index. At this point, that's the only thing in question. OTOH,  
>updating it is the only thing left to fail at this point.  If updating  
>it fails, it seems likely that rebuilding it will fail as well.
>
>- Have a storage server restart when a tpc_finish call fails.  This  
>would work fine for FileStorage, but might be the wrong thing to do  
>for another storage.  The server can't know.

Why do you think that a failing "tpc_finish" is less critical
for some other kind of storage?

-- 
Dieter