[ZODB-Dev] What's best to do when there is a failure in the second phase of 2-phase commit on a storage server
Jim Fulton
jim at zope.com
Fri Sep 19 13:45:32 EDT 2008
We had a rather bad error recently and I'm thinking about how to avoid
it in the future. I'm sharing it and my thoughts here to see what
helpful input others might have. :)
We got a memory error in the FileStorage _finish method, which is
called to complete the second phase of two-phase commit. This happened
while updating the tid cache (oid2tids), a dictionary that had grown
rather big. (We have many millions of objects in our databases.) This
occurred after the data had been written to disk.
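For context, the storage-level two-phase commit roughly follows this call
order. Here is a minimal in-memory sketch (TwoPhaseStorage and its method
bodies are illustrative stand-ins, not the real FileStorage implementation):

```python
class TwoPhaseStorage:
    """Toy stand-in illustrating the storage two-phase-commit call order."""

    def __init__(self):
        self.data = {}        # committed state
        self._pending = None  # writes buffered for the current transaction

    def tpc_begin(self, txn):
        self._pending = {}

    def store(self, oid, value, txn):
        self._pending[oid] = value

    def tpc_vote(self, txn):
        # End of the first phase: the storage promises it can commit.
        pass

    def tpc_finish(self, txn):
        # Second phase: make the transaction permanent and visible.
        # A failure here (like the memory error described above) leaves
        # the storage in an ambiguous, half-committed state.
        self.data.update(self._pending)
        self._pending = None
```

The point of the protocol is that tpc_finish is supposed to be trivial
and unable to fail; everything that can legitimately fail belongs in the
first phase.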
There were a number of bad outcomes of this:
- The data were written to disk, but invalidations weren't sent to
clients. Because the file storage was still functional, subsequent
reads of these objects would return the data written. This meant the
clients' view of the database was inconsistent.
- The internal FileStorage metadata was partially updated. In
particular, the object index was updated, but the last transaction
wasn't.
- The FileStorage continued to function. Subsequent commits had the
same outcome, causing more damage. Fortunately, this damage was
limited by a ClientStorage bug (see below).
- When this error occurred, the client involved was unable to commit
additional transactions due to a ClientStorage bug. ClientStorage
tpc_finish doesn't handle server errors properly. It always considers
a transaction finished at the end of tpc_finish. As a result, it
ignored the subsequent tpc_abort call and never sent a tpc_abort call
to the server. Subsequent tpc_begin calls from the client were
rejected because of the outstanding transaction for the client.
Despite the fact that this limited the damage of the other errors,
this bug needs to be fixed.
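The bookkeeping problem can be illustrated with a toy model (Server,
Client, and all names below are hypothetical stand-ins, not the real
ClientStorage/ZEO API). The fix is to consider the transaction finished
only after the server call succeeds, so that a later tpc_abort is
actually forwarded:

```python
class Server:
    """Toy server tracking one outstanding transaction per client."""

    def __init__(self, fail_finish=False):
        self.fail_finish = fail_finish
        self.outstanding = False

    def begin(self):
        if self.outstanding:
            # Mirrors the real symptom: tpc_begin rejected because of
            # the outstanding transaction for the client.
            raise RuntimeError("outstanding transaction")
        self.outstanding = True

    def finish(self):
        if self.fail_finish:
            raise RuntimeError("server error during tpc_finish")
        self.outstanding = False

    def abort(self):
        self.outstanding = False


class Client:
    """Toy client with the corrected tpc_finish bookkeeping."""

    def __init__(self, server):
        self.server = server
        self._in_transaction = False

    def tpc_begin(self):
        self.server.begin()
        self._in_transaction = True

    def tpc_finish(self):
        # The bug was clearing the flag unconditionally, even when the
        # server call raised.  Here it is cleared only on success.
        self.server.finish()
        self._in_transaction = False

    def tpc_abort(self):
        # With correct bookkeeping, an abort after a failed finish is
        # still sent to the server, releasing the transaction there.
        if self._in_transaction:
            self.server.abort()
            self._in_transaction = False
```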
The database inconsistencies resulting from these failures have caused
us a fair bit of pain.
I'm taking a number of steps to avoid this failure in the future:
1. I've removed the tid cache and the save-index-after-many-writes
features because they were both likely sources of errors in _finish.
They were also both problematic in other ways: the tid cache consumed
too much memory, and the save-index code used a flawed algorithm for
deciding how often to write, so it never actually provided any benefit.
Both features could still be worthwhile if done well some day.
2. We (ZC) are moving to 64-bit OSs. I've resisted this for a while
due to the extra memory overhead of 64-bit pointers in Python
programs, but I've finally (too late) come around to realizing that
the benefit far outweighs the cost. (In this case, the process was
around 900MB in size. It was probably trying to malloc a few hundred
MB. The malloc failed despite the fact that there was more than 2GB
of available process address space and system memory.)
3. I plan to add code to FileStorage's _finish that will, if there's
an error:
a. Log a critical message.
b. Try to roll back the disk commit.
c. Close the file storage, causing subsequent reads and writes to
fail.
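The steps in 3 might be sketched roughly like this, assuming
hypothetical helper names (_apply_second_phase, _rollback_disk_commit)
rather than the real FileStorage internals:

```python
import logging

logger = logging.getLogger("FileStorage")


class StorageClosedError(Exception):
    """Raised on access after the storage has shut itself down."""


class Storage:
    """Minimal stand-in for a FileStorage-like object (hypothetical names)."""

    def __init__(self):
        self.closed = False

    def _apply_second_phase(self, tid):
        # Stand-in for the real second-phase work (index update, etc.).
        if tid == b"bad":
            raise MemoryError("simulated failure while updating metadata")

    def _rollback_disk_commit(self, tid):
        # Stand-in for undoing the on-disk commit (e.g. truncating the
        # file back to its pre-transaction length).
        pass

    def close(self):
        self.closed = True

    def tpc_finish(self, tid):
        if self.closed:
            raise StorageClosedError("storage is closed")
        try:
            self._apply_second_phase(tid)
        except Exception:
            # a. Log a critical message.
            logger.critical("error in second phase of 2-phase commit",
                            exc_info=True)
            try:
                # b. Try to roll back the disk commit.
                self._rollback_disk_commit(tid)
            finally:
                # c. Close the storage so subsequent reads and writes
                # fail loudly rather than using inconsistent metadata.
                self.close()
            raise
```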
4. I plan to fix the client storage bug.
I can see 3c being controversial. :) In particular, it means that your
application will be effectively down without human intervention.
I considered some other ideas:
- Try to get FileStorage to repair its metadata. This is certainly
theoretically doable. For example, it could rebuild its in-memory
index. At this point, that's the only thing in question. OTOH,
updating it is the only thing left to fail at this point. If updating
it fails, it seems likely that rebuilding it will fail as well.
- Have a storage server restart when a tpc_finish call fails. This
would work fine for FileStorage, but might be the wrong thing to do
for another storage. The server can't know.
OTOH, if there is a failure at a higher level, the server might
want to restart. In particular, if the call to tpc_finish on the
underlying storage has succeeded, but invalidations haven't been sent,
a storage server restart seems appropriate.
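That distinction could be sketched as follows (hypothetical names, not
the real ZEO server code): a failure inside the storage's tpc_finish is
the storage's problem to handle, while a failure after it has succeeded
but before invalidations go out argues for a server restart:

```python
def server_tpc_finish(storage, txn, send_invalidations, restart):
    """Toy model of server-side finish handling (hypothetical API)."""
    try:
        tid = storage.tpc_finish(txn)
    except Exception:
        # The storage failed internally.  The server can't know how an
        # arbitrary storage should recover, so just propagate.
        raise
    try:
        send_invalidations(tid)
    except Exception:
        # The data is committed but clients weren't told, so their
        # caches are now stale.  Restarting the server (forcing clients
        # to reconnect and verify their caches) is a reasonable recovery.
        restart()
        raise
```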
The good news is that after doing 1, I think the chance of a failure
in _finish is vastly reduced. I think that, in practice, the steps in
3, especially 3c, will never be necessary. Still, I think it's
prudent to take (tested) steps to handle even this unlikely case.
Comments are welcome.
Jim
--
Jim Fulton
Zope Corporation