[ZODB-Dev] Data file corruption and recovery

Jeremy Hylton jeremy@zope.com
13 Feb 2003 10:34:27 -0500


On Thu, 2003-02-13 at 09:33, Erik Dahl wrote:
> Yesterday I had a CPU failure on a box that caused the sudden reboot of
> a ZEO server.  When the service was brought up on the other side of the
> cluster it didn't start.  I figured this was due to data corruption, and
> when using a backup the server started fine.  The problem was that the
> backup was a little stale, so I wanted to try recovering the corrupt
> file.  I found two methods for fixing the file: running fsrecover.py, or
> running tranalyzer.py and then using its output to truncate the data
> file.  fsrecover.py did fix my problem, but only after running for
> around 6 hours and generating no output other than to say that no data
> was lost.  The tranalyzer method never worked.  My questions are:

What happened when you tried to start up the ZEO server?  I must admit
that I haven't run into this problem in a real deployment, and I don't
remember what the storage / server is supposed to say.

Did it create a .trN file?  That would indicate it figured out which
transactions to delete.

Do you have the original file?  If you run into file storage corruption
problems, it's helpful to us developers if you keep a copy of the
damaged file.
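
For what it's worth, here is a minimal sketch of setting the damaged file
aside before running any recovery tool on it.  The function name
preserve_damaged_copy and the ".damaged-<timestamp>" suffix are made up
for illustration; this is not part of ZODB.  It assumes the storage is
not open in any process while the copy is made.

```python
import os
import shutil
import time


def preserve_damaged_copy(path):
    """Copy the file next to itself with a timestamped suffix; return the new path."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = "%s.damaged-%s" % (path, stamp)
    if os.path.exists(dest):
        # Refuse to overwrite an earlier preserved copy.
        raise FileExistsError(dest)
    shutil.copy2(path, dest)  # copies contents and, where possible, timestamps
    return dest
```

You would call it on the data file (for example
preserve_damaged_copy("Data.fs"), assuming that is what your file is
called) and then run recovery against the original, keeping the copy for
later inspection.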

> 1. How can you figure out what the server is doing when you have a
> corrupted file?  (I tried setting STUPID_LOG_SEVERITY to -300 with no
> results.)

It did say something, right?  Otherwise you would not have known that
the storage was corrupted.  There should be some complaint during the
initial startup, but if it starts up successfully I wouldn't expect
further error reports in the log.

> 2. Any idea why taking transactions off the end of the file didn't fix
> the problem?

Do you know what changes fsrecover made to fix the problem?

> 3. Would directory storage handle this situation better, or do I need to
> go to a Berkeley DB backend?

It's surprising that truncating the file didn't solve the problem.
FileStorage should be pretty robust against these sorts of crashes.  It
calls sync() in tpc_finish(), so it's quite unlikely that a reboot would
do anything other than leave incomplete transaction data at the end of
the file.
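
To illustrate the reasoning, here is a toy sketch of an append-only file
that is synced after every complete record, plus a scan that cuts off an
incomplete tail.  This is not FileStorage's on-disk format or its actual
code; the length-prefixed record layout and the names append_record,
last_good_offset, and truncate_incomplete_tail are invented for the
example.

```python
import os
import struct

LEN = struct.Struct(">Q")  # 8-byte big-endian length (toy format, an assumption)


def append_record(path, payload):
    """Append one record framed as <length><payload><length>, then sync."""
    with open(path, "ab") as f:
        f.write(LEN.pack(len(payload)) + payload + LEN.pack(len(payload)))
        f.flush()              # push Python's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to push it to the disk


def last_good_offset(path):
    """Return the offset just past the last complete record."""
    size = os.path.getsize(path)
    pos = 0
    with open(path, "rb") as f:
        while pos + LEN.size <= size:
            f.seek(pos)
            (n,) = LEN.unpack(f.read(LEN.size))
            end = pos + LEN.size + n + LEN.size
            if end > size:
                break  # record runs past end of file: incomplete
            f.seek(end - LEN.size)
            if LEN.unpack(f.read(LEN.size))[0] != n:
                break  # trailing length disagrees with the header: damaged
            pos = end
    return pos


def truncate_incomplete_tail(path):
    """Cut the file back to the last complete record; return bytes removed."""
    size = os.path.getsize(path)
    good = last_good_offset(path)
    if good < size:
        with open(path, "r+b") as f:
            f.truncate(good)
    return size - good
```

If every record is synced before the writer reports success, a crash can
only interrupt the record being written at that moment, so the damage
should be confined to the tail and truncating at the last good offset
should be enough.  That is why a case where truncation doesn't help is
surprising.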

Jeremy