[ZODB-Dev] Data file corruption and recovery

Erik Dahl edahl@zentinel.net
Thu, 13 Feb 2003 17:44:03 -0500


Jeremy,

I never got the ZEO server to tell me anything (I started it in debug 
mode to make sure).  I wasn't in a very patient mood, though, so I didn't 
wait more than 3 minutes or so.  I figured the problem was some sort 
of corruption, since the older file loaded with no problem.  I don't see any 
.trN file, but maybe that is due to my impatience.  I have the original 
file, but it is fairly big (67MB when gzipped).  fsrecover didn't say 
anything other than that 0 data was lost during recovery.  As Toby 
suggests, it may be that the machine was writing bad data for a 
while as a result of the CPU failure.  Is it possible that there was no 
problem and that I should have been more patient with the startup time of ZEO?

-EAD

Jeremy Hylton wrote:

>On Thu, 2003-02-13 at 09:33, Erik Dahl wrote:
>
>>Yesterday I had a CPU failure on a box that caused the sudden reboot of 
>>a ZEO server.  When the service was brought up on the other side of the 
>>cluster it didn't start.  I figured this was due to data corruption, and 
>>when I used a backup the server started fine.  The problem was that the 
>>backup was a little stale, so I wanted to try recovering the corrupt file.  
>>I found two methods for fixing the file: running fsrecover.py, or running 
>>tranalyzer.py and then using its output to truncate the data file. 
>>fsrecover.py did fix my problem, but only after running for around 6 
>>hours and generating no output other than to say that no data was lost. 
>>The tranalyzer method never worked.  My questions are:
>
>What happened when you tried to start up the ZEO server?  I must admit
>that I haven't run into this problem in a real deployment, and I don't
>remember what the storage / server is supposed to say.
>
>Did it create a .trN file?  That would indicate it figured out what
>transactions to delete.
>
>Do you have the original file?  If you run into file storage corruption
>problems, it's helpful to us developers if you keep a copy of the
>damaged file.
>
>
>>1. How can you figure out what the server is doing when you have a 
>>corrupted file?  (I tried setting STUPID_LOG_SEVERITY to -300 with no 
>>results.)
>
>It did say something, right?  Otherwise you would not have known that
>the storage was corrupted.  There should be some complaint during the
>initial startup, but if it starts up successfully I wouldn't expect
>further error reports in the log.
>
>
>>2. Any idea why taking transactions off the end of the file didn't fix 
>>the problem?
>
>Do you know what changes fsrecover made to fix the problem?
>
>
>>3. Would directory storage handle this situation better, or do I need to 
>>go to a Berkeley DB backend?
>
>It's surprising that truncating the file didn't solve the problem. 
>FileStorage should be pretty robust against these sorts of crashes.  It
>calls sync() in tpc_finish(), so it's quite unlikely that a reboot would
>do anything other than leave incomplete transaction data at the end.
>
>Jeremy
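
Jeremy's point about sync() in tpc_finish() can be illustrated with a toy
append-only log.  This is only a minimal sketch under stated assumptions: the
4-byte length-prefixed record format and the function names
(append_record, last_complete_offset, truncate_incomplete_tail) are
hypothetical, and none of this is FileStorage's actual on-disk layout or a
ZODB API.  It shows the general idea: if every commit is forced to disk
before it is reported as finished, a crash should leave at most one partial
record at the end of the file, and truncating back to the last complete
record should be enough to recover.

```python
import os
import struct

# Hypothetical record layout: 4-byte big-endian length, then the payload.
HEADER = struct.Struct(">I")


def append_record(path, payload):
    """Append one length-prefixed record and force it to disk."""
    with open(path, "ab") as f:
        f.write(HEADER.pack(len(payload)) + payload)
        f.flush()
        os.fsync(f.fileno())  # the record is durable once this returns


def last_complete_offset(path):
    """Return the offset just past the last fully written record."""
    size = os.path.getsize(path)
    good = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(HEADER.size)
            if len(header) < HEADER.size:
                break  # clean end of file, or a torn header
            (length,) = HEADER.unpack(header)
            if good + HEADER.size + length > size:
                break  # the payload was cut short
            f.seek(length, os.SEEK_CUR)
            good += HEADER.size + length
    return good


def truncate_incomplete_tail(path):
    """Drop a partial trailing record; return the number of bytes removed."""
    size = os.path.getsize(path)
    good = last_complete_offset(path)
    if good < size:
        with open(path, "r+b") as f:
            f.truncate(good)
    return size - good
```

Note that this kind of scan only detects a record that was cut short.  It
would not notice complete records whose contents were written incorrectly,
which is the case Toby raised, and which may be why truncation alone did not
help here.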