[ZODB-Dev] Use of fsync in FileStorage

Mon Jul 26 07:10:31 EDT 2004

On Friday 23 July 2004 19:46, Marius Gedminas wrote:

> After reading this paragraph I assumed that ZODB would call os.fsync
> before overwriting the transaction status byte (and call os.fsync again
> after overwriting it).  This turns out not to be the case -- in ZODB
> 3.2.2 (bundled with Zope 2.7.1) os.fsync is only called once, *after*
> overwriting the transaction status byte, if I understood the code
> correctly.
> 
> Doesn't this mean that if the system suddenly crashes in the middle of
> os.fsync, the Data.fs on disk will contain an incomplete transaction,
> but the transaction status byte would claim that the transaction is
> complete.  Wouldn't that be bad?

I think your analysis is correct in both parts.

1. Calling fsync exactly twice is sufficient to preserve data integrity against
"pull the power cord at an arbitrary time" problems.

2. Calling it only once is a recipe for data corruption.
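
To make the two-fsync ordering concrete, here is a minimal sketch in Python.
The record layout, the status byte values and the function names are all
hypothetical, chosen only for illustration; this is not FileStorage's real
format or code.

```python
import os
import struct

# Hypothetical record layout (NOT the real Data.fs format):
#   1 status byte | 4-byte big-endian payload length | payload
IN_PROGRESS = b"I"
COMPLETE = b"C"


def append_transaction(path, payload):
    """Append one record using two fsync calls; return its file offset."""
    with open(path, "ab"):
        pass  # make sure the file exists before opening it read/write
    with open(path, "r+b") as f:
        f.seek(0, os.SEEK_END)
        start = f.tell()
        # Write the whole record, still marked as in progress.
        f.write(IN_PROGRESS + struct.pack(">I", len(payload)) + payload)
        f.flush()
        os.fsync(f.fileno())  # 1st fsync: data is on disk before we commit
        # Only now overwrite the single status byte.
        f.seek(start)
        f.write(COMPLETE)
        f.flush()
        os.fsync(f.fileno())  # 2nd fsync: the commit itself is on disk
    return start


def read_committed(path):
    """Return payloads of complete records, stopping at the first bad one."""
    payloads = []
    with open(path, "rb") as f:
        while True:
            header = f.read(5)
            if len(header) < 5:
                break
            status = header[:1]
            (length,) = struct.unpack(">I", header[1:])
            payload = f.read(length)
            if status != COMPLETE or len(payload) < length:
                break  # in-progress or truncated record: ignore the tail
            payloads.append(payload)
    return payloads
```

With this ordering, a crash before the second fsync leaves the status byte
either old or new, and whenever it reads as complete the data behind it has
already been synced. With a single fsync at the end, the byte can reach the
disk before the data does.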

fwiw, I could swear that FileStorage used to work the way you describe.

> I've been reading http://zope.org/Wikis/ZODB/FileStorageBackup (a

The backup tools recommended in that document also suffer from fsync naivety. 
It is possible for a corrupt backup to be created if power is lost soon after 
the backup completes.
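
One way to close that window, sketched here under my own assumptions and not
as a description of any existing backup tool: sync the copy before reporting
success, and on POSIX systems sync the containing directory too. The name
durable_copy and the ".tmp" suffix are hypothetical.

```python
import os
import shutil


def durable_copy(src, dst):
    """Copy src to dst, returning only after the copy has been fsynced."""
    tmp = dst + ".tmp"  # hypothetical temporary name
    with open(src, "rb") as fin, open(tmp, "wb") as fout:
        shutil.copyfileobj(fin, fout)
        fout.flush()
        os.fsync(fout.fileno())  # file contents are on disk
    os.replace(tmp, dst)  # rename into place only after the data is synced
    # Assumption: a POSIX system, where fsync on the directory makes the
    # rename durable. Skipped where O_DIRECTORY is unavailable.
    if hasattr(os, "O_DIRECTORY"):
        dirname = os.path.dirname(os.path.abspath(dst))
        dirfd = os.open(dirname, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dirfd)
        finally:
            os.close(dirfd)
```

This only protects the copy step itself; it says nothing about whether the
source file was in a consistent state when it was read.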

On Friday 23 July 2004 22:31, Tim Peters wrote:
> We could plop in any number of additional flush+fsync calls if someone
> is having a real problem here, but calling them isn't free

> If fsync() on your box doesn't guarantee to act in an atomic ("all or
> nothing, crash or no crash") fashion, then that's a
> quality-of-implementation issue between you and your OS vendor.  

We neither expect nor need "all or nothing" behaviour from fsync over the 
whole ZODB transaction....  In the design sketched out by Marius above the 
second fsync is covering a change to only a single byte, and all modern 
hardware can do that atomically.

> We could plop in any number of additional flush+fsync calls if someone is
> having a real problem here, but calling them isn't free (moving disk
> heads is enormously expensive relative to current CPU speeds),

Who are you, and what have you done with the real Tim? We could make 
FileStorage run arbitrarily fast if there was no objection to data 
corruption.

Sooner or later this will cause a problem for someone - and I'm not sure we
could distinguish this from any other possible cause. I'm a little
disappointed that you think it's acceptable to make this compromise. (In the
default configuration.... It's fine as a tunable parameter. No one needs safety
all the time)

On Friday 23 July 2004 22:56, Neil Schemenauer wrote:
> Modern discs do write caching, some of them do it no matter what the
> OS tells them.  

As you commented later, there is nothing we can do to defend against 
uncooperative hardware. 

However, the majority of hardware that I expect people are using for ZODB today
*can* do this reliably.... That includes almost all SCSI, and all but the
bottom end of the recent IDE market.

> If you need reliability it's probably better to go with replicated
> storage.

Replication alone won't help you here.... the damage caused by this defect
might be undetectable until it's too late.

> For an amusing example of the other extreme, try running strace on
> the Subversion server.  I've seen it calling fsync thousands of
> times for each transaction.  Luckily there is a config option to
> disable that nonsense.

For a realistic example of a ZODB storage that calls fsync exactly the right 
number of times (no more no less) see DirectoryStorage. 
http://dirstorage.sourceforge.net/.

This storage has survived two days of intensive pull-the-power-cord testing 
while under heavy write pressure, while running on ordinary IDE hardware. 

It was designed with a "zero safety compromises" policy - if you need to never
lose data then this may be a better option than FileStorage.

-- 
Toby Dickenson