[ZODB-Dev] Use of fsync in FileStorage

Mon Jul 26 13:13:05 EDT 2004

[Toby Dickenson]
...
> 1. Calling fsync exactly twice is sufficient to preserve data integrity
> against "pull the power cord at an arbitrary time" problems.
>
> 2. Calling it only once is a recipe for data corruption.

Your faith in fsync() is ... interesting.  You snipped the references from
the original msg to (a) the POSIX spec, which doesn't promise anything of
the sort from fsync(); and, (b) the long python-dev discussion which
specifically questioned fsync()'s "reliability" on Linux.

> fwiw, I could swear that FileStorage used to work the way you describe.

Possibly.  The oldest release I could find was StandaloneZODB 1.0, from
February 2002, which was the same in this respect.

[Marius Gedminas]
>> I've been reading http://zope.org/Wikis/ZODB/FileStorageBackup (a

> The backup tools recommended in that document also suffer from fsync
> naivety. It is possible for a corrupt backup to be created if power is
> lost soon after the backup completes.

The only backup tool recommended there is repozo.py.  It does its writes,
explicitly closes the output file, then exits.  It does not do an fsync().
I don't know why you believe adding an fsync() would make that bulletproof,
if you do believe that.  (BTW, note that even if fsync() is bulletproof on
some system, calling fsync() alone isn't enough for programs working at the
higher stream level:  you need to call fflush() first.)
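
A minimal Python sketch of that flush-then-fsync ordering, for concreteness.
The helper name write_durably is hypothetical and is not taken from repozo.py
or FileStorage.  Whether the os.fsync() call actually gets the bytes onto the
platter is exactly the open question in this thread; the code only shows the
order of the calls.

```python
import os

def write_durably(path, data):
    """Write bytes to path, flushing stream buffers before asking for a sync.

    Hypothetical helper for illustration only.
    """
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # push Python's stream-level buffer to the OS
        os.fsync(f.fileno())   # ask the OS to push its buffers to the device
```

Without the flush() call, os.fsync() could run while some of the data is
still sitting in the process's own buffer, where the OS has never seen it.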

> ...
> We neither expect nor need "all or nothing" behaviour from fsync over the
> whole ZODB transaction....  In the design sketched out by Marius above
> the second fsync is covering a change to only a single byte, and all
> modern hardware can do that atomically.

Writing a byte atomically to a HW disk buffer isn't the same as writing a
byte atomically to disk, and HW has its own ideas about when "a write" has
finished.  The HW buffers may not even get written to disk in the order they
were written (smart controllers dynamically reorder writes to minimize head
movement), so it's still possible to get the "transaction succeeded" byte
written to disk before all the data in the transaction appears on disk.
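
For reference, the two-fsync scheme under debate might be sketched roughly
as below.  The record layout and the PENDING/DONE marker values are
assumptions made up for this illustration, not FileStorage's actual format.
Nothing in the code can force the hardware to honor the ordering; that
depends on what fsync() really does on the system at hand.

```python
import os

PENDING = b"c"   # hypothetical marker: transaction still being written
DONE = b" "      # hypothetical marker: transaction complete

def commit(f, payload):
    """Append payload behind a one-byte status marker, syncing twice.

    f must be a file opened in a binary read/write mode (e.g. "r+b").
    Returns the file offset of the status byte.
    """
    f.seek(0, os.SEEK_END)
    status_pos = f.tell()
    f.write(PENDING + payload)
    f.flush()
    os.fsync(f.fileno())       # first sync: the data, still marked pending
    f.seek(status_pos)
    f.write(DONE)
    f.flush()
    os.fsync(f.fileno())       # second sync: only the single status byte
    return status_pos
```

The argument for the scheme is that a reader finding PENDING after a crash
can discard the partial record.  The objection above is that a reordering
controller could persist DONE before the payload it is supposed to vouch for.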

This kind of thing is why POSIX/SUS refuses to make promises about what
fsync() does, but does make this request (apparently honored in the breach
by most claimed-to-be POSIX systems):

    http://www.opengroup.org/onlinepubs/009695399/functions/fsync.html

    ...
    In the middle ground between these extremes, fsync() might or might
    not actually cause data to be written where it is safe from a power
    failure.  The conformance document should identify at least that one
    configuration exists (and how to obtain that configuration) where
    this can be assured for at least some files that the user can select
    to use for critical data.  It is not intended that an exhaustive list
    is required, but rather sufficient information is provided so that
    if critical data needs to be saved, the user can determine how the
    system is to be configured to allow the data to be written to non-
    volatile storage.

[Tim Peters]
>> We could plop in any number of additional flush+fsync calls if someone
>> is having a real problem here, but calling them isn't free (moving disk
>> heads is enormously expensive relative to current CPU speeds),

[Toby]
> Who are you, and what have you done with the real Tim? We could make
> FileStorage run arbitrarily fast if there was no objection to data
> corruption.

The real Tim is skeptical about easy answers to difficult problems -- I
don't have any *evidence* that fsync() actually helps on most Unixy systems,
knowledgeable people have questioned whether it does in real life, and the
relevant standards explicitly refuse to promise that it does anything.
ZODB definitely needs os.fsync() calls on Windows, BTW (Windows doesn't have
fsync(), but Python maps os.fsync() to Microsoft's _commit(), and the latter
does do necessary Windows things).

> Sooner or later this will cause a problem for someone -

I believe that power outages can cause problems.  What I'm skeptical of is
that adding an fsync() call can guarantee to prevent them.  If it in fact
does not, then fostering a false sense of security wouldn't be actual
progress.

> and I'm not sure we could distinguish this from any other possible cause.

Clearly not from a bad disk, bad disk controller, stray gamma ray, ...
disasters are disasters.

> I'm a little disappointed that you think it's acceptable to make this
> compromise.

It's because I don't know that anything *is* being compromised here.
Chanting "fsync()" appears largely to be superstition to me, based on all I
actually know about the issues.

> ...
> For a realistic example of a ZODB storage that calls fsync exactly the
> right number of times (no more no less) see DirectoryStorage.
> http://dirstorage.sourceforge.net/.
>
> This storage has survived two days of intensive pull-the-power-cord
> testing while under heavy write pressure, while running on ordinary IDE
> hardware.

1. There's nothing you can do manually that's "intensive" relative to
   hardware speeds -- even relatively slow disk speeds.  How many times
   were you able to power cycle in this test?  Is "survive" the same thing
   as "no corruption"?

2. Did you also try this test without your fsync() calls?  If so, what
   was the mean number of power cycles between observed corruptions?

IOW, a relevant test here would compare two scenarios, changing only the one
thing at issue between them, and quantify failure rates.
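
The figure asked for in point 2 could be computed along these lines.  This
is only a sketch: the function name is hypothetical, and it assumes each
power cycle is recorded as a boolean that is True when corruption was found
afterwards.

```python
def mean_cycles_between_corruptions(outcomes):
    """Return power cycles per observed corruption.

    outcomes: iterable of booleans, one per power cycle, True when that
    cycle left the storage corrupt.  Returns None when no corruption was
    observed, since the mean is undefined in that case.
    """
    outcomes = list(outcomes)
    corruptions = sum(1 for corrupted in outcomes if corrupted)
    if corruptions == 0:
        return None
    return len(outcomes) / corruptions
```

Running it once over the results with the extra fsync() and once over the
results without it, with everything else held constant, would give the two
numbers to compare.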

That said, I haven't stayed up for 48 hours pulling the plug on FileStorage,
and I doubt Jim would pay me to do that <wink>.  If someone wants to try
pulling the plug on FileStorage, with and without "a second" fsync() on
various Unixy systems, and report results, that would be great.  If there's
evidence that more fsync'ing would actually help, I'll be grateful, and
delighted to add it.

> It was designed with a "zero safety compromises" policy - if you need to
> never lose data then this may be a better option than FileStorage.

FWIW, I think DirectoryStorage is very well designed indeed.  OTOH, we
currently have no evidence of any data loss due to FileStorage bugs either
("missing fsync" or otherwise).  The most recent report of FileStorage
corruption was traced to a RAID controller that failed only under heavy
load:

    http://collector.zope.org/Zope/1420

Another fsync() wouldn't have helped him, and would not have helped any
other case of FileStorage corruption I know about where the cause became
known.  OTOH, Zope Corp never sees FileStorage corruption in its own
deployments of Zope, so I've got a very small universe of FileStorage
corruption problems to work with -- and the power in Fredericksburg goes out
often <0.5 wink>.