[ZODB-Dev] Use of fsync in ZODB

Fri Jul 23 17:31:47 EDT 2004

[Marius Gedminas]
> I've been reading http://zope.org/Wikis/ZODB/FileStorageBackup (a
> wonderful resource, thanks Tim!).

You're welcome.  Make it more wonderful by adding your secret knowledge to
it too <wink>.

> Among other things, it says:
>
>    There's one exception to the rule that bytes in a FileStorage are
>    never overwritten. When a new transaction is committed to a
>    FileStorage, a special status byte is first written to record that a
>    transaction record append has started. When the append is complete,
>    this status byte (near the start of the new transaction record) is
>    overwritten with a value indicating that the commit completed
>    successfully. If, for example, the computer crashes before the append
>    is complete, the next time the FileStorage is opened the status byte
>    still has its initial "append started but didn't complete" value, and
>    the FileStorage is then truncated, to remove the incomplete append.
>
> After reading this paragraph I assumed that ZODB would call os.fsync
> before overwriting the transaction status byte (and call os.fsync again
> after overwriting it).  This turns out not to be the case -- in ZODB
> 3.2.2 (bundled with Zope 2.7.1) os.fsync is only called once, *after*
> overwriting the transaction status byte, if I understood the code
> correctly.

That's true.
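
For concreteness, here is a minimal sketch of the sequence under discussion:
append a record whose status byte says "in progress", overwrite that byte,
then sync once. The record layout (1 status byte, 8-byte length, payload),
the marker values, and the names append_record and recover are all invented
for illustration; this is not FileStorage's actual format or code.

```python
import os
import struct

# Hypothetical marker values, chosen only for this sketch.
IN_PROGRESS = b"c"   # "append started but didn't complete"
COMPLETE = b" "      # "commit completed successfully"
HEADER_SIZE = 9      # 1 status byte + 8-byte big-endian payload length


def append_record(path, payload):
    """Append one record, flip its status byte, then fsync once."""
    if not os.path.exists(path):
        open(path, "wb").close()
    with open(path, "r+b") as f:
        f.seek(0, os.SEEK_END)
        start = f.tell()
        f.write(IN_PROGRESS + struct.pack(">Q", len(payload)) + payload)
        # Overwrite the status byte; nothing has been synced yet.
        f.seek(start)
        f.write(COMPLETE)
        # The single flush+fsync happens *after* the overwrite.
        f.flush()
        os.fsync(f.fileno())
    return start


def recover(path):
    """Truncate the file after the last record marked complete."""
    with open(path, "r+b") as f:
        size = f.seek(0, os.SEEK_END)
        pos = 0
        while pos < size:
            f.seek(pos)
            header = f.read(HEADER_SIZE)
            if len(header) < HEADER_SIZE or header[:1] != COMPLETE:
                break
            (length,) = struct.unpack(">Q", header[1:])
            if pos + HEADER_SIZE + length > size:
                break  # record runs past end of file
            pos += HEADER_SIZE + length
        f.truncate(pos)
    return pos
```

Note that recover() trusts the status byte: a record marked complete whose
payload bytes are all present but wrong would pass unnoticed, which is the
scenario the question is asking about.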

> Doesn't this mean that if the system suddenly crashes in the middle of
> os.fsync, the Data.fs on disk will contain an incomplete transaction, but
> the transaction status byte would claim that the transaction is complete?
> Wouldn't that be bad?

If that happened, perhaps.  POSIX doesn't define any facility guaranteed to
keep file contents "as expected" across a system crash -- fsync() is as
close as it gets, and POSIX defines almost nothing about what fsync() must
do; indeed:

    it is explicitly intended that a null implementation is permitted

and even if fsync() completes,

    fsync() might or might not actually cause data to be written where it
    is safe from a power failure

Those are from the current POSIX spec, at

    http://www.opengroup.org/onlinepubs/009695399/functions/fsync.html

If fsync() on your box doesn't guarantee to act in an atomic ("all or
nothing, crash or no crash") fashion, then that's a
quality-of-implementation issue between you and your OS vendor.  According
to point #5 in

   http://mail.python.org/pipermail/python-dev/2003-September/038375.html

Linux will allow fsync() to return even if the data isn't actually on disk.

We could plop in any number of additional flush+fsync calls if someone is
having a real problem here, but calling them isn't free (moving disk heads
is enormously expensive relative to current CPU speeds), and if an OS has an
unreliable fsync() I'm not sure it can be made more reliable just by calling
it a lot <wink>.
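
To make "additional flush+fsync calls" concrete, here is a rough sketch of
a two-sync commit: sync the appended record first, and only then overwrite
the status byte and sync again. The record layout (1 status byte, 8-byte
length, payload), the byte values, and the function name are assumptions
made up for this example, not ZODB code, and the extra sync only helps to
the extent that the platform's fsync() really does put data on disk.

```python
import os
import struct


def append_record_two_syncs(path, payload):
    """Append a record with a sync before and after the status flip."""
    if not os.path.exists(path):
        open(path, "wb").close()
    with open(path, "r+b") as f:
        f.seek(0, os.SEEK_END)
        start = f.tell()
        # Hypothetical layout: b"c" means "append in progress".
        f.write(b"c" + struct.pack(">Q", len(payload)) + payload)
        # First sync: ask the OS to push the record body out before
        # anything claims the record is complete.
        f.flush()
        os.fsync(f.fileno())
        # Overwrite the status byte; b" " means "complete" here.
        f.seek(start)
        f.write(b" ")
        # Second sync: ask the OS to push the status byte out too.
        f.flush()
        os.fsync(f.fileno())
    return start
```

Each commit now pays for two syncs instead of one, which is the cost
mentioned above.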