[ZODB-Dev] Anybody using ZODB with no calls to fsync in production?

Fri Oct 13 15:09:24 EDT 2006

[Roché Compaan]
>> ...
>> http://mail.zope.org/pipermail/zodb-dev/2004-July/007682.html
>> ...

[Lennart Regebro]
> I read this thread, and it seems to me that the ultimate solution
> would be to have a setting for FSStorage, say "fsync-behaviour" with
> the options of "single", "double", "none" or "interval". We'd need an
> explaining text too. Something like:
>
> fsync-behaviour: Determines when fsync is called. Default: single.
>
> Options:
>
> Single: Calls fsync once per transaction. Gives you reasonable data
> reliability in most cases. You should in a crash only lose one
> transaction.

Note that the original complaint was that single-fsync could
theoretically leave the last transaction /claiming/ it completed
successfully even though the transaction data wasn't fully written to
disk.  That's a case of Data.fs corruption (and possibly undetectable
by fstest.py or fsrefs.py, if you were extremely unlucky) rather than
of merely losing a transaction.

> Double: Calls fsync before marking transaction as complete as well as
> after marking it as complete.

Which theoretically worms around the above by saving all the
transaction's data before marking the transaction complete.  Note that
it's /still/ possible to lose the last transaction (if the box crashes
after writing the transaction's data but before marking the
transaction complete).  IOW, the failure mode attributed to the
single-fsync case above actually belongs to the double-fsync case too.
The theoretical failure mode for the single-fsync case is much worse.
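
To make the ordering concrete, here is a minimal sketch of a double-fsync
commit.  The record layout (one status byte, an 8-byte length, then the
payload) and the names IN_PROGRESS, COMPLETE and commit_double_fsync are
made up for illustration; this is not FileStorage's actual on-disk format
or code.

```python
import os
import struct

IN_PROGRESS = b"c"   # hypothetical status byte: record not yet complete
COMPLETE = b" "      # hypothetical status byte: record complete


def _sync(f):
    f.flush()              # empty the user-space stream buffer
    os.fsync(f.fileno())   # ask the OS to push its cached data to disk


def commit_double_fsync(f, payload):
    """Append one record to f (a binary read/write file) using two fsyncs."""
    f.seek(0, os.SEEK_END)
    start = f.tell()
    # Write the whole record, still marked as in progress.
    f.write(IN_PROGRESS + struct.pack(">Q", len(payload)) + payload)
    _sync(f)               # first fsync: data written, not yet marked complete
    # Only now overwrite the status byte to mark the record complete.
    f.seek(start)
    f.write(COMPLETE)
    _sync(f)               # second fsync: completion marker written
    f.seek(0, os.SEEK_END)
    return start
```

A crash between the two _sync() calls leaves a record still marked in
progress, which recovery code could discard: a lost transaction, not a
corrupt one.  All of that holds only if fsync really does what it claims
on the box in question.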

> This setting is only useful if you have configured the complete storage
> chain (operating system, filesystem, drivers, controllers and disks) so
> that fsync does not return until data is safely and completely written to
> disk. In most cases, and without configuration of the complete storage
> chain, this setting will slow down FSStorage without actually increasing
> the reliability of data written to disk.
>
> Interval: Will call fsync only every couple of transactions, with the
> interval determined by the setting "fsync-interval". This is good for
> write-intensive applications where you don't mind losing a couple of
> transactions if the computer should crash.

There's again the worse possibility of corruption.  Disk controllers
typically reorder pending writes to minimize head movement, so
transaction bytes may not be stored to disk in the order written by
software.  If the box crashes "in the middle", there's no guessing
what's left on disk.
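
For illustration, an interval policy could look something like the sketch
below.  IntervalSyncer and its attributes are hypothetical names, not part
of ZODB.  The point is that every commit between two fsync calls sits in a
window where the OS and the controller are free to write blocks in any
order.

```python
import os


class IntervalSyncer:
    """Call fsync once every `interval` commits (hypothetical helper)."""

    def __init__(self, f, interval):
        if interval < 1:
            raise ValueError("interval must be at least 1")
        self.f = f
        self.interval = interval
        self.pending = 0   # commits flushed to the OS but not yet fsynced
        self.syncs = 0     # number of fsync calls made so far

    def committed(self):
        """Record one finished commit; return True if an fsync was done."""
        self.f.flush()     # always hand the bytes to the OS
        self.pending += 1
        if self.pending >= self.interval:
            os.fsync(self.f.fileno())
            self.pending = 0
            self.syncs += 1
            return True
        return False
```

With interval=3, up to two commits at a time have been handed to the OS
with no request at all to get them onto disk.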

> None: Will never call fsync. Good for applications of high write load
> where the data is not critical. Never ever use this setting on
> Windows, where it makes it highly likely that data will not be
> written to disk at all, and a crash could quite likely make you lose
> all your changes.

In truth, I expect that Windows is in exactly the same boat here:  "it
depends" on a gazillion things few people know about and that are hard
to find out.  For example, there are knobs in Windows that purport to
disable write caching on a per-hard-drive basis.  God only knows what
that truly means.

It's a good proposal, so extend it :-)  FileStorage should also grow a
method saying "do all you can to ensure that everything done so far is
committed to disk".  This would consist of:

1. Flush and fsync the .fs file (note that ZODB works at stream (FILE *)
   level, not at filehandle ("little integer") level; flush() works at the
   former level and fsync() at the latter; it does no good to fsync if the
   stdio stream buffers still hold /some/ of the data, which is why
   flushing is necessary before an fsync).

2. Update the .index file (ZODB currently does that only "when it feels like
   it", roughly every 10000 data records written, and when a FileStorage is
   closed).

3. Flush and fsync the .index file (ZODB currently never fsyncs the .index
   file).
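
The three steps above might be sketched roughly as follows.  sync_storage
and its arguments are made-up names, and index_bytes stands in for however
the index gets serialized; none of this is an existing FileStorage method.
Writing the index to a temporary file and renaming it into place is my own
addition here, so that a crash mid-write can't leave a half-written .index
behind.

```python
import os
import tempfile


def flush_and_fsync(f):
    f.flush()              # stream level: empty the user-space buffer
    os.fsync(f.fileno())   # descriptor level: ask the OS to write to disk


def sync_storage(data_file, index_path, index_bytes):
    """Do all we can to get the data file and its index onto disk."""
    # Step 1: flush and fsync the data file.
    flush_and_fsync(data_file)
    # Step 2: write a fresh copy of the index to a temporary file.
    tmp_path = index_path + ".tmp"
    with open(tmp_path, "wb") as idx:
        idx.write(index_bytes)
        # Step 3: flush and fsync the index before it replaces the old one.
        flush_and_fsync(idx)
    os.replace(tmp_path, index_path)
```

As with everything else in this thread, this is only as good as the
platform's fsync.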