[ZODB-Dev] Use of fsync in FileStorage

Tim Peters tim at zope.com
Tue Jul 27 00:39:19 EDT 2004


[Shane Hathaway]
> It concerns me that we rely on fsync to maintain the integrity of a
> FileStorage.

I think we rely much more on FileStorage's simplicity ("append") than on
fsync.  In the python-dev thread I referenced earlier, Martin was arguing
that fsync wasn't doing anything for ZODB.  That thread was prompted by the
accidental omission of os.fsync() in Python 2.3 on POSIX systems, and the
argument was whether it was worth pushing out 2.3.1 in a hurry to repair
that glitch.  In addition, before I implemented os.fsync() for Windows in
Python 2.3 (which calls MS _commit(), which in turn calls the Win32
FlushFileBuffers()), os.fsync() didn't exist on Windows at all.  Therefore
no Windows user ever had a ZODB that used fsync before Python 2.3 -- and no
complaints or known problems came from that, despite (as you know) that a
surprising number of ZODB and Zope installations do run on Windows.  It's
curious that I have yet to hear of an .fs corruption problem on Windows, btw
-- maybe fsync() was *causing* corruption on Linux <wink>.

os.fsync() may be a real help (and definitely is on Windows) in case of
power outage, but I'm not sure a case has been made that it has any other
good effect.  Who runs a serious site without backup power?!
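To make the mechanism concrete, the append-then-fsync pattern under discussion looks roughly like this (a sketch, not FileStorage's actual code -- the helper name and layout are mine):

```python
import os
import tempfile

def append_record(path, data):
    """Append bytes to a storage file, then force them toward disk.

    flush() moves Python's userland buffer to the OS; os.fsync() asks
    the OS to push its own buffers to the drive.  As noted above, the
    drive's write cache may still hold the bytes after fsync returns.
    """
    with open(path, "ab") as f:
        f.write(data)
        f.flush()               # Python buffer -> OS
        os.fsync(f.fileno())    # OS buffer -> disk (we hope)

# Demo against a throwaway file.
path = os.path.join(tempfile.mkdtemp(), "Data.fs")
append_record(path, b"transaction 1")
append_record(path, b"transaction 2")
```

Because FileStorage only ever appends, a crash at worst leaves a truncated tail, which is what makes the format's simplicity do most of the work fsync is credited with.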

> If there are a lot of ZODB users whose hardware prevents fsync from
> doing its job correctly,

I don't know, and don't know of any way to find out.  I'll note that MS
_commit() is widely rumored *not* to guarantee that bytes are actually on
disk (the MS docs aren't clear on this point).  It does send the OS-level
buffers *to* disk, but plenty can still go wrong at that point.

> ZODB needs to detect corruption after the fact. In particular, ZODB
> won't currently detect the situation where nearly all of a
> transaction was written to disk except for a hole in the middle that
> was scheduled to be written last.

Toby seems to believe that fsync() prevents that (if ...).  FWIW, the few
cases of FileStorage corruption I've looked at or known about usually look
like this:  a contiguous sequence of nonsense bytes exists in the .fs file,
both beginning and ending "in the middle" of otherwise-undamaged data
records and spanning at least one transaction boundary.  Usually these bytes
are all NUL (0); once they appeared to be a slice of some object pickle.
They're usually not "near the end" of the file, nor are the damage
endpoints at disk-block-sized file offsets.  Since neither end of the
damage falls at a position FileStorage ever seeks to, or at a boundary
where FileStorage begins or ends writing data, there's scant plausible
explanation short of severe HW or system SW failure.  A power outage
wouldn't do this,
unless perhaps the disk went nuts while spinning down, spraying gibberish
into seemingly random (but contiguous) locations.
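A crude scan for the NUL-run signature described above might look like this (a hypothetical helper, not one of ZODB's shipped tools; legitimate data can contain NULs too, so hits are only candidates for closer inspection):

```python
def find_nul_runs(path, min_run=512):
    """Return (start, end) offsets of runs of NUL bytes at least
    min_run long.  Reads in 64 KiB chunks; run tracking carries
    across chunk boundaries because offsets are absolute."""
    runs = []
    start = None
    pos = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(1 << 16)
            if not chunk:
                break
            for i, b in enumerate(chunk):
                if b == 0:
                    if start is None:
                        start = pos + i
                else:
                    if start is not None and pos + i - start >= min_run:
                        runs.append((start, pos + i))
                    start = None
            pos += len(chunk)
    # A qualifying run may extend to end-of-file.
    if start is not None and pos - start >= min_run:
        runs.append((start, pos))
    return runs
```

A hit spanning a transaction boundary, well away from the file's tail, would match the damage pattern described above.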

> I wonder if including an md5 sum with every transaction would serve our
> needs better.  With md5 sums, we could eliminate concerns about random
> corruption and partial transactions.  We could make it optional for
> FileStorage to call fsync, since fsync in general hurts performance and
> disk life by forcing head movement.

When would transaction checksums get verified?  Since FileStorage generally
reads only one data record at a time, if we wanted ongoing verification it
would be more useful to have a checksum per data record.  If it's a "stop
the world and read everything in the file" kind of deal, then the
combination of fstest and fsrefs gives pretty good coverage already.  It
would be better coverage if data records for versions and non-current
revisions were loaded, and neither of those programs does that now.  It's
easy to add that to fstest, though, and I intend to do that for ZODB 3.3
(also to make fstest safe to use against a live FileStorage).  An example of
a vulnerability remaining then is that it still wouldn't detect random bytes
sprayed into the middle of the pickle for a string.
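If per-data-record checksums were added, read-time verification might look like this (an illustrative record layout of my own, not the real .fs format):

```python
import hashlib
import struct

def pack_record(data):
    """Prefix a record with an 8-byte length and a 16-byte md5 digest.
    Illustrative only -- not the actual FileStorage record layout."""
    return struct.pack(">Q", len(data)) + hashlib.md5(data).digest() + data

def unpack_record(blob):
    """Verify the digest on every read; raise if the record is damaged.
    This is what would catch random bytes sprayed into the middle of
    a pickle, which structural checks like fstest cannot see."""
    (length,) = struct.unpack(">Q", blob[:8])
    digest, data = blob[8:24], blob[24:24 + length]
    if hashlib.md5(data).digest() != digest:
        raise ValueError("data record checksum mismatch")
    return data

rec = pack_record(b"some pickle bytes")
```

The cost is an md5 computation per object load, plus the format change lamented below.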

BTW, repozo.py already saves md5 checksums for the slices of an .fs file it
backs up.  These are recorded in the backup directory's .dat files, at the
time each backup is made (repozo's optional -Q optimization uses these to
help guess whether a full backup is needed, without reconstructing and
comparing every byte backed up so far).  So people regularly running repozo
could, in theory, get md5 verification of every .fs byte up to the time of
their most recent backup.  repozo doesn't have an option to do that now, and
the repozo command line is already too confusing, but maybe that would make
a nice additional script.
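Such a script could be sketched as follows (the slice-list shape here is my own assumption, loosely modeled on what repozo records; it does not reproduce the actual .dat file format):

```python
import hashlib

def verify_slices(fs_path, slices):
    """Check each (start, end, md5_hexdigest) slice of an .fs file
    against a recorded checksum, returning the slices that fail.
    `slices` is assumed to come from backup metadata such as
    repozo's .dat files (exact format not reproduced here)."""
    bad = []
    with open(fs_path, "rb") as f:
        for start, end, expected in slices:
            f.seek(start)
            actual = hashlib.md5(f.read(end - start)).hexdigest()
            if actual != expected:
                bad.append((start, end))
    return bad
```

Run regularly, this would confirm every byte up to the most recent backup point, subject to the caveat below about repozo copying corruption as happily as anything else.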

OTOH, repozo works purely by copying bytes, and will blithely copy anything
in the .fs, corrupted or not.

So if FileStorage corruption were a frequent problem, I'd vote to add
data-record checksums, and verify them on every object read-from-file.  But
.fs corruption just isn't a frequent problem, and changing the .fs format
wouldn't be much fun for anyone (it wasn't exactly designed with change in
mind <wink>).
