[ZODB-Dev] return of corruption errors

Tim Peters tim at zope.com
Tue Oct 19 18:47:04 EDT 2004


[Gerry Kirk]
>>> The strange thing is, I don't get errors every night.

[Tim Peters]
>> That's not strange if it *is* a HW (etc) bug.  It would be strange if it
>> were a deterministic error in the ZODB/ZEO code.

[Gerry]
> Yes, but this is on the same, unrepaired Data.fs file. One night, it has
> an issue, another night, it doesn't, when running fstest.py.

The same file?  If so, then you definitely have a major system problem of
some kind.  There's nothing non-deterministic about what fstest.py does:
feed it the same file a trillion times, and it will do exactly the same
thing a trillion times.  If in one of those trillion times it does something
different, you have a HW or system SW problem of some sort.  Well, at "a
trillion" even a stray gamma ray could come into play, but you get the idea
...

Random idea:  run your system's md5sum on it from time to time, and see
whether you get the same digest every time.  If you don't, the nature of the
problem gets much clearer.  If you do get the same digest every time, this
could branch in many directions.  I'll note that if you typically have to
wait a day or so before fstest.py fails again, that suggests a disk-related
problem.
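
A minimal sketch of that digest check in Python, for anyone who would rather script it than call md5sum by hand.  The helper names (file_md5, digests_agree) and the chunk size are illustrative assumptions, not part of ZODB or fstest.py:

```python
import hashlib

def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read sequentially in fixed-size chunks so a large Data.fs
        # never has to fit in memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def digests_agree(path, runs=3):
    """Hash the same file `runs` times; return (all_equal, digests)."""
    digests = [file_md5(path) for _ in range(runs)]
    return len(set(digests)) == 1, digests

# Hypothetical usage: digests_agree("Data.fs")
```

Back-to-back runs may be served from the operating system's file cache rather than the disk, so spacing the runs out over hours or days (and logging each digest) is more likely to catch a disk-level problem.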

> I do agree, though, that there appears to be an issue here.

It should be impossible for fstest.py to display different behavior on
different runs against the same input.  If you haven't yet, bring up
fstest.py in an editor.  It's not that big.  There are no threads and no
calls to random.random() <wink>; everything it does is wholly determined by
the values in the bytes it reads from the file.  It does seeks on the file,
so that's one way it stresses the system more than md5sum would.

> We installed some memory about a month back which had some conflict
> issues and froze up the system a few times.

Well, that is scary.  If the memory system delivered "bad bytes" on any load
or store *before* the system froze, those could have ended up anywhere on
your disk, including corrupting system software.

> Since then, I successfully ran fstest.py and fsrefs.py, then packed the
> db to 1 day, to a size smaller than where the current issue is. I've
> been running those tests on it nightly.

I can just repeat that there's nothing in fstest.py that should vary across
runs, and I'll add that nobody has reported such a thing before.  Note that
because it's coded in Python, it's not possible that, e.g., fstest.py uses
the value of an uninitialized variable, or does any of a dozen other things
that can cause unpredictable behavior in C code.  In Python, those kinds of
things raise exceptions instead.
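
To illustrate that last point, here is a small sketch (the function names are made up for the example and have nothing to do with fstest.py) of what happens when Python code reads a local variable that was never initialized.  Instead of silently using whatever garbage happens to be in memory, as C might, it raises UnboundLocalError every time:

```python
def sum_sizes(sizes):
    for size in sizes:
        total += size  # bug: `total` is never initialized before this read
    return total

def outcome(sizes):
    """Call sum_sizes and report the exception name if the bug is hit."""
    try:
        return sum_sizes(sizes)
    except UnboundLocalError as exc:
        return type(exc).__name__
```

Calling outcome([1, 2, 3]) reports the exception name on every run; the failure is loud and repeatable rather than data-dependent.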


