[Zope-dev] ZEO cache instrumentation -- any takers?

Leonardo Rochael Almeida leo@hiper.com.br
03 Oct 2002 21:41:26 -0300


On Thu, 2002-10-03 at 01:11, Jeremy Hylton wrote:
> >>>>> "LRA" == Leonardo Rochael Almeida <leo@hiper.com.br> writes:
> 
>   LRA> The last two installations belong to the same client, and are
>   LRA> both experiencing a problem that I believe is related to ZEO
>   LRA> latency. Both sites depend heavily on a single ZCatalog for
>   LRA> operation (one ZCatalog hit in almost every page, with some
>   LRA> pages with more than one ZCatalog hit). When site content is
>   LRA> being managed, we're hitting ReadConflict errors (They read
>   LRA> like this: ZODB.POSException.ReadConflictError on database read
>   LRA> conflict error (oid000000000004c0a4, class
>   LRA> BTrees.OOBTree.OOBucket)). The reference to BTrees makes me
>   LRA> think this is ZCatalog related, we don't use BTrees ourselves
>   LRA> anywhere else and we only use regular Folders as bases for our
>   LRA> folderish objects, and managing content causes the reindexing
>   LRA> of objects. Which I think is what is triggering the conflict
>   LRA> errors.
> 
> Your analysis sounds correct.  A read conflict occurs when on thread
> modifies an object while another thread is reading it (subject to
> certain other conditions that probably aren't relevant).  I assume
> several users are managing content at the same time.

Not really. The system is in the last stages of functional testing with
the client (it'd be funny if it wasn't so tragic :-), so we have a
couple of insertions and modifications happening at the same time the
public interface is being viewed.

> If Zope gets a read conflict, it will retry the transaction. So you
> must get very unlikely in order to see the error.  Each time the
> transaction is retried, something must go wrong.

That's the assumption I was working under.

> Are there a lot of concurrent transactions?

Not really, as I said, we're only testing.

> Or do the updates affect a large number of objects at once? 

Not really, unless the ZCatalog update changes a bunch of BTree
subobjects sequentially. Even then, the sleep between retries should
take care of that, unless the ZEO latency is really high, which I don't
think is enough to explain.

> If either is true, it seems like
> it would make the problem more likely to occur.

True, but the problem could be something else entirely:

We setup the standard error page to call an external method that writes
the exception data (Exception, request path and traceback) to zLOG.

One thing I just noticed is this little snippet in the stupid log:

>>>

2002-09-25T19:57:36 ERROR(200) FieldIndex unindex_object could not
remove
documentId 183870929 from index status.  This should not happen.
Traceback (innermost last):
  File
E:\aplic\ZCLIEN~1\lib\python\Products\PluginIndexes\common\UnIndex.py,
line
168, in removeForwardIndexEntry
    (Object: status)
  File E:\aplic\ZCLIEN~1\lib\python\ZODB\Connection.py, line 463, in
setstate
ReadConflictError: database read conflict error (oid 000000000005cffa,
class
BTrees.IIBTree.IITreeSet)

<<<

It seems unindex_object is indiscriminately trapping ReadConflictErrors
instead of letting ZPublisher deal with it. I believe this is not the
cause of us seeing ReadConflictErrors in the browser, but I decided to
point this out since it looks like a bug to me. The message above was
caused by an 'exceptionless' (that a word?) `exception:' clause in
lib/python/Products/PluginIndexes/common/UnIndex.py:178

   And then, the lightning bolt of illumination hits me...

I see ZPublisher generated conflict errors in the stupid log that aren't
followed by the log messages generated by the standard_error_page, but
the conflict errors that ARE trapped by standard_error_page aren't
followed by ZPublisher 'conflict error' messages. So I go searching and
what do I find?

>>> lib/python/Products/PageTemplates/TALES.py:253+
[... def evaluate(self, expression ...]
        except:
            raise TALESError, (`expression`, sys.exc_info(),
                               self.position), sys.exc_info()[2]
        else:
            return v
<<<

So ZPublisher isn't retrying this ConflictError because it isn't getting
any! it's getting a TALESError! which it's dutifully reporting via the
usual mechanisms!

Well, this is an obvious error, but I'm too tired now to think of a fix.
Besides, in a few minuts I have to watch the debate between the
presidential candidates for the election this Sunday.

	Cheers, Leo

PS: Hey, Guido, any chance of deprecating exceptionless 'exception:'
clauses anytime soon? We got a bunch of those making our life harder on
Zope :-)

-- 
Ideas don't stay in some minds very long because they don't like
solitary confinement.