[Zope] XRON and simultaneous triggers

Capesius, Alan CapesiusA@Sysmex.com
Thu, 21 Jun 2001 18:07:23 -0500


A bit more info on my crash. (NTS 4.0 SP6 Zope2.30)
At the time the crash occurred, an out of virtual memory was also displayed
at the server and the server had memory usage above 1GB RAM. The active
python task was not consuming this (nor was anything else, but the service
had been restarted at that point. Global RAM usage?

I've also disabled the error dump messages so that Zope will restart itself
without human intervention when it bombs. Though, in this case it was still
dumping a crash dump file, which takes awhile. I've now disabled crash
dumps. 

Since I was unable to access the server to abort or modify any methods, I
was forced to restart it. This did not resolve the xron problem, as the
scheduler activated immediately and started looping again. I was able to
resolve this by downing the service, removing the xron folder, restarting,
removing the offending ODBC/SQL call in the xron method that I suspected and
restarting again with the XRon folder back in place.

The offending call was a single record update to an Access DB. I am now in
the process of migrating all live query functions over to SQL Server based
tables. I'm not sure if the jet drivers in the O/S were bogging down. My
suspicion (without much evidence) is that the Jet connection was flipping
open and closed so rapidly that something got out of whack with the timing.
(but that theory's more about what I don't know that what I do know :)

Thanks again Loren for the excellent insights into xron. Now i just have to
learn ZClasses, python, and the meaning of life....



>>>-----Original Message-----
>>>From: Loren Stafford [mailto:lstafford@morphics.com]
>>>Sent: Thursday, June 21, 2001 5:55 PM 
>>>To: Capesius, Alan; zope@zope.org
>>>Subject: RE: [Zope] XRON and simultaneous triggers
>>>
>>>
>>>The Xron Dispatcher executes ready events serially. This is 
>>>due to the fact
>>>that Dispatcher consists of a single thread and fires 
>>>scheduled methods with
>>>Client.py, so it has to wait until it gets a response from 
>>>Client.py before
>>>it can service the next ready event.
>>>
>>>So I wouldn't expect to find that Xron contributes to 
>>>concurrency problems
>>>by itself. Of course it might trigger a method that accesses 
>>>your database
>>>at the same time that a real user is accessing the database, 
>>>but that should
>>>be a trivial problem.
>>>
>>>The Dispatcher tries to avoid endless loops when something 
>>>goes wrong in a
>>>scheduled method. First it tries to disarm the scheduled 
>>>method; failing
>>>that, it deletes the corresponding entry from the Schedule catalog.
>>>
>>>It disarms the scheduled method by calling (again via 
>>>Client.py) disarm(), a
>>>private method of the Xron DTML Method, and rescheduling the 
>>>method for
>>>"never". Its attempt to disarm may fail, and the entries you 
>>>are seeing in
>>>the log file show that it is failing. There are only a few 
>>>conditions that
>>>can cause a failure at this point: insufficient priviledges 
>>>to execute the
>>>disarm() method, incorrect URL of the method, failure of the 
>>>socket used by
>>>Client.py, failure to update the Schedule catalog when 
>>>disarming the method.
>>>
>>>In case the Dispatcher fails to disarm a Xron DTML Method, 
>>>it still tries to
>>>prevent looping by deleting the corresponding entry from the Schedule
>>>catalog. I think the only failure that can occur at this 
>>>point is a Schedule
>>>catalog update failure. If it can't change or delete the 
>>>schedule entry from
>>>the catalog, it will find it there again the next time it 
>>>wakes up (which
>>>will be immediately) and try to trigger it again.
>>>
>>>I don't know why it would not be possible to update the 
>>>Schedule at this
>>>point. But I have twice seen such a failure on my system (NT 
>>>4, Zope 2.3.2).
>>>In fact, just yesterday I received 163 email messages 
>>>telling me to take out
>>>the garbage. 8-0
>>>
>>>If anyone has any ideas about how the Dispatcher could fail 
>>>to disarm a
>>>failed event, I'd be happy to listen. In fact, someone who 
>>>is familiar with
>>>ZODB programming should look over the ZODB and transaction logic in
>>>Dispatcher.py and check whether I am doing something wrong 
>>>there. I really
>>>don't understand everything I wrote. 8-)
>>>
>>>In fact, why don't I just include it below.
>>>
>>>-- Loren
>>>
>>>#####################################################################
>>>""" Dispatcher for Xron Events """
>>>
>>>#The Dispatcher runs as a separate thread
>>># It uses the Schedule as its primary data structure
>>># It knows which event is next
>>># It sleeps until that event or a change in the Schedule
>>># It fires an event and logs its output
>>>
>>>import Loggerr
>>>loggerr=Loggerr.loggerr
>>>#import pdb; pdb.set_trace()
>>>
>>>from Globals import DateTime
>>>import sys, string
>>>
>>>maxwait=float(10)  # max time to wait between wake-ups in seconds
>>>
>>>def Timer(ScheduleID, ScheduleChange, rpc):     # aka Dispatcher
>>>  #loggerr(0, 'Dispatcher thread started.')
>>>  # infinite loop
>>>  while 1:
>>>    # Good morning. We just woke up.
>>>    # The first thing we need is a new connection.
>>>    import Zope
>>>    app=Zope.app()
>>>    try:
>>>      Schedule = getattr(app, ScheduleID, None)
>>>    except:
>>>      loggerr(301, 'Cannot access catalog. Suspending operation.')
>>>      break
>>>
>>>    interval=maxwait # Default sleep time. May be recalculated below
>>>    try:
>>>      (atime, aurl)=Schedule.armed_event() # Get next armed event
>>>    except:
>>>      loggerr(302,'Cannot access catalog. Suspending operation.')
>>>      break # out of infinite loop
>>>    if atime is None:
>>>      #loggerr(0, 'No armed events.') # debug
>>>      pass     # Sleep some more
>>>    elif atime.isFuture(): # The next armed event is not yet ready
>>>      #   calculate how long we have to wait
>>>      ainterval=atime.timeTime() - DateTime().timeTime()
>>>      if ainterval < float(0): ainterval=float(0) # Is negative bad?
>>>      interval=ainterval # Comment out for debugging to 
>>>limit to maxwait
>>>    else:                        # This event is ready now.
>>>      # Fire event, and log its output
>>>      emsg= '\nTrigger event: %s\nTrigger time: %s' % (aurl, atime)
>>>      furl=string.join((aurl, 'trigger'), '/')
>>>      try:
>>>        (headers,response)=rpc(furl) # Fire event
>>>        dmsg='%s\n' % response
>>>        loggerr(0, emsg, detail=dmsg) # Log the event and its output
>>>      except:
>>>        type, val, tb = sys.exc_info()
>>>        dmsg="Failed to trigger event.\nType=%s\nVal=%s\n" % 
>>>(type, val)
>>>        loggerr(100, emsg, detail=dmsg)
>>>        del type, val, tb, dmsg
>>>        try:
>>>          rpc('%s/%s' % (aurl, 'disarm')) # Attempt to disarm
>>>          loggerr(100, 'Disarmed event', detail='')
>>>        except:
>>>          # aurl is probably pointing to an event that no 
>>>longer exists
>>>          # or the url doesn't resolve correctly
>>>          loggerr(100, "Failed to disarm event", detail='')
>>>          # Let's just kick it out of the catalog
>>>          # Otherwise, this event will come back to haunt us
>>>          try:
>>>            Schedule.exterminate(aurl)
>>>            get_transaction().commit()
>>>          except:
>>>            pass
>>>      # Finished processing one event
>>>      app._p_jar.sync() # see ZODB/Connection.py
>>>      interval=float(0)
>>>
>>>    # We're going to sleep now; so, free the connection
>>>    app._p_jar.close()
>>>    del app
>>>    #  Sleep for predetermined interval
>>>    #emsg= 'Going to sleep for %s seconds' % (interval)
>>>    ScheduleChange.wait(interval) # in seconds (float)
>>>
>>>    if ScheduleChange.isSet():
>>>      #loggerr(0, 'Awakened by set event.') # debug
>>>      ScheduleChange.clear()
>>>      # Schedule has changed, we woke up early
>>>      # Loop back to the top and check for
>>>      # an earlier event than we were waiting for
>>>    #else:
>>>      #loggerr(0, 'Awakened by timeout.') # debug
>>>      # We timed out, so there must be a ready event
>>>      # Loop back to the top and trigger the next ready event.
>>>      # That's probably the one we were waiting for.
>>>      #pass
>>>
>>>  # End of while 1 loop.
>>>  # Something bad happened, let's clean up before quitting for good
>>>  try:
>>>    app._p_jar.close()
>>>    del app
>>>  except:
>>>    pass
>>>  loggerr(100, 'Dispatcher thread is terminating.')
>>>
>>>============== end ================
>>>