[ZODB-Dev] ZODB cache reimplementation

Jim Fulton jim@zope.com
Fri, 25 Jan 2002 10:02:38 -0500


Toby Dickenson wrote:
> 
> On Thursday 24 January 2002 11:07 pm, you wrote:
> >Toby Dickenson wrote:

...

> I am currently undecided between two implementations. Variant A is very
> similar to what you describe below.
> 
> Variant B does not use weak references. This adds implementation simplicity,
> which is its main advantage. I don't see a clear advantage to automatically
> removing an object just because it has no references. Of course objects with
> no references are likely to remain unused, and the LRU rule will then remove
> them.

Advantages of weak references (and separating ghosts):

- Objects are removed from memory as soon as possible

- There are fewer objects to scan

- You have better information on memory usage, because you track
  how many objects are ghosts.

I really think these advantages are pretty significant. The implementation
is not very complicated.
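As a rough illustration of the first advantage, here's a minimal Python
sketch (the real implementation would be C, and the names here are
illustrative, not ZODB's): the cache holds weak references keyed by oid,
and a callback removes the entry the moment the object is deallocated.

```python
import weakref

class Cache:
    """Toy model of a cache that drops its entry as soon as the
    persistent object's last strong reference goes away."""

    def __init__(self):
        self.data = {}  # oid -> weak reference to the persistent object

    def add(self, oid, obj):
        # The weakref callback fires at deallocation time, so the
        # object is removed from the cache as soon as possible --
        # no scan is needed to notice it is gone.
        self.data[oid] = weakref.ref(
            obj, lambda _ref, oid=oid: self.data.pop(oid, None))

    def get(self, oid):
        ref = self.data.get(oid)
        return ref() if ref is not None else None
```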

Some additional notes on my proposal:

- We don't really need the ghost list, as there's no need
  to scan ghosts.

- The active list is not really an active list but a ghostifyable
  list. IOW, it should only contain objects in the up-to-date
  state. We might want to keep ghost and ghostifyable counts
  to inform GC.

> Variant B adds a small memory overhead. We want to keep the garbage
> collection to a single pass through the LRU list (and probably not even the
> whole list). That means that on ghostifying an object in the middle of the
> list, we may drop the last reference (last but one, counting the dict's) to
> a previously scanned object. However, it will not be deallocated
> until a subsequent scan.

I don't follow this.
 
> This may also add a performance overhead, if it means we have to check more
> objects in each scan.

Yes.
 
> >- Use your idea of a doubly linked list, with recently
> >  used objects at one end.
> 
> A mammoth hacking session yesterday has produced a first prototype of this
> scheme. So far things look good. My benchmark is a test script which, under
> the old scheme, would use unbounded memory unless I added many calls to the
> incremental garbage collector through my application logic. Under the LRU
> scheme it runs using a constant 23M.

Way cool.
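For readers following along, the LRU scheme under discussion could be
sketched in Python like this, using an OrderedDict in place of the
hand-rolled doubly linked list a C implementation would use.
"Ghostifying" is modeled as plain eviction, and the target size is an
assumption for illustration only.

```python
from collections import OrderedDict

class LRUCache:
    """Toy LRU cache: recently used objects at the back, GC scans
    from the front."""

    def __init__(self, target):
        self.target = target
        # oid -> object, in LRU order; the front is the oldest entry.
        self.active = OrderedDict()

    def access(self, oid, obj):
        # On every access, move the object to the back of the list
        # (most recently used). No call to time() is needed.
        self.active[oid] = obj
        self.active.move_to_end(oid)

    def gc(self):
        # A single pass from the front: "ghostify" (here, evict) the
        # least recently used objects until we reach the target size.
        while len(self.active) > self.target:
            self.active.popitem(last=False)
```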
 
> I hope to have something suitable for sharing later today, or monday.

Great.
 
> Also, an unexpected advantage of this is that we avoid a call to time() on
> every attribute access. I think this may be the reason that some
> non-memory-intensive parts of my benchmark are showing a speedup too (on
> linux 2.4)

Yes. This is an important point. Guido was alarmed at calling time() on 
every access, as time() is a system call and thus expensive, at least on 
Unixes.

> >- Separate the active objects from the ghosts, since we no longer,
> >  with weak references, need to scan the ghosts.
> 
> yes, that is the advantage of weak references. ghosts take care of themselves.

The other big advantage, as I point out above, is that objects, including
non-ghosts, can be freed sooner.
 
> >I imagine that this is close to what you've been thinking of.
> 
> Very close to my variant A.
> 
> >The cache has a dictionary that contains references to intermediate
> > reference objects. The reference objects have references to the cache and
> > to persistent objects. A persistent object is obtained from the cache
> > through the cache reference.
> >
> >A reference object will be in one of two linked lists in the cache, one for
> > ghosts and one for active objects.  The reference moves between lists when
> > the persistent object changes states. List counts are also maintained for
> > use by the cache management software. When an object is accessed, it is
> > moved to the back of the active list.
> 
> Some observations:
> 
> Does the persistent object have a reference to its intermediate object?

Yes, instead of the data manager pointer. Given that there are a lot of
persistent objects, we want to keep the memory overhead down.

> In my
> scheme I had assumed not. List pointers are therefore stored in the
> persistent object, not the intermediate, meaning one less indirection on
> every attribute access (which I assume to be a common operation).

I think C indirections are so cheap as to be effectively free, compared
to everything else going on.

> My other
> reason for doing it the other way is so I can share more implementation with
> my Variant B, which does not use intermediate objects.

I think we can avoid the intermediate object, assuming we are careful
about the cache data mapping. I think it is sufficient to be careful
about oid selection (e.g. only simple values).
 
> I don't think any scheme needs a list for ghosts. Why do they need to be
> ordered?

Nope, not needed.
 
> > When the cache does GC, it scans the
> > active list from the front for objects to be removed.  Normally, any object
> > it finds may be removed, unless it is locked. An object is locked when it
> > is accessed from C. Perhaps there should be a separate locked list.
> 
> I've never used that locking mechanism, but the examples I have to work from
> (BTrees) only lock briefly. I am expecting the GC scanning to encounter few
> locked objects even if they are kept in the active list.

As I mentioned above, we should probably just have a ghostifyable list.
When an object is locked, it is removed from the list. When it is unlocked, 
it is added.
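That bookkeeping could be sketched in Python as below, with a plain list
standing in for the doubly linked list; the lock()/unlock() method names
are assumptions for illustration.

```python
class GhostifyableList:
    """Toy model: only up-to-date, unlocked objects live on the list,
    so a GC scan never has to skip over locked entries."""

    def __init__(self):
        self.items = []  # stands in for the doubly linked list

    def activate(self, obj):
        if obj not in self.items:
            self.items.append(obj)

    def lock(self, obj):
        # A locked object (being accessed from C) leaves the list
        # entirely...
        if obj in self.items:
            self.items.remove(obj)

    def unlock(self, obj):
        # ...and rejoins at the recently-used end when unlocked.
        self.items.append(obj)
```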
 
> >The cache reference is essentially a weak reference to the object. It does
> >not increase the persistent object's reference count. When a persistent
> > object is deallocated, it (in its dealloc):
> >
> >  - Removes (zeros) its reference from the reference object,
> >
> >  - Removes the reference object from whatever list it's in,
> >
> >  - Removes itself from the cache's data mapping by deleting the
> >    key with the object's object id.
> >
> >It's awfully tempting to combine the reference object with the persistent
> >object to avoid the malloc and Python object overhead and the object
> > pointer. There's some risk that somehow the deletion of the object from
> > the cache's data mapping could fail on deallocation, which would lead to a
> > memory error of some kind later.
> 
> It would be nicer to play fair with the Dictionary implementation, and allow
> him to keep a reference count on his referent.

She does. In my scheme, the dictionary has a counted reference to the
cache reference. If we get rid of the cache reference, then the persistent
object reference would be uncounted and we'd have to be absolutely sure
that we can remove the object from the dictionary. Guido is confident that
this is safe, assuming that the keys (oids) don't change their value or hash.
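To make the scheme concrete, here is a hedged Python sketch: the cache
dictionary holds a counted reference to an intermediate cache-reference
object, which in turn holds an uncounted reference (modeled here with a
weakref) to the persistent object; when the persistent object dies, it
zeros that reference and deletes its oid from the cache's data mapping.
All class and method names are illustrative, not ZODB's.

```python
import weakref

class CacheRef:
    """Toy intermediate reference object sitting between the cache's
    dictionary and the persistent object."""

    def __init__(self, cache, oid, obj):
        self.cache = cache
        self.oid = oid
        # The weakref plays the role of the uncounted C pointer; its
        # callback models the persistent object's dealloc hook.
        self._ref = weakref.ref(obj, self._object_gone)

    def _object_gone(self, _ref):
        # Zero the reference, then remove ourselves from the cache's
        # data mapping by deleting the key with the object's oid.
        self._ref = None
        self.cache.data.pop(self.oid, None)

    def get(self):
        return self._ref() if self._ref is not None else None

class Cache:
    def __init__(self):
        # oid -> CacheRef; the dict holds a counted reference to the
        # CacheRef, never directly to the persistent object.
        self.data = {}

    def add(self, oid, obj):
        self.data[oid] = CacheRef(self, oid, obj)
```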

Jim

--
Jim Fulton           mailto:jim@zope.com       Python Powered!        
CTO                  (888) 344-4332            http://www.python.org  
Zope Corporation     http://www.zope.com       http://www.zope.org