[Checkins] SVN: BTrees/trunk/ Re-borrow narrative docs from the ZODB Guide.
Tres Seaver
cvs-admin at zope.org
Sat Oct 20 18:39:29 UTC 2012
Log message for revision 128075:
Re-borrow narrative docs from the ZODB Guide.
Changed:
_U BTrees/trunk/
_U BTrees/trunk/BTrees/
_U BTrees/trunk/docs/
U BTrees/trunk/docs/index.rst
-=-
Property changes on: BTrees/trunk
___________________________________________________________________
Modified: svn:ignore
- bin
build
develop-eggs
eggs
parts
.installed.cfg
*.egg-info
+ bin
build
develop-eggs
eggs
parts
.installed.cfg
*.egg-info
.coverage
nosetests.xml
.tox
Property changes on: BTrees/trunk/BTrees
___________________________________________________________________
Modified: svn:ignore
- *.so
+ *.so
coverage.xml
Property changes on: BTrees/trunk/docs
___________________________________________________________________
Added: svn:ignore
+ _build
Modified: BTrees/trunk/docs/index.rst
===================================================================
--- BTrees/trunk/docs/index.rst 2012-10-19 15:35:17 UTC (rev 128074)
+++ BTrees/trunk/docs/index.rst 2012-10-20 18:39:26 UTC (rev 128075)
@@ -11,8 +11,410 @@
.. toctree::
:maxdepth: 2
+Overview
+--------
+When programming with the ZODB, Python dictionaries aren't always what you
+need. The most important case is where you want to store a very large
+mapping. When a Python dictionary is accessed in a ZODB, the whole
+dictionary has to be unpickled and brought into memory. If you're storing
+something very large, such as a 100,000-entry user database, unpickling
+such a large object will be slow. BTrees are a balanced tree data
+structure that behave like a mapping but distribute keys throughout a
+number of tree nodes. The nodes are stored in sorted order (this has
+important consequences -- see below). Nodes are then only unpickled and
+brought into memory as they're accessed, so the entire tree doesn't have to
+occupy memory (unless you really are touching every single key).
+The BTrees package provides a large collection of related data structures.
+There are variants of the data structures specialized to integers, which
+are faster and use less memory. There are five modules that handle the
+different variants. The first two letters of the module name specify the
+types of the keys and values in mappings -- O for any object, I for 32-bit
+signed integer, and (new in ZODB 3.4) F for 32-bit C float. For example,
+the :mod:`BTrees.IOBTree` module provides a mapping with integer keys and
+arbitrary objects as values.
+
+The four data structures provide by each module are a BTree, a Bucket, a
+TreeSet, and a Set. The BTree and Bucket types are mappings and support
+all the usual mapping methods, e.g. :func:`update` and :func:`keys`. The
+TreeSet and Set types are similar to mappings but they have no values; they
+support the methods that make sense for a mapping with no keys, e.g.
+:func:`keys` but not :func:`items`. The Bucket and Set types are the
+individual building blocks for BTrees and TreeSets, respectively. A Bucket
+or Set can be used when you are sure that it will have few elements. If
+the data structure will grow large, you should use a BTree or TreeSet. Like
+Python lists, Buckets and Sets are allocated in one contiguous piece, and
+insertions and deletions can take time proportional to the number of
+existing elements. Also like Python lists, a Bucket or Set is a single
+object, and is pickled and unpickled in its entirety. BTrees and TreeSets
+are multi-level tree structures with much better (logarithmic) worst- case
+time bounds, and the tree structure is built out of multiple objects, which
+ZODB can load individually as needed.
+
+The five modules are named :mod:`OOBTree`, :mod:`IOBTree`, :mod:`OIBTree`,
+:mod:`IIBTree`, and (new in ZODB 3.4) :mod:`IFBTree`. The two letter
+prefixes are repeated in the data types names. The :mod:`BTrees.OOBTree`
+module defines the following types: :class:`OOBTree`, :class:`OOBucket`,
+:class:`OOSet`, and :class:`OOTreeSet`. Similarly, the other four modules
+each define their own variants of those four types.
+
+The :func:`keys`, :func:`values`, and :func:`items` methods on BTree and
+TreeSet types do not materialize a list with all of the data. Instead,
+they return lazy sequences that fetch data from the BTree as needed. They
+also support optional arguments to specify the minimum and maximum values
+to return, often called "range searching". Because all these types are
+stored in sorted order, range searching is very efficient.
+
+The :func:`keys`, :func:`values`, and :func:`items` methods on Bucket and
+Set types do return lists with all the data. Starting in ZODB 3.3, there
+are also :func:`iterkeys`, :func:`itervalues`, and :func:`iteritems`
+methods that return iterators (in the Python 2.2 sense). Those methods
+also apply to BTree and TreeSet objects.
+
+A BTree object supports all the methods you would expect of a mapping, with
+a few extensions that exploit the fact that the keys are sorted. The
+example below demonstrates how some of the methods work. The extra methods
+are :func:`minKey` and :func:`maxKey`, which find the minimum and maximum
+key value subject to an optional bound argument, and :func:`byValue`, which
+should probably be ignored (it's hard to explain exactly what it does, and
+as a result it's almost never used -- best to consider it deprecated). The
+various methods for enumerating keys, values and items also accept minimum
+and maximum key arguments ("range search"), and (new in ZODB 3.3) optional
+Boolean arguments to control whether a range search is inclusive or
+exclusive of the range's endpoints.
+
+.. doctest::
+
+ >>> from BTrees.OOBTree import OOBTree
+ >>> t = OOBTree()
+ >>> t.update({1: "red", 2: "green", 3: "blue", 4: "spades"})
+ >>> len(t)
+ 4
+ >>> t[2]
+ 'green'
+ >>> s = t.keys() # this is a "lazy" sequence object
+ >>> s
+ <OOBTreeItems object at ...>
+ >>> len(s) # it acts like a Python list
+ 4
+ >>> s[-2]
+ 3
+ >>> list(s) # materialize the full list
+ [1, 2, 3, 4]
+ >>> list(t.values())
+ ['red', 'green', 'blue', 'spades']
+ >>> list(t.values(1, 2)) # values at keys in 1 to 2 inclusive
+ ['red', 'green']
+ >>> list(t.values(2)) # values at keys >= 2
+ ['green', 'blue', 'spades']
+ >>> list(t.values(min=1, max=4)) # keyword args new in ZODB 3.3
+ ['red', 'green', 'blue', 'spades']
+ >>> list(t.values(min=1, max=4, excludemin=True, excludemax=True))
+ ['green', 'blue']
+ >>> t.minKey() # smallest key
+ 1
+ >>> t.minKey(1.5) # smallest key >= 1.5
+ 2
+ >>> for k in t.keys():
+ ... print k,
+ 1 2 3 4
+ >>> for k in t: # new in ZODB 3.3
+ ... print k,
+ 1 2 3 4
+ >>> for pair in t.iteritems(): # new in ZODB 3.3
+ ... print pair,
+ ...
+ (1, 'red') (2, 'green') (3, 'blue') (4, 'spades')
+ >>> t.has_key(4) # returns a true value, but exactly what undefined
+ 2
+ >>> t.has_key(5)
+ 0
+ >>> 4 in t # new in ZODB 3.3
+ True
+ >>> 5 in t # new in ZODB 3.3
+ False
+ >>>
+
+
+Each of the modules also defines some functions that operate on BTrees --
+:func:`difference`, :func:`union`, and :func:`intersection`. The
+:func:`difference` function returns a Bucket, while the other two methods return
+a Set. If the keys are integers, then the module also defines
+:func:`multiunion`. If the values are integers or floats, then the module also
+defines :func:`weightedIntersection` and :func:`weightedUnion`. The function
+doc strings describe each function briefly.
+
+.. % XXX I'm not sure all of the following is actually correct. The
+.. % XXX set functions have complicated behavior.
+
+``BTrees/Interfaces.py`` defines the operations, and is the official
+documentation. Note that the interfaces don't define the concrete types
+returned by most operations, and you shouldn't rely on the concrete types that
+happen to be returned: stick to operations guaranteed by the interface. In
+particular, note that the interfaces don't specify anything about comparison
+behavior, and so nothing about it is guaranteed. In ZODB 3.3, for example, two
+BTrees happen to use Python's default object comparison, which amounts to
+comparing the (arbitrary but fixed) memory addresses of the BTrees. This may or
+may not be true in future releases. If the interfaces don't specify a behavior,
+then whether that behavior appears to work, and exactly happens if it does
+appear to work, are undefined and should not be relied on.
+
+
+Total Ordering and Persistence
+------------------------------
+
+The BTree-based data structures differ from Python dicts in several fundamental
+ways. One of the most important is that while dicts require that keys support
+hash codes and equality comparison, the BTree-based structures don't use hash
+codes and require a total ordering on keys.
+
+Total ordering means three things:
+
+#. Reflexive. For each *x*, ``x == x`` is true.
+
+#. Trichotomy. For each *x* and *y*, exactly one of ``x < y``, ``x == y``, and
+ ``x > y`` is true.
+
+#. Transitivity. Whenever ``x <= y`` and ``y <= z``, it's also true that ``x <=
+ z``.
+
+The default comparison functions for most objects that come with Python satisfy
+these rules, with some crucial cautions explained later. Complex numbers are an
+example of an object whose default comparison function does not satisfy these
+rules: complex numbers only support ``==`` and ``!=`` comparisons, and raise an
+exception if you try to compare them in any other way. They don't satisfy the
+trichotomy rule, and must not be used as keys in BTree-based data structures
+(although note that complex numbers can be used as keys in Python dicts, which
+do not require a total ordering).
+
+Examples of objects that are wholly safe to use as keys in BTree-based
+structures include ints, longs, floats, 8-bit strings, Unicode strings, and
+tuples composed (possibly recursively) of objects of wholly safe types.
+
+It's important to realize that even if two types satisfy the rules on their own,
+mixing objects of those types may not. For example, 8-bit strings and Unicode
+strings both supply total orderings, but mixing the two loses trichotomy; e.g.,
+``'x' < chr(255)`` and ``u'x' == 'x'``, but trying to compare ``chr(255)`` to
+``u'x'`` raises an exception. Partly for this reason (another is given later),
+it can be dangerous to use keys with multiple types in a single BTree-based
+structure. Don't try to do that, and you don't have to worry about it.
+
+Another potential problem is mutability: when a key is inserted in a BTree-
+based structure, it must retain the same order relative to the other keys over
+time. This is easy to run afoul of if you use mutable objects as keys. For
+example, lists supply a total ordering, and then
+
+.. doctest::
+
+ >>> L1, L2, L3 = [1], [2], [3]
+ >>> from BTrees.OOBTree import OOSet
+ >>> s = OOSet((L2, L3, L1)) # this is fine, so far
+ >>> list(s.keys()) # note that the lists are in sorted order
+ [[1], [2], [3]]
+ >>> s.has_key([3]) # and [3] is in the set
+ 1
+ >>> L2[0] = 5 # horrible -- the set is insane now
+ >>> s.has_key([3]) # for example, it's insane this way
+ 0
+ >>> s
+ OOSet([[1], [5], [3]])
+ >>>
+
+Key lookup relies on that the keys remain in sorted order (an efficient form of
+binary search is used). By mutating key L2 after inserting it, we destroyed the
+invariant that the OOSet is sorted. As a result, all future operations on this
+set are unpredictable.
+
+A subtler variant of this problem arises due to persistence: by default, Python
+does several kinds of comparison by comparing the memory addresses of two
+objects. Because Python never moves an object in memory, this does supply a
+usable (albeit arbitrary) total ordering across the life of a program run (an
+object's memory address doesn't change). But if objects compared in this way
+are used as keys of a BTree-based structure that's stored in a database, when
+the objects are loaded from the database again they will almost certainly wind
+up at different memory addresses. There's no guarantee then that if key K1 had
+a memory address smaller than the memory address of key K2 at the time K1 and K2
+were inserted in a BTree, K1's address will also be smaller than K2's when that
+BTree is loaded from a database later. The result will be an insane BTree,
+where various operations do and don't work as expected, seemingly at random.
+
+Now each of the types identified above as "wholly safe to use" never compares
+two instances of that type by memory address, so there's nothing to worry about
+here if you use keys of those types. The most common mistake is to use keys
+that are instances of a user-defined class that doesn't supply its own
+:meth:`__cmp__` method. Python compares such instances by memory address. This
+is fine if such instances are used as keys in temporary BTree-based structures
+used only in a single program run. It can be disastrous if that BTree-based
+structure is stored to a database, though.
+
+.. doctest::
+
+ >>> class C:
+ ... pass
+ ...
+ >>> a, b = C(), C()
+ >>> print a < b # this may print 0 if you try it
+ True
+ >>> del a, b
+ >>> a, b = C(), C()
+ >>> print a < b # and this may print 0 or 1
+ False
+ >>>
+
+That example illustrates that comparison of instances of classes that don't
+define :meth:`__cmp__` yields arbitrary results (but consistent results within a
+single program run).
+
+Another problem occurs with instances of classes that do define :meth:`__cmp__`,
+but define it incorrectly. It's possible but rare for a custom :meth:`__cmp__`
+implementation to violate one of the three required formal properties directly.
+It's more common for it to "fall back" to address-based comparison by mistake.
+For example,
+
+.. doctest::
+
+ class Mine:
+ def __cmp__(self, other):
+ if other.__class__ is Mine:
+ return cmp(self.data, other.data)
+ else:
+ return cmp(self.data, other)
+
+It's quite possible there that the :keyword:`else` clause allows a result to be
+computed based on memory address. The bug won't show up until a BTree-based
+structure uses objects of class :class:`Mine` as keys, and also objects of other
+types as keys, and the structure is loaded from a database, and a sequence of
+comparisons happens to execute the :keyword:`else` clause in a case where the
+relative order of object memory addresses happened to change.
+
+This is as difficult to track down as it sounds, so best to stay far away from
+the possibility.
+
+You'll stay out of trouble by follwing these rules, violating them only with
+great care:
+
+#. Use objects of simple immutable types as keys in BTree-based data structures.
+
+#. Within a single BTree-based data structure, use objects of a single type as
+ keys. Don't use multiple key types in a single structure.
+
+#. If you want to use class instances as keys, and there's any possibility that
+ the structure may be stored in a database, it's crucial that the class define a
+ :meth:`__cmp__` method, and that the method is carefully implemented.
+
+ Any part of a comparison implementation that relies (explicitly or implicitly)
+ on an address-based comparison result will eventually cause serious failure.
+
+#. Do not use :class:`Persistent` objects as keys, or objects of a subclass of
+ :class:`Persistent`.
+
+That last item may be surprising. It stems from details of how conflict
+resolution is implemented: the states passed to conflict resolution do not
+materialize persistent subobjects (if a persistent object P is a key in a BTree,
+then P is a subobject of the bucket containing P). Instead, if an object O
+references a persistent subobject P directly, and O is involved in a conflict,
+the states passed to conflict resolution contain an instance of an internal
+:class:`PersistentReference` stub class everywhere O references P. Two
+:class:`PersistentReference` instances compare equal if and only if they
+"represent" the same persistent object; when they're not equal, they compare by
+memory address, and, as explained before, memory-based comparison must never
+happen in a sane persistent BTree. Note that it doesn't help in this case if
+your :class:`Persistent` subclass defines a sane :meth:`__cmp__` method:
+conflict resolution doesn't know about your class, and so also doesn't know
+about its :meth:`__cmp__` method. It only sees instances of the internal
+:class:`PersistentReference` stub class.
+
+
+Iteration and Mutation
+----------------------
+
+As with a Python dictionary or list, you should not mutate a BTree-based data
+structure while iterating over it, except that it's fine to replace the value
+associated with an existing key while iterating. You won't create internal
+damage in the structure if you try to remove, or add new keys, while iterating,
+but the results are undefined and unpredictable. A weak attempt is made to
+raise :exc:`RuntimeError` if the size of a BTree-based structure changes while
+iterating, but it doesn't catch most such cases, and is also unreliable.
+Example
+
+.. doctest::
+ :options: +SKIP
+
+ >>> from BTrees.IIBTree import *
+ >>> s = IISet(range(10))
+ >>> list(s)
+ [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
+ >>> for i in s: # the output is undefined
+ ... print i,
+ ... s.remove(i)
+ 0 2 4 6 8
+ Traceback (most recent call last):
+ File "<stdin>", line 1, in ?
+ RuntimeError: the bucket being iterated changed size
+ >>> list(s) # this output is also undefined
+ [1, 3, 5, 7, 9]
+ >>>
+
+Also as with Python dictionaries and lists, the safe and predictable way to
+mutate a BTree-based structure while iterating over it is to iterate over a copy
+of the keys. Example
+
+.. doctest::
+
+ >>> from BTrees.IIBTree import *
+ >>> s = IISet(range(10))
+ >>> for i in list(s.keys()): # this is well defined
+ ... print i,
+ ... s.remove(i)
+ 0 1 2 3 4 5 6 7 8 9
+ >>> list(s)
+ []
+ >>>
+
+
+BTree Diagnostic Tools
+----------------------
+
+A BTree (or TreeSet) is a complex data structure, really a graph of variable-
+size nodes, connected in multiple ways via three distinct kinds of C pointers.
+There are some tools available to help check internal consistency of a BTree as
+a whole.
+
+Most generally useful is the :mod:`BTrees.check` module. The
+:func:`check.check` function examines a BTree (or Bucket, Set, or TreeSet) for
+value-based consistency, such as that the keys are in strictly increasing order.
+See the function docstring for details. The :func:`check.display` function
+displays the internal structure of a BTree.
+
+BTrees and TreeSets also have a :meth:`_check` method. This verifies that the
+(possibly many) internal pointers in a BTree or TreeSet are mutually consistent,
+and raises :exc:`AssertionError` if they're not.
+
+If a :func:`check.check` or :meth:`_check` call fails, it may point to a bug in
+the implementation of BTrees or conflict resolution, or may point to database
+corruption.
+
+Repairing a damaged BTree is usually best done by making a copy of it. For
+example, if *self.data* is bound to a corrupted IOBTree,
+
+.. doctest::
+
+ self.data = IOBTree(self.data)
+
+usually suffices. If object identity needs to be preserved,
+
+.. doctest::
+
+ acopy = IOBTree(self.data)
+ self.data.clear()
+ self.data.update(acopy)
+
+does the same, but leaves *self.data* bound to the same object.
+
+
+
+
Indices and tables
==================
More information about the checkins
mailing list