[ZODB-Dev] Large BTrees

Tim Peters tim at zope.com
Tue Dec 16 16:14:52 EST 2003


[Gfeller Martin]
> I used APL in the nineties, certainly a high level, interpreted
> language with non-trivial data structures (and as readable as Perl
> ;-)).
>
> Support for memory profiling was good (space consumed by each object
> "on its own", space of each accessible objects), but memory was much
> more of a constraint at that time, so support *had* to be good.

Hmm.  I used APL in the seventies, and at least at that time it had only two
scalar types ("number" and "8-bit character"), and one container type
("dense array with any number of dimensions").  It was superbly suited to
massive number-crunching, for algorithms that could be written naturally in
that form, but usually more trouble than it was worth for other stuff.

The memory structure of that APL was very simple, of course:  effective APL
programs slammed all their data into a small number of large arrays, and an
array, regardless of shape (rank and actual dimensions), was a single
contiguous piece of storage packed with platform-native floats or characters.
No pointers, no references, no distributed structures of any kind.

If that's good enough for what you're doing, you can get a similar
simplification of world-view from the Python NumArray extension.  It's a lot
like using a Python array.array then, but with richer operators.  You can
get the exact number of bytes required by an array.array by using its
.buffer_info() method.  This is possible because an array.array is-- like
APL arrays in the days I used it --a structurally trivial object, very close
to the hardware.
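
A toy sketch of that (note that .buffer_info() returns the buffer's address
and item count, so the byte count is items * itemsize; the 8000 below assumes
the usual 8-byte platform double):

    >>> from array import array
    >>> a = array('d', range(1000))      # 1000 doubles in one contiguous buffer
    >>> address, nitems = a.buffer_info()
    >>> nitems * a.itemsize              # exact bytes in that buffer
    8000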

It's impossible for Python to account for platform-dependent hidden overhead
imposed by the system malloc(), but for large contiguous structures that
overhead is vanishingly small.  OTOH, for most objects common in *Python*
programs, malloc overhead can be larger than the bytes Python knows about
(for example, a Python int object contains 12 bytes of information on most
32-bit boxes, but how many bytes does malloc actually use beyond that?  It
varies, and Python tries to cut back on it by using a special allocator for
int objects, so even the size of "an int" becomes a surprisingly complicated
question).
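
(An aside from Pythons later than this one:  sys.getsizeof(), added in 2.6,
reports exactly that "bytes Python knows about" figure, and it still can't see
the allocator's hidden overhead.  The 28 below is typical of a 64-bit CPython
3, versus the 12 on the 32-bit boxes above.)

    >>> import sys
    >>> sys.getsizeof(42)    # bytes Python accounts for, not what malloc uses
    28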

In short, if you have a small number of large contiguous objects (which was
my APL experience), reasoning about memory use is easy (in Python, APL, or
anything else).  It's when you have a large number of small and/or
"linked" objects that memory gets confusing.  An extreme in that respect is
a typical Lisp implementation, where you can easily get statistics about the
number of "cons cells" being used by your program, but no idea how that
relates to your higher-level views of the object graphs ultimately built
from cons cells.
