[ZODB-Dev] Space used by IOBTrees

Tim Peters tim@zope.com
Fri, 28 Feb 2003 13:13:18 -0500


[Andreas Jung]
> ...
> Another question: I had a closer look at the pickles themselves using
> pickletools. The PCDATA parts of the XML document were stored inside
> the tree as unicode strings. Inside the disassembled pickle
> they were "marked" as BINUNICODE. What encoding is used to pickle
> unicode strings (looks like utf-8 rather than UCS-2)?

Yes, it's UTF-8.  Note that pickletools.py is meant to be "executable
documentation":  there's little about pickles you can't learn from reading
it.  If you search the source file for BINUNICODE, you'll find this:

    I(name='BINUNICODE',
      code='X',
      arg=unicodestring4,
      stack_before=[],
      stack_after=[pyunicode],
      proto=1,
      doc="""Push a Python Unicode string object.

      There are two arguments:  the first is a 4-byte little-endian
      signed int giving the number of bytes in the string.  The second is
      that many bytes, and is the UTF-8 encoding of the Unicode string.
      """),

It took an enormous amount of time to reverse-engineer and document all this
stuff, so I'm keen that people know they don't have to do that from scratch
every time anymore <wink>.