[Zope3-dev] Space usage of unicode strings in the ZODB

Tim Peters tim@zope.com
Thu, 14 Feb 2002 13:58:42 -0500


[Andreas Jung]
> Following the discussion about whether to use unicode strings
> or UTF-X encoded strings in Zope 3, I ran some tests
> to get an idea of the space usage of unicode
> strings in the ZODB.
>
> Test input was a Latin-1 document (2.6 MB, 374,000 words).
> The list of words was stored in a standalone ZODB.
>
> Results:
>
> String encodings:
> UTF-7              4.5 MB
> UTF-8              4.4 MB
> UTF-16             7.6 MB
> UTF-32             unknown (Python does not seem to support
>                    this encoding?)
>
> Unicode strings:
> internal UCS-2 encoding         5.4 MB
> internal UCS-4 encoding         5.4 MB
>
>
> I am astonished that unicode strings require the same space -
> independent of their internal storage in Python (2 vs. 4 bytes).

I don't understand what you're doing well enough to say for sure, but
wouldn't any such test just be measuring how cPickle encodes strings?  It's
not entirely clear, but I assume you're measuring final database size, and
not (e.g.) process memory size, and I see that binary pickles always convert
Unicode strings to UTF-8.  So Python's internal representation should be
irrelevant.  Unlike storing UTF-8 strings directly, though, BINUNICODE
appears always to put 5 bytes (a 1-byte opcode plus a 4-byte length) in
front of the string data, so is less disk-efficient for short strings than
pickle's binary SHORT_BINSTRING encoding, which needs only 2 (a 1-byte
opcode plus a 1-byte length).
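
The per-string overhead can be checked with the standard pickletools
module.  What follows is only a sketch under some assumptions: it is written
for a current Python 3, where an 8-bit string is a bytes object and
SHORT_BINSTRING is no longer emitted, so SHORT_BINBYTES (protocol 3) stands
in for it as the opcode with the same 1-byte-opcode-plus-1-byte-length
layout.  The helper name payload_overhead and the sample word are made up
for illustration.

```python
import pickle
import pickletools


def payload_overhead(obj, protocol, opcode_name):
    """Bytes the named opcode spends beyond the raw string payload.

    Hypothetical helper: walks the pickle opcode stream and measures the
    distance to the next opcode, minus the payload length.
    """
    data = pickle.dumps(obj, protocol=protocol)
    ops = list(pickletools.genops(data))
    for i, (op, arg, pos) in enumerate(ops):
        if op.name == opcode_name:
            next_pos = ops[i + 1][2]  # start offset of the following opcode
            raw = arg.encode("utf-8") if isinstance(arg, str) else arg
            return (next_pos - pos) - len(raw)
    raise ValueError("opcode %s not found in pickle" % opcode_name)


word = "Zürich"  # a short Latin-1 word; 7 bytes once encoded as UTF-8

# Unicode string in a binary pickle: BINUNICODE, payload stored as UTF-8.
unicode_cost = payload_overhead(word, 1, "BINUNICODE")

# Pre-encoded UTF-8 stored as a short 8-bit string.
# Assumption: SHORT_BINBYTES is used here as the Python 3 analogue of
# SHORT_BINSTRING.
bytes_cost = payload_overhead(word.encode("utf-8"), 3, "SHORT_BINBYTES")

print("BINUNICODE overhead:    ", unicode_cost, "bytes")
print("SHORT_BINBYTES overhead:", bytes_cost, "bytes")
print("extra cost per short unicode string:", unicode_cost - bytes_cost)
```

With Andreas's figures, 3 extra bytes for each of 374,000 words comes to
about 1.1 MB, which is at least in the neighborhood of the 1.0 MB gap
between the UTF-8 result (4.4 MB) and the unicode result (5.4 MB).  This is
only a rough plausibility check, since the exact layout of the test database
is not given.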