[Zope3-dev] Space usage of unicode strings in the ZODB
Tim Peters
tim@zope.com
Thu, 14 Feb 2002 13:58:42 -0500
[Andreas Jung]
> Based on the discussion about whether to use unicode strings
> or UTF-X encoded strings in Zope 3, I ran some tests
> to get an idea of the space usage of unicode
> strings in the ZODB.
>
> Test input was a Latin-1 document (2.6 MB, 374,000 words).
> The list of words was stored in a standalone ZODB.
>
> Results:
>
> String encodings:
> UTF-7 4.5 MB
> UTF-8 4.4 MB
> UTF-16 7.6 MB
> UTF-32 unknown (Python does not seem to support this encoding?)
>
> Unicode strings:
> internal UCS-2 encoding 5.4 MB
> internal UCS-4 encoding 5.4 MB
>
>
> I am astonished that unicode strings require the same space,
> independent of their internal storage in Python (2 vs. 4 bytes).
I don't understand what you're doing well enough to say for sure, but
wouldn't any such test just be measuring how cPickle encodes strings? It's
not entirely clear, but I assume you're measuring final database size, and
not (e.g.) process memory size, and I see that binary pickles always convert
Unicode strings to UTF-8. So Python's internal representation should be
irrelevant. Unlike storing UTF-8 strings directly, though, BINUNICODE
appears always to spend 5 bytes per string on overhead (a 1-byte opcode
plus a 4-byte length), so it is less disk-efficient for short strings than
pickle's binary SHORT_BINSTRING encoding, which needs only 2 (an opcode
plus a 1-byte length).
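
A minimal sketch of that overhead difference, assuming a current Python 3
rather than the Python 2 of this thread. There, pickle.dumps writes a text
string as BINUNICODE under protocol 1, but no longer emits SHORT_BINSTRING,
so that record is assembled by hand here. The variable names are made up
for illustration.

```python
import pickle
import pickletools
import struct

word = "Zope"
utf8 = word.encode("utf-8")

# Protocol 1 is the original binary pickle format. In Python 3 a text
# string is written as BINUNICODE: opcode 'X', a 4-byte little-endian
# length, then the UTF-8 bytes.
data = pickle.dumps(word, protocol=1)
opcode_names = [op.name for op, arg, pos in pickletools.genops(data)]

binunicode = b"X" + struct.pack("<I", len(utf8)) + utf8
assert data.startswith(binunicode)

# SHORT_BINSTRING (what Python 2 used for short 8-bit strings): opcode 'U',
# a 1-byte length, then the raw bytes. Built by hand, since Python 3 can
# load this opcode but does not write it.
short_binstring = b"U" + bytes([len(utf8)]) + utf8

# Per-string overhead is whatever is left once the payload is subtracted.
binunicode_overhead = len(binunicode) - len(utf8)
short_binstring_overhead = len(short_binstring) - len(utf8)

print(opcode_names)
print("BINUNICODE overhead:     ", binunicode_overhead, "bytes")
print("SHORT_BINSTRING overhead:", short_binstring_overhead, "bytes")

# The hand-built record is still a loadable pickle once STOP ('.') is added.
print(pickle.loads(short_binstring + b"."))
```

For a list of a few hundred thousand short words, a difference of 3 bytes
per word adds up to roughly a megabyte, which is the same order as the gap
between the UTF-8 and unicode figures quoted above. This does not account
for memo opcodes or anything the ZODB adds on top.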