[Zope3-Users] Re: Everything printing as ASCII?

Derrick Hudson dman at dman13.dyndns.org
Thu Aug 11 09:45:41 EDT 2005

On Wed, Aug 10, 2005 at 04:53:22PM -0300, Alec Munro wrote:
| I think it was using StringIO.
| My default python encoding was ASCII, so I changed that, and it ended
| up solving my problems, so I haven't looked further into this, but I
| assume that while all of the pieces were unicode to start with, Python
| was converting some of them to ASCII earlier on.
| I believe I'm getting the hang of this encoding thing, but don't quote
| me on that.

Python has two kinds of objects (that are relevant here):  unicode and str.
A str is simply a sequence of bytes.  The data could be ASCII encoded
text, ISO-* encoded text or any arbitrary binary data.  Typically,
when a str is used as ordinary text, the data is ASCII encoded text.
Unicode objects are not sequences of bytes, but rather sequences of
unicode characters.  How python stores those characters in memory (two
or four bytes each, depending on how the interpreter was built) is an
implementation detail that your code should not depend on.
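
To make the distinction concrete, here is a minimal sketch.  It assumes
Python 3, where the type names differ from the ones used in this
message: `bytes` plays the role of the str described above, and `str`
plays the role of unicode.  The variable names are only illustrative.

```python
# Python 3 sketch: `bytes` corresponds to the post's `str`,
# and `str` corresponds to the post's `unicode`.
text = u"\u20ac5"             # two characters: EURO SIGN and DIGIT FIVE
data = text.encode("utf-8")   # the same text as a sequence of bytes

print(type(text).__name__, len(text))   # str 2
print(type(data).__name__, len(data))   # bytes 4
print(list(data))                       # [226, 130, 172, 53]
```

The character count and the byte count differ, which is the whole
point: one is a sequence of characters, the other a sequence of numbers.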

The problem with unicode is that all the underlying infrastructure
operates on bytes, not unicode characters.  This includes memory
addressing and file and socket streams.  To reconcile this
incompatibility, people defined encodings that allow unicode
characters (and thus unicode strings) to be represented as sequences
of bytes.  The most commonly used unicode encoding is UTF-8.
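
As a rough illustration of what an encoding does, the sketch below
(again assuming Python 3) encodes three characters as UTF-8 and shows
that each needs a different number of bytes.  The sample characters are
my own choice, not anything from the original thread.

```python
samples = [u"A", u"\u00e9", u"\u20ac"]   # 'A', e-acute, euro sign

encoded = [ch.encode("utf-8") for ch in samples]
sizes = [len(b) for b in encoded]        # bytes needed per character
print(sizes)                             # [1, 2, 3]

# Decoding with the same encoding recovers the original characters.
decoded = [b.decode("utf-8") for b in encoded]
print(decoded == samples)                # True
```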

In python, streams (ie files) operate on sequences of bytes, and thus
they naturally can handle str objects.  This includes the .write()
method and the print statement.  The print statement is a little
special in that python tries to be helpful and automatically converts
any python object to a str object so it can be sent to a file.
Sometimes this does what you want, eg 'print 1' but sometimes it isn't
terribly helpful.  When it comes to unicode objects, the str()
function tries to encode the unicode characters as a sequence of
bytes.  For sanity it uses sys.getdefaultencoding() as the encoding to
use.  The default encoding is ASCII because that works everywhere and
is backward compatible to before python had unicode support.
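
The implicit encoding step can be reproduced with a small sketch.  This
is not the Python 2 print statement itself: it assumes Python 3, where
a text stream carries its own encoding instead of consulting
sys.getdefaultencoding(), and it deliberately picks ASCII to mimic the
situation described above.  The in-memory stream stands in for a file.

```python
import io

raw = io.BytesIO()                                 # a byte stream, like a file
stream = io.TextIOWrapper(raw, encoding="ascii")   # text layer that encodes as ASCII

stream.write(u"Hello World")     # every character is ASCII, so this is fine
stream.flush()
print(raw.getvalue())            # b'Hello World'

try:
    stream.write(u"\u20ac")      # the euro sign cannot be encoded as ASCII
    stream.flush()
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print("write failed, codec:", exc.encoding)
```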

To go back to some foundational stuff, an "encoding" is simply a
definition of how to convert bytes to a higher-level abstract concept,
namely characters.  Bytes are really just numbers, the most natural
concept to represent.  Characters are a more abstract concept defined
by human languages.  Characters don't have a nice well-behaved set of
rules like numbers do, which is why numbers are the basis for
everything in a computer.  Even ASCII is an encoding in that it is
just one definition of how to treat the numbers (bytes) as characters.
It just happens to be the most widely used and essentially supported
by every piece of software.  (ignore old IBM mainframes that used

ASCII has its limits, though, and that is why the ISO-* encodings were
defined and later why Unicode was defined.  The problem now is that
ASCII, while it is the lowest common denominator, cannot represent
every character.  If the characters can be represented in ASCII,
everything is fine:
    print u"Hello World"
But if not,
    print u"\u20ac"
then an exception is raised:
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac'
however using an encoding that can represent that character works:
    print u"\u20ac".encode("utf-8")

The magic is in the encode() method.  That takes the unicode object
and following the specified rules (utf-8 in this case) it returns a
sequence of bytes (a 'str' object).  As long as my environment (eg
xterm, firefox) expects and can understand UTF-8 encoded data this
works great.  However, if my environment doesn't know how to handle
utf-8 encoded characters (eg if it expects ISO-8859-1) then I will get
a garbled mess displayed.
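
Both halves of that can be sketched without a terminal.  The example
assumes Python 3 (so the result of encode() is a `bytes` object rather
than a str), and it simulates the mismatched environment by decoding
the UTF-8 bytes as ISO-8859-1.

```python
euro = u"\u20ac"

try:
    euro.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)                       # False

utf8_bytes = euro.encode("utf-8")
print(utf8_bytes)                     # b'\xe2\x82\xac'

# A reader that wrongly assumes ISO-8859-1 sees three unrelated characters.
garbled = utf8_bytes.decode("iso-8859-1")
print(len(garbled), ascii(garbled))   # 3 '\xe2\x82\xac'
```

One character went in and three came out, which is the garbled mess
you see when the two sides disagree about the encoding.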

To summarize, python's 'print' statement automatically converts any
object to a string using str().  For unicode objects, this means using
sys.getdefaultencoding().  An encoding is simply a set of rules on how
to convert between bytes (simple numbers) and characters (higher-level
abstract concepts).  The trick is simply to use the right encoding and
to encode or decode where appropriate.
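
One way to read "encode or decode where appropriate" is: decode bytes
as they come in, work with characters in the middle, and encode again
on the way out.  The sketch below assumes Python 3 and UTF-8 on both
ends; `shout` is a hypothetical helper made up for this example.

```python
def shout(raw, encoding="utf-8"):
    """Hypothetical helper: decode on input, work on text, encode on output."""
    text = raw.decode(encoding)          # bytes -> characters
    result = text.upper() + u"!"         # operate on characters only
    return result.encode(encoding)       # characters -> bytes

incoming = u"caf\u00e9".encode("utf-8")
print(shout(incoming))                   # b'CAF\xc3\x89!'
```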


If I receive a message from you, you are agreeing that:
   1. I am by definition, "the intended recipient"
   2. All information in the email is mine to do with as I see fit and make
        such financial profit, political mileage, or good joke as it lends
        itself to. In particular, I may quote it on USENET or the WWW.
   3. I may take the contents as representing the views of your company.
   4. This overrides any disclaimer or statement of confidentiality that may
        be included on your message
www: http://dman13.dyndns.org/~dman/            jabber: dman at dman13.dyndns.org