[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

Tue Jan 16 08:12:46 EST 2007

Andreas Jung wrote:
> 
> 
> --On 15. Januar 2007 22:15:46 +0100 Martijn Faassen 
[snip]
>>> I still don't see what should be ambiguous with this approach.
>>
>> Ambiguous in that the string seems to say it's in two encodings at once.
>> You're then "guessing": you're letting the Python string type trump the
>> declaration. Then, since we've shown that leads to bugs, you propose
>> actually change the encoding declaration of the XML document. I wonder
>> what people then expect to happen upon serialization. In effect, your
>> proposal would, I think, serialize to UTF-8 only, right? (in which case
>> the encoding declaration can be dropped as it's the default).
> 
> When you download a ZPT through FTP/WebDAV then the unicode representation
> of the XML will be converted using the 'output_encoding' property of the
> corresponding ZPT which is set when uploading a new XML document (and taken
> from the preamble). So when you upload a latin1 XML file you should get 
> it back as valid latin1 through FTP/WebDAV.

Okay, understood, this makes sense in the case of the FTP/WebDAV 
support, though recoding to UTF-8 and ripping off the encoding 
declaration would also be pretty safe in case of XML.

> When you download text/xml content through the ZPublisher then the 
> ZPublisher will convert unicode textual content to some encoding which is
> either taken from an already set 'content-type: text/...; charset=XXXXX'
> HTTP Header or as fallback from the zpublisher-default-encoding property
> as defined in the zope.conf file.

And the same behavior actually applies to HTML content, right?
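
The fallback order described above (charset from an already set content-type header first, configured default second) could be sketched roughly like this in Python 3. This is only an illustration: pick_charset and serialize are made-up names, not ZPublisher API, and the default_encoding argument merely stands in for the zpublisher-default-encoding setting.

```python
from email.message import Message


def pick_charset(content_type, default_encoding):
    """Return the charset named in a Content-Type value, else the default."""
    if content_type:
        msg = Message()
        msg['Content-Type'] = content_type
        charset = msg.get_content_charset()  # lowercased, or None if absent
        if charset:
            return charset
    return default_encoding


def serialize(text, content_type=None, default_encoding='utf-8'):
    """Encode unicode text for output (hypothetical helper)."""
    return text.encode(pick_charset(content_type, default_encoding))
```

Nothing in this sketch is specific to text/xml, so HTML content would go through the same path.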

> So the application can specify in both case the encoding of the serialized
> XML content. Where is the problem?

What I'm trying to express here is that this stuff should not be treated 
as "where is the problem?" but should be thought through carefully as 
this is extremely easy to do wrong. I'll think it through carefully 
here. Let's list some cases:

A) FTP download: stored XML gets downloaded through FTP/WebDAV support.

B) FTP upload: external XML gets uploaded through FTP/WebDAV

C) parse: stored XML is parsed inside of Zope by the page template engine.

D) publisher download: stored XML is downloaded as text/xml directly 
through the publisher

E) ZPT inclusion: stored XML is included in another page template, for 
instance to present it in a text area.

F) form submit: Text area is then saved and needs to be stored again.

Andreas Jung proposal (speculation)
===================================

As far as I understand it you're proposing:

* store XML as unicode text

* separately store the encoding on the page template object

* also keep the encoding="" bit in the XML preamble when storing.

Let's go through the cases:

A) FTP download: encode this to whatever encoding is stored on the ZPT 
object using Python unicode support. No encoding mangling necessary.

B) FTP upload: read encoding="" bit and store this on ZPT. Then decode 
to unicode using that encoding. Could not be implemented by a 
parse/serialization step without extra encoding="" manipulation 
afterwards (after decoding to unicode).

C) parse: Rip out the 'encoding=""' bit before you send it into the 
parser. Encode to UTF-8 just before entering the parser.

D) publisher download: Rip out the 'encoding=""' bit. Then encode 
according to response header (or zope.conf). Then add back an encoding="" 
bit naming the output encoding if the output is non-UTF-8 (not Python 
names like 'latin1' but encoding identifiers XML is aware of).

E) ZPT inclusion: Send the unicode text to the page template. 
encoding="" bit will be presented in the editor.

F) form submit: decode to unicode according to encoding of page that 
displayed edit form and store it. Read 'encoding=' bit and store it in 
ZPT object. Don't manipulate 'encoding=""' bit in XML.

encoding="" removal: C, D
encoding="" adding: D
encoding="" reading: B, F
encode from unicode: A, C, D
decode to unicode: B, F

no encoding="" manipulation required: A, E
no recoding required: E
straightforward: E

The forms editor scenario (E and F) is potentially confusing as the user 
may be tempted by the ability to use encoding="" to paste latin-1 XML 
text. Editor could say it only wants it in whatever encoding the page is 
in, though.
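
To make cases A, B and C of this proposal concrete, here is a minimal Python 3 sketch. It is my reading of the proposal, not actual Zope code: StoredTemplate, upload, ftp_download and bytes_for_parser are invented names, the regular expressions only handle a declaration at the very start of the document, and an ASCII-compatible upload encoding is assumed (so no UTF-16).

```python
import re

# encoding pseudo-attribute in the XML declaration, bytes and text flavours
_SNIFF = re.compile(
    rb'''<\?xml[^>]*?\sencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']''')
_ENC_ATTR = re.compile(
    r'''(<\?xml[^>]*?)\s+encoding\s*=\s*["'][A-Za-z][A-Za-z0-9._-]*["']''')


class StoredTemplate:
    """Hypothetical stand-in for a ZPT: unicode text plus stored encoding."""

    def __init__(self, text, output_encoding):
        self.text = text
        self.output_encoding = output_encoding


def upload(raw):
    """Case B: read encoding="", store it, decode; text is left untouched."""
    m = _SNIFF.match(raw)
    encoding = m.group(1).decode('ascii') if m else 'utf-8'
    return StoredTemplate(raw.decode(encoding), encoding)


def ftp_download(tpl):
    """Case A: encode with the stored encoding; declaration still matches."""
    return tpl.text.encode(tpl.output_encoding)


def bytes_for_parser(tpl):
    """Case C: rip out encoding="" and hand UTF-8 bytes to the parser."""
    return _ENC_ATTR.sub(r'\1', tpl.text, count=1).encode('utf-8')
```

Note that the declaration has to be edited on every parse here, which is the extra encoding="" manipulation counted in the summary above.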

Martijn Faassen proposal
========================

If you rip out the encoding before data is stored in the page template 
and then store as unicode, then we have the following cases:

A) FTP download: Encode to UTF-8, output in UTF-8 only. No encoding 
mangling necessary.

B) FTP upload: read encoding="" bit and decode to unicode accordingly. 
Rip out encoding="". Could be done by a parse/serialization step, then 
decode result to unicode.

C) parse: encode to UTF-8 just before entering the parser.

D) publisher download: Encode according to response header or zope.conf. 
Add in encoding="" if output is non-UTF-8 using XML names for encoding.

E) ZPT inclusion: send unicode text to the page template. No encoding="" 
bit will be in the XML presented in the editor.

F) form submit: Rip out any encoding="" before storing, ignoring it as 
XML was in output encoding, then convert to unicode using input encoding.

encoding="" removal: B, F
encoding="" adding: D
encoding="" reading: B
encode from unicode: A, C, D
decode to unicode: B, F

no encoding="" manipulation required: A, C, E
no recoding required: E
straightforward: E

No storage of encoding information on ZPT object is necessary.

Case B) is potentially confusing, as upon re-download the XML document will 
be recoded to UTF-8 (though XML editors should be able to deal with this as 
it's the default).

Form edit still potentially confusing as encoding="" bit disappears, but 
at least suggestion to user is not made that information *presented* in 
a textarea is in a particular encoding specified in the encoding="" bit.
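
A rough Python 3 sketch of cases B and D under this proposal follows. The function names and the XML_NAMES table are assumptions made for illustration; the table only covers two codecs and would need to be filled in for real use, and the regular expressions assume the declaration sits at the very start of an ASCII-compatible document.

```python
import codecs
import re

_SNIFF = re.compile(
    rb'''<\?xml[^>]*?\sencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']''')
_ENC_ATTR = re.compile(r'''(<\?xml[^>]*?)\s+encoding\s*=\s*["'][^"']*["']''')
_VERSION = re.compile(r'''<\?xml\s+version\s*=\s*["'][^"']*["']''')

# Python codec name -> name XML knows about (illustrative, incomplete)
XML_NAMES = {'iso8859-1': 'ISO-8859-1', 'ascii': 'US-ASCII'}


def store_upload(raw):
    """Case B: decode per encoding="", then rip the attribute out."""
    m = _SNIFF.match(raw)
    encoding = m.group(1).decode('ascii') if m else 'utf-8'
    return _ENC_ATTR.sub(r'\1', raw.decode(encoding), count=1)


def publish(text, charset='utf-8'):
    """Case D: encode for output; add encoding="" only when non-UTF-8."""
    codec_name = codecs.lookup(charset).name
    if codec_name != 'utf-8':
        attr = ' encoding="%s"' % XML_NAMES.get(codec_name, charset)
        if _VERSION.match(text):
            text = _VERSION.sub(lambda m: m.group(0) + attr, text, count=1)
        else:
            text = '<?xml version="1.0"%s?>\n%s' % (attr, text)
    return text.encode(charset)
```

FTP download (case A) is then just publish(text) with the default charset, so the stored text never needs to be touched on the way out.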

Tres Seaver proposal (speculation)
==================================

Storage in UTF-8.

A) FTP download: output in UTF-8 only, can be done directly.

B) FTP upload: read encoding="" bit and, if not UTF-8, decode to unicode 
accordingly. Then recode to UTF-8. Rip out encoding="". Could be done by 
an XML parse/serialization step.

C) parse: just pass UTF-8 to parser.

D) publisher download: Decode to unicode. Then recode to desired output 
encoding (with XML names for encoding added in the encoding="" bit).

E) ZPT inclusion: Decode text to unicode. No encoding="" bit will be in 
the XML presented in the editor.

F) form submit: Rip out any encoding="" before storing, ignoring it as 
XML was in output encoding, then convert to unicode using that encoding, 
then convert again to UTF-8.

encoding="" removal: B, F
encoding="" adding: D
encoding="" reading: B
encode from unicode: B, D, F
decode to unicode: B, D, F

no encoding="" manipulation required: A, C, E
no recoding required: A, C (B and F if UTF-8 uploaded)
straightforward: A, C

No storage of encoding information in ZPT object is necessary.

Case B) is potentially confusing, as upon re-download the XML document will 
be recoded to UTF-8 (though XML editors should be able to deal with this as 
it's the default).

Form edit still potentially confusing as encoding="" bit disappears, but 
at least suggestion to user is not made that information *presented* in 
a textarea is in a particular encoding specified in the encoding="" bit.
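
The parse/serialization step mentioned for case B could look roughly like this in Python 3, using the standard library's ElementTree purely as an example parser. recode_to_utf8 is a made-up name. Be aware that a round trip through ElementTree is lossy: the declaration, any doctype and comments are dropped, and namespace prefixes may be rewritten, which would matter for real page templates.

```python
import xml.etree.ElementTree as ET


def recode_to_utf8(raw):
    """Case B: parse using the declared encoding, serialize as UTF-8.

    Serializing with encoding='unicode' yields text without an XML
    declaration, so the encoding="" bit disappears as a side effect.
    """
    root = ET.fromstring(raw)
    return ET.tostring(root, encoding='unicode').encode('utf-8')
```

Cases A and C can then use the stored bytes directly, and D and E start from stored.decode('utf-8').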

Just store the XML text
=======================

Store the XML text literally as received. Maybe this is actually what Tres 
meant. :)

A) FTP download: output can be done directly.

B) FTP upload: store input directly

C) parse: just pass text to parser.

D) publisher download: Decode to unicode using encoding="" bit. Remove 
encoding bit. Then recode to desired output encoding (with XML names for 
encoding added in the encoding="" bit).

E) ZPT inclusion: Decode text to unicode using encoding="" bit.

F) form submit: Encode text in form from unicode according to 
encoding="" bit.

encoding="" removal: D
encoding="" adding: D
encoding="" reading: B, D, E, F
encode from unicode: D, F
decode to unicode: D, E

no encoding="" manipulation required: A, C (but B, E, F only reading)
no recoding required: A, B, C
straightforward: A, C

No storage of encoding information in ZPT object is necessary, though it 
could be done to optimize extraction of encoding="".

Form edit potentially confusing as in Andreas Jung scenario.
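
A small Python 3 sketch of this scenario, covering B, E and F, with the encoding lookup cached as suggested. RawXMLTemplate and its methods are hypothetical names, and the sniffing again assumes a declaration at the very start of an ASCII-compatible document, falling back to UTF-8 when there is none.

```python
import re

_SNIFF_BYTES = re.compile(
    rb'''<\?xml[^>]*?\sencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']''')
_SNIFF_TEXT = re.compile(
    r'''<\?xml[^>]*?\sencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']''')


class RawXMLTemplate:
    """Hypothetical template that keeps the uploaded bytes untouched."""

    def __init__(self, raw):
        self.raw = raw            # case B: store input directly
        self._encoding = None     # cached result of reading encoding=""

    @property
    def encoding(self):
        if self._encoding is None:
            m = _SNIFF_BYTES.match(self.raw)
            self._encoding = m.group(1).decode('ascii') if m else 'utf-8'
        return self._encoding

    def as_unicode(self):
        """Case E: unicode view for display, e.g. in a text area."""
        return self.raw.decode(self.encoding)

    def update_from_form(self, text):
        """Case F: encode submitted text per its own encoding="" bit."""
        m = _SNIFF_TEXT.match(text)
        encoding = m.group(1) if m else 'utf-8'
        self.raw = text.encode(encoding)
        self._encoding = encoding
```

The text returned by as_unicode() still carries the encoding="" bit even though it is no longer in that encoding, which is the misleading part of the edit form.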

..............

Any use cases I missed or got wrong? The scenarios are all complicated. :)

The "Andreas Jung" scenario has the "leave the XML text alone except make 
it unicode" goal in mind, but actually ends up messing about with 
"encoding=""" more than the other scenarios.

The "Martijn Faassen" scenario tries to follow the rule: decode to 
unicode on input, get rid of encoding="" in XML, and encode only on 
output as much as possible, with the exception of the parser call.

The Tres Seaver scenario as I sketched it has the "turn the XML into 
UTF-8" goal. It needs to do recoding less frequently than the other 
scenarios, though more frequently than one would hope.

The "just store the XML" scenario is surprisingly nice. It only needs 
attention to encoding and decoding in the always complicated ZPublisher 
direct output scenario, and in the edit form scenario.

The "just store XML" proposal starts to look attractive. It requires 
very little actual XML text manipulation, only in D, and while it does 
require more reading of the encoding="" bit, this can be cached and at 
least doesn't require string manipulation. Care can be taken that there 
is an API to represent the XML as unicode strings - this is done for 
display purposes only (clearly human readable text) and this is the only 
case where the encoding="" bit is rather misleading.

Regards,

Martijn