[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

Tue Jan 16 10:39:04 EST 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Martijn Faassen wrote:
> Andreas Jung wrote:
>>
>> --On 15. Januar 2007 22:15:46 +0100 Martijn Faassen 
> [snip]
>>>> I still don't see what should be ambiguous with this approach.
>>> Ambiguous in that the string seems to say it's in two encodings at once.
>>> You're then "guessing": you're letting the Python string type trump the
>>> declaration. Then, since we've shown that leads to bugs, you propose to
>>> actually change the encoding declaration of the XML document. I wonder
>>> what people then expect to happen upon serialization. In effect, your
>>> proposal would, I think, serialize to UTF-8 only, right? (in which case
>>> the encoding declaration can be dropped as it's the default).
>> When you download a ZPT through FTP/WebDAV then the unicode representation
>> of the XML will be converted using the 'output_encoding' property of the
>> corresponding ZPT which is set when uploading a new XML document (and taken
>> from the preamble). So when you upload a latin1 XML file you should get 
>> it back as valid latin1 through FTP/WebDAV.
> 
> Okay, understood, this makes sense in the case of the FTP/WebDAV 
> support, though recoding to UTF-8 and ripping off the encoding 
> declaration would also be pretty safe in case of XML.
> 
>> When you download text/xml content through the ZPublisher then the 
>> ZPublisher will convert unicode textual content to some encoding which is
>> either taken from an already set 'content-type: text/...; charset=XXXXX'
>> HTTP Header or as fallback from the zpublisher-default-encoding property
>> as defined in the zope.conf file.
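
A minimal sketch of the charset choice described above, assuming the header value is available as a plain string. The function name `response_charset` and its `default_encoding` parameter are hypothetical stand-ins for the zope.conf setting; this is not the ZPublisher code.

```python
from email.message import Message


def response_charset(content_type, default_encoding):
    """Pick the output encoding for textual content.

    Prefers an explicit 'charset' parameter on a text/* Content-Type
    header and otherwise falls back to the configured default.
    """
    if content_type:
        msg = Message()
        msg["Content-Type"] = content_type
        if msg.get_content_maintype() == "text":
            charset = msg.get_content_charset()
            if charset:
                return charset
    return default_encoding
```

The stdlib header parser returns the charset in lowercase, so callers comparing encodings should not rely on the original spelling.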
> 
> And the same behavior actually applies to HTML content, right?
> 
>> So the application can specify in both cases the encoding of the serialized
>> XML content. Where is the problem?
> 
> What I'm trying to express here is that this stuff should not be treated 
> as "where is the problem?" but should be thought through carefully as 
> this is extremely easy to do wrong. I'll think it through carefully 
> here. Let's list some cases:
> 
> A) FTP download: stored XML gets downloaded through FTP/WebDAV support.
> 
> B) FTP upload: external XML gets uploaded through FTP/WebDAV
> 
> C) parse: stored XML is parsed inside of Zope by the page template engine.
> 
> D) publisher download: stored XML is downloaded as text/xml directly 
> through the publisher
> 
> E) ZPT inclusion: stored XML is included in another page template, for 
> instance to present it in a text area.
> 
> F) form submit: Text area is then saved and needs to be stored again.
> 
> Andreas Jung proposal (speculation)
> ===================================
> 
> As far as I understand it you're proposing:
> 
> * store XML as unicode text
> 
> * separately store the encoding on the page template object
> 
> * also keep the encoding="" bit in the XML preamble when storing.
> 
> Let's go through the cases
> 
> A) FTP download: encode this to whatever encoding is stored on the ZPT 
> object using Python unicode support. No encoding mangling necessary.
> 
> B) FTP upload: read encoding="" bit and store this on ZPT. Then decode 
> to unicode using that encoding. Could not be implemented by a 
> parse/serialization step without extra encoding="" manipulation 
> afterwards (after decoding to unicode).
> 
> C) parse: Rip out the 'encoding=""' bit before you send it into the 
> parser. Encode to UTF-8 just before entering the parser.
> 
> D) publisher download: Rip out the 'encoding=""' bit. Then encode 
> according to response header (or zope.conf). Then add back encoding="" 
> bit stating if output is non-UTF-8 (not Python names like 'latin1' but 
> encoding identifiers XML is aware of).
> 
> E) ZPT inclusion: Send the unicode text to the page template. 
> encoding="" bit will be presented in the editor.
> 
> F) form submit: decode to unicode according to encoding of page that 
> displayed edit form and store it. Read 'encoding=' bit and store it in 
> ZPT object. Don't manipulate 'encoding=""' bit in XML.
> 
> encoding="" removal: C, D
> encoding="" adding: D
> encoding="" reading: B, F
> encode from unicode: A, C, D
> decode to unicode: B, F
> 
> no encoding="" manipulation required: A, E
> no recoding required: E
> straightforward: E
> 
> The forms editor scenario (E and F) is potentially confusing as the user 
> may be tempted by the ability to use encoding="" to paste latin-1 XML 
> text. Editor could say it only wants it in whatever encoding the page is 
> in, though.
> 
> Martijn Faassen proposal
> ========================
> 
> If you rip out the encoding before data is stored in the page template 
> and then store as unicode, then we have the following cases:
> 
> A) FTP download: Encode to UTF-8, output in UTF-8 only. No encoding 
> mangling necessary.
> 
> B) FTP upload: read encoding="" bit and decode to unicode accordingly. 
> Rip out encoding="". Could be done by a parse/serialization step, then 
> decode result to unicode.
> 
> C) parse: encode to UTF-8 just before entering the parser.
> 
> D) publisher download: Encode according to response header or zope.conf. 
> Add in encoding="" if output is non-UTF-8 using XML names for encoding.
> 
> E) ZPT inclusion: send unicode text to the page template. No encoding="" 
> bit will be in the XML presented in the editor.
> 
> F) form submit: Rip out any encoding="" before storing, ignoring it as 
> XML was in output encoding, then convert to unicode using input encoding.
> 
> encoding="" removal: B, F
> encoding="" adding: D
> encoding="" reading: B
> encode from unicode: A, C, D
> decode to unicode: B, F
> 
> no encoding="" manipulation required: A, C, E
> no recoding required: E
> straightforward: E
> 
> No storage of encoding information on ZPT object is necessary.
> 
> Case B) potentially confusing as upon re-download XML document will be 
> recoded to UTF-8 (though XML editors should be able to deal with this as 
> it's the default).
> 
> Form edit still potentially confusing as encoding="" bit disappears, but 
> at least suggestion to user is not made that information *presented* in 
> a textarea is in a particular encoding specified in the encoding="" bit.
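
A rough sketch of the upload step (case B) under this proposal, written in Python 3 where `bytes` and `str` stand in for the 8-bit and unicode strings discussed here. The name `decode_upload` and the regular expression are illustrative assumptions, not Zope code, and the sketch only handles ASCII-compatible encodings (no BOM or UTF-16 detection).

```python
import re

# XML declaration at the very start of an ASCII-compatible byte string.
_XML_DECL = re.compile(
    rb'''\A<\?xml\s+version\s*=\s*(["'])(?P<version>[^"']+)\1'''
    rb'''(?:\s+encoding\s*=\s*(["'])(?P<encoding>[A-Za-z][A-Za-z0-9._-]*)\3)?'''
    rb'''(?:\s+standalone\s*=\s*(["'])(?P<standalone>yes|no)\5)?'''
    rb'''\s*\?>'''
)


def decode_upload(data, default="utf-8"):
    """Decode uploaded XML to text and drop the encoding="" bit."""
    match = _XML_DECL.match(data)
    if match is None:
        # No declaration at all: XML's default applies.
        return data.decode(default)
    declared = match.group("encoding")
    encoding = declared.decode("ascii") if declared else default
    # Rebuild the declaration without the encoding pseudo-attribute.
    decl = '<?xml version="%s"' % match.group("version").decode("ascii")
    if match.group("standalone"):
        decl += ' standalone="%s"' % match.group("standalone").decode("ascii")
    decl += "?>"
    return decl + data[match.end():].decode(encoding)
```

An unknown encoding name raises `LookupError` and undecodable bytes raise `UnicodeDecodeError`; a real upload handler would have to decide how to report both.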
> 
> Tres Seaver proposal (speculation)
> ==================================
> 
> Storage in UTF-8.
> 
> A) FTP download: output in UTF-8 only, can be done directly.
> 
> B) FTP upload: read encoding="" bit and, if not UTF-8, decode to unicode 
> accordingly. Then recode to UTF-8. Rip out encoding="". Could be done by 
> an XML parse/serialization step.
> 
> C) parse: just pass UTF-8 to parser.
> 
> D) publisher download: Decode to unicode. Then recode to desired output 
> encoding (with XML names for encoding added in the encoding="" bit).
> 
> E) ZPT inclusion: Decode text to unicode. No encoding="" bit will be in 
> the XML presented in the editor.
> 
> F) form submit: Rip out any encoding="" before storing, ignoring it as 
> XML was in output encoding, then convert to unicode using that encoding, 
> then convert again to UTF-8.
> 
> encoding="" removal: B, F
> encoding="" adding: D
> encoding="" reading: B
> encode from unicode: B, D, F
> decode to unicode: B, D, F
> 
> no encoding="" manipulation required: A, C, E
> no recoding required: A, C (B and F if UTF-8 uploaded)
> straightforward: A, C
> 
> No storage of encoding information in ZPT object is necessary.
> 
> Case B) potentially confusing as upon re-download XML document will be 
> recoded to UTF-8 (though XML editors should be able to deal with this as 
> it's the default).
> 
> Form edit still potentially confusing as encoding="" bit disappears, but 
> at least suggestion to user is not made that information *presented* in 
> a textarea is in a particular encoding specified in the encoding="" bit.
> 
> Just store the XML text
> =======================
> 
> Store XML text literally as received. Maybe this is actually what Tres 
> meant. :)
> 
> A) FTP download: output can be done directly.
> 
> B) FTP upload: store input directly
> 
> C) parse: just pass text to parser.
> 
> D) publisher download: Decode to unicode using encoding="" bit. Remove 
> encoding bit. Then recode to desired output encoding (with XML names for 
> encoding added in the encoding="" bit).
> 
> E) ZPT inclusion: Decode text to unicode using encoding="" bit.
> 
> F) form submit: Encode text in form from unicode according to 
> encoding="" bit.
> 
> encoding="" removal: D
> encoding="" adding: D
> encoding="" reading: B, D, E, F
> encode from unicode: D, F
> decode to unicode: D, E
> 
> no encoding="" manipulation required: A, C (but B, E, F only reading)
> no recoding required: A, B, C
> straightforward: A, C
> 
> No storage of encoding information in ZPT object is necessary, though 
> could be done to optimize extraction of encoding=""
> 
> Form edit potentially confusing as in Andreas Jung scenario.
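
Case D in this scenario is the one place where the text itself gets rewritten, so here is a sketch of it in Python 3. Everything named here (`recode_for_response`, `xml_encoding_name`, the small `_XML_NAMES` table) is an assumption made for illustration; the table only covers a few codecs and a real one would need to be longer.

```python
import codecs
import re

_DECL_ENCODING = re.compile(
    r'''\A(<\?xml[^>]*?)\s+encoding\s*=\s*["']([^"']*)["']''')
_DECL_VERSION = re.compile(r'''\A(<\?xml\s+version\s*=\s*["'][^"']*["'])''')

# Python codec name -> name an XML parser is expected to recognise.
_XML_NAMES = {
    "utf-8": "UTF-8",
    "iso8859-1": "ISO-8859-1",
    "ascii": "US-ASCII",
    "cp1252": "windows-1252",
}


def xml_encoding_name(python_name):
    """Translate e.g. 'latin1' into 'ISO-8859-1'."""
    canonical = codecs.lookup(python_name).name
    try:
        return _XML_NAMES[canonical]
    except KeyError:
        raise ValueError("no XML name known for %r" % python_name)


def recode_for_response(stored, target):
    """Recode stored XML bytes to `target`, fixing up the declaration."""
    # latin-1 maps every byte to a character, so the declaration can be
    # read before the real encoding is known.
    head = _DECL_ENCODING.match(stored[:200].decode("latin-1"))
    source = head.group(2) if head else "utf-8"
    text = _DECL_ENCODING.sub(r"\1", stored.decode(source), count=1)
    name = xml_encoding_name(target)
    if name != "UTF-8":
        if text.startswith("<?xml"):
            text = _DECL_VERSION.sub(
                lambda m: '%s encoding="%s"' % (m.group(1), name),
                text, count=1)
        else:
            text = '<?xml version="1.0" encoding="%s"?>\n%s' % (name, text)
    return text.encode(target)
```

Characters that the target encoding cannot represent raise `UnicodeEncodeError` here; falling back to character references would be a separate decision.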
> 
> ..............
> 
> Any use cases I missed or got wrong? The scenarios are all complicated. :)
> 
> 
> The "Andreas Jung" scenario has "leave the XML text alone except make it 
> unicode" goal in mind, but actually ends up messing about with 
> "encoding=""" more than the other scenarios.
> 
> The "Martijn Faassen" scenario tries to follow the rule: decode to 
> unicode on input, get rid of encoding="" in XML, and encode only on 
> output as much as possible, with the exception of the parser call.
> 
> The Tres Seaver scenario as I sketched it has the "turn the XML into 
> UTF-8" goal. It needs to do recoding less frequently than the other 
> scenarios, though more frequently than one would hope.
> 
> The "just store the XML" scenario is surprisingly nice. It only needs 
> attention to encoding and decoding in the always complicated ZPublisher 
> direct output scenario, and in the edit form scenario.

As you speculated, this is actually my preference, except that I don't
see the need in scenario D to recode the data and strip the prolog
encoding attribute.  Why wouldn't we just use the XML template's own
declared encoding to encode any data substituted into the template?  I
mean, if the user has marked up the document to indicate a "preferred"
encoding, why should we bother storing such an encoding in another location?

Then the only time we would need to munge the document would be at
inclusion time, which is the only time we actually *need* to have
unicode in hand.  We might even elide the decode-recode stage if the
target document uses the same encoding!  Such an optimization might
not be worth the complexity, however.
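
To make that concrete, a small Python 3 sketch, assuming templates are stored as bytes exactly as received. The helper names (`declared_encoding`, `encode_for_template`, `recode_fragment`) are made up for illustration and are not part of zope.tal or zope.pagetemplate.

```python
import codecs
import re

_DECLARED = re.compile(
    rb'''\A<\?xml[^>]*?\sencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']''')


def declared_encoding(template, default="utf-8"):
    """Return the encoding named in the template's own XML declaration."""
    match = _DECLARED.match(template)
    return match.group(1).decode("ascii") if match else default


def encode_for_template(template, value):
    """Encode substituted text with the template's declared encoding."""
    return value.encode(declared_encoding(template))


def same_encoding(a, b):
    # codecs.lookup normalises aliases such as 'latin1' and 'ISO-8859-1'.
    return codecs.lookup(a).name == codecs.lookup(b).name


def recode_fragment(fragment, source, target):
    """Recode included bytes, skipping the round trip when it is a no-op."""
    if same_encoding(source, target):
        return fragment
    return fragment.decode(source).encode(target)
```

The shortcut in `recode_fragment` is the optional optimization; dropping the `same_encoding` check gives the same bytes at the cost of a decode and an encode.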

Note that in the inclusion case (scenario E), we almost certainly
*should* be stripping the *entire* prolog, which is only valid at the
start of the merged document.  I guess there is a subscenario, which is
that the "included" document is actually the 'main_template' supplying
the prolog:  METAL should probably leave the prolog alone, while
'tal:replace' and 'tal:content' (with 'structure') would strip it?
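
A sketch of what stripping at inclusion time could look like, operating on the already-decoded text. The function `prepare_for_inclusion` and its `keep_prolog` flag are hypothetical; the pattern is deliberately simplified and only removes an XML declaration and a DOCTYPE without an internal subset (comments and processing instructions in the prolog are left alone).

```python
import re

_PROLOG = re.compile(
    r'''\A\ufeff?\s*'''                  # optional BOM and leading whitespace
    r'''(?:<\?xml\s[^>]*?\?>\s*)?'''     # XML declaration
    r'''(?:<!DOCTYPE\s[^\[>]*>\s*)?'''   # DOCTYPE without an internal subset
)


def prepare_for_inclusion(text, keep_prolog=False):
    """Return `text` ready to be merged into another document.

    keep_prolog=True models the sub-scenario where the included document
    is the main template that supplies the prolog itself.
    """
    if keep_prolog:
        return text
    return _PROLOG.sub("", text, count=1)
```

A DOCTYPE with an internal subset does not match the simplified pattern and is left in place, so a real implementation would want the parser to locate the end of the prolog instead.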

> The "just store XML" proposal starts to look attractive. It requires 
> very little actual XML text manipulation, only in D, and while it does 
> require more reading of the encoding="" bit, this can be cached and at 
> least doesn't require string manipulation. Care can be taken that there 
> is an API to represent the XML as unicode strings - this is done for 
> display purposes only (clearly human readable text) and this is the only 
> case where the encoding="" bit is rather misleading.
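
One possible shape for such an API, as a Python 3 sketch. The class `StoredXML` is hypothetical and only illustrates caching the declared encoding and exposing a unicode view for display; it assumes an ASCII-compatible encoding and Python 3.8+ for `cached_property`.

```python
import re
from functools import cached_property  # Python 3.8+

_DECLARED = re.compile(
    rb'''\A<\?xml[^>]*?\sencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']''')


class StoredXML:
    """Hypothetical wrapper around XML bytes stored exactly as received."""

    def __init__(self, raw):
        self.raw = raw

    @cached_property
    def encoding(self):
        # Read the encoding="" bit once; later accesses reuse the cached value.
        match = _DECLARED.match(self.raw)
        return match.group(1).decode("ascii") if match else "utf-8"

    def as_unicode(self):
        # For display only: the declaration is kept as-is, so it may not
        # describe the encoding of the page that ends up showing this text.
        return self.raw.decode(self.encoding)
```

Because the cache lives on the instance, replacing the stored bytes should create a new `StoredXML` rather than mutate `raw` in place.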

Best,

Tres.
- --
===================================================================
Tres Seaver          +1 540-429-0999          tseaver at palladion.com
Palladion Software   "Excellence by Design"    http://palladion.com
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFFrPGX+gerLs4ltQ4RApHKAJ9PnPRcmL7N7MOKI1bW6q1HwaCjnwCgsGAL
tGMTk0ARm8WErOeoSEFfOGk=
=tr4C
-----END PGP SIGNATURE-----