[ZCM] [ZC] 1474/ 3 Comment "PageTemplateFile opens XML files in binary mode"

Tue Oct 5 13:02:40 EDT 2004

Issue #1474 Update (Comment) "PageTemplateFile opens XML files in binary mode"
 Status Pending, Zope/bug medium
To followup, visit:
  http://collector.zope.org/Zope/1474

==============================================================
= Comment - Entry #3 by fdrake on Oct 5, 2004 1:02 pm

Entry #2 by yuppie on Oct 5, 2004 12:34 pm:
> This is a problem in CMFSetup. CMFSetup creates XML using
>   PageTemplateFiles. These files are checked in to CVS in text mode. So
>   depending on the platform, they contain different newlines. If opened as
>   text file, these newlines are normalized to LF. But opened as binary
>   files, newlines are not normalized. Normalizing could be done at a later
>   point, but that's not the case. So line breaks are not normalized before
>   parsing, but the parser expects LF newlines.

Are you actually observing CR characters in the generated output?  If so, are you certain they originate in the template data rather than in some other source (inserted content, for example)?

XML input is parsed by Expat; I'd be very interested in learning of failures of Expat to properly normalize input data.
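
One way to check Expat in isolation, as a minimal sketch: it assumes Python 3 and the standard xml.parsers.expat binding, and parse_text is a hypothetical helper written for this illustration, not anything in Zope. It feeds Expat raw bytes containing all three line-break styles and collects the character data the parser reports.

```python
import xml.parsers.expat


def parse_text(data):
    """Return the character data Expat reports for a raw byte string.

    Hypothetical helper for illustration only; not part of Zope.
    """
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    # Expat may deliver character data in several pieces; collect them all.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(data, True)  # True marks this as the final chunk of input
    return "".join(chunks)


# Raw bytes with CR/LF, bare CR and bare LF line breaks.
mixed = b"<doc>one\r\ntwo\rthree\n</doc>"
print(repr(parse_text(mixed)))
```

If a CR shows up in the printed text, that would point at Expat; if not, the CRs in the output are more likely introduced somewhere else in the pipeline.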

If you can generate a template that exhibits this, please attach it to this tracker issue; the shorter, the better.

Thanks.

________________________________________
= Comment - Entry #2 by yuppie on Oct 5, 2004 12:34 pm

Fred Drake wrote:
> This report isn't clear.  Please update the issue and explain what the
> problem is; glancing at the code on the Zope 2 and Zope 3 trunks, the
> only thing that looks suspicious to me is that re-opening an HTML file
> doesn't use Python's universal newline support.
> 
> HTML is always text, so should be treated that way on input.  XML may
> contain textual content, but should always be handed to the XML parser
> as a raw byte stream to allow the proper decoding machinery a shot at
> doing the right thing.

Let me try to restate the issue:

This is a problem in CMFSetup. CMFSetup creates XML using PageTemplateFiles. These files are checked in to CVS in text mode. So depending on the platform, they contain different newlines. If opened as text file, these newlines are normalized to LF. But opened as binary files, newlines are not normalized. Normalizing could be done at a later point, but that's not the case. So line breaks are not normalized before parsing, but the parser expects LF newlines.

When the parser removes a newline, it removes only the LF and leaves the CR behind. When it adds a newline, it adds a bare LF. Newlines it leaves alone stay as CR/LF. So the returned XML contains all three kinds of newlines: CR, LF and CR/LF.
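
The difference between the two open modes can be sketched as follows. This is an illustration only: it assumes Python 3 semantics, where text mode with newline=None translates CR/LF and CR to LF on every platform, and read_both_ways is a hypothetical helper, not the PageTemplateFile code.

```python
import os
import tempfile


def read_both_ways(raw):
    """Write raw bytes to a temp file, then read it back in both modes.

    Hypothetical helper for illustration; returns (binary, text).
    """
    fd, path = tempfile.mkstemp(suffix=".xml")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(raw)
        # Binary mode: bytes come back exactly as stored, CRs included.
        with open(path, "rb") as f:
            as_binary = f.read()
        # Text mode with universal newlines: CR/LF and CR become LF.
        with open(path, "r", encoding="utf-8", newline=None) as f:
            as_text = f.read()
    finally:
        os.remove(path)
    return as_binary, as_text


binary, text = read_both_ways(b"<a>\r\n  <b/>\r\n</a>\r\n")
print(repr(binary))
print(repr(text))
```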

This is what the XML 1.0 spec says:

"""2.11 End-of-Line Handling

XML parsed entities are often stored in computer files which, for editing convenience, are organized into lines. These lines are typically separated by some combination of the characters CARRIAGE RETURN (#xD) and LINE FEED (#xA).

To simplify the tasks of applications, the XML processor MUST behave as if it normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character."""
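
The translation the spec describes could be done explicitly before parsing, roughly like this. It is a minimal sketch in Python 3; normalize_line_breaks is a hypothetical name and this is not a claim about how PageTemplateFile or Expat implements it.

```python
import re

# CR/LF must be tried before bare CR so the pair collapses to one LF.
_LINE_BREAK = re.compile(r"\r\n|\r")


def normalize_line_breaks(text):
    """Translate #xD #xA and any lone #xD to a single #xA (XML 1.0, 2.11).

    Hypothetical helper for illustration only.
    """
    return _LINE_BREAK.sub("\n", text)


print(repr(normalize_line_breaks("a\r\nb\rc\nd")))
```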
________________________________________
= Request - Entry #1 by yuppie on Aug 19, 2004 11:49 am

This is a problem on Windows. If I read the spec ( http://www.w3.org/TR/2004/REC-xml-20040204/#sec-line-ends ) correctly, Windows newlines are allowed within XML. But PageTemplateFile opens XML files in binary mode, ignoring the fact that they might contain CRs. As a result, the parsed files contain a mix of CR/LF, LF and even CR newlines.

Is there any good reason why this was fixed for HTML, but not for XML files?
==============================================================