[Zope3-dev] Make AbsoluteURL produce quoted urls

Stuart Bishop stuart at stuartbishop.net
Tue Jun 1 09:24:23 EDT 2004



On 01/06/2004, at 7:09 AM, Bjorn Tillenius wrote:

>> Shouldn't these URLs only be quoted if they are put in a
>> src or href attribute? Only being able to output Unicode URLs
>> as encoded ASCII rather defeats the purpose of them.
>
> Yes, that's right. Do you have a use case for returning unicode URLs
> instead of ASCII ones? If we return unicode, we have to change every
> use of absolute_url in Zope3. That is, change '<a tal:attributes="href
> context/@@absolute_url">' to something where you quote the URL first.
> I think that justifies returning quoted ASCII URLs; if you want
> something else, you should do something extra.

As Philip said, to display them to the user. Sounds like we have the
following use cases:

	As a clickable link, or embedded object:
	<a tal:attributes="href urlquoted_absoluteurl" />

	As text included in XML or HTML:
	<span tal:content="htmlquoted_absoluteurl" />

	As data we need to manipulate:

	print unicode_absoluteurl.encode('utf8')
	print (unicode_absoluteurl + '/@@whatever.html').encode('utf8')
	print (unicode_absoluteurl + u'/\N{COPYRIGHT SIGN}.html').encode('utf8')
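
For illustration, something like this (a minimal sketch; the value of
unicode_absoluteurl is made up, and a non-ASCII domain name would also
need IDNA, which I get to below):

 >>> import urllib, cgi
 >>> unicode_absoluteurl = u'http://example.org/caf\xe9'
 >>> # href/src attribute: %xx-quote the UTF-8 bytes, leaving URL syntax alone
 >>> urllib.quote(unicode_absoluteurl.encode('utf8'), safe='/:?=&')
'http://example.org/caf%C3%A9'
 >>> # text in HTML/XML: escape only markup-significant characters
 >>> cgi.escape(unicode_absoluteurl, quote=True)
u'http://example.org/caf\xe9'
 >>> # data to manipulate: keep the Unicode string, encode at the last moment
 >>> print (unicode_absoluteurl + u'/@@whatever.html').encode('utf8')
http://example.org/café/@@whatever.html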

>> Actually, I would have thought they would not need to be encoded
>> anywhere, as the browser takes care of it. I know this is the case
>> for Unicode domain names with the major browsers (see
>> http://images.stuartbishop.net/idna.html for an example,
>> although it is using an obscure Unicode character that is
>> not present in many fonts so may look a little wonky if you
>> aren't on a Mac. Should still work though.).
>
> See: http://www.ietf.org/rfc/rfc2718.txt, Section 2.2.5
> There it states the URL should be quoted; do we want to break the
> standards? Also, let's say we give an object a name containing '?':
> this won't work unless we quote the URL (we can give the object the
> name, but we won't be able to traverse to it).
>
> Just to clarify things, when I'm talking about names, I mean the names
> of the objects being traversed. I don't think we should use idna for
> non-domain names.

IDNA is just for the domain name portion of the URL. I've just done
some tests and, like email addresses, you need to use multiple
encoding mechanisms to convert a Unicode URL (protocol + domain + path)
into an ASCII string:
	protocol part is ASCII (or %xx quoted UTF8?)
	domain part is IDNA
	path part is %xx quoted (by path I mean everything after the domain)

This is very important - it means we can't just use a generic mechanism
to convert a Unicode URL to an ASCII representation. We have to use
a specific mechanism that splits the URL into components, encodes them
separately, and glues it back together again. If we don't do it
correctly now, then Zope3 will never support Unicode domain names except
in their legacy ASCII encodings, which is the case with Zope2.
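
For instance, IDNA and %xx quoting give incompatible results for the
same characters (a minimal interactive sketch; the label 'ol\xe9' is
just an example):

 >>> import urllib
 >>> u'ol\xe9'.encode('idna')                  # domain label: IDNA/punycode
'xn--ol-cja'
 >>> urllib.quote(u'ol\xe9'.encode('utf8'))    # path segment: %xx-quoted UTF-8
'ol%C3%A9'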

(This will also need to be addressed in Python, but that is a discussion
for the Web-SIG).

>> Also, on the subject of quoting URLs, do you think Unicode
>> domain names in a URL should be encoded using domain.encode('idna')
>> or using %xx notation? I suspect IDNA. If absoluteurl returns a
>> Unicode string, there will need to be a mechanism provided to
>> convert it to ASCII, as it will be non trivial (since the URL will
>> need to be split apart and the different components encoded
>> separately). I've got a similar conversion tool available
>> at http://www.stuartbishop.net/Software/EmailAddress which
>> converts Unicode email addresses to ASCII.

As a concrete example (u'\xe9' == u'\N{LATIN SMALL LETTER E WITH ACUTE}'):

	u'http://www.ol\xe9.de/rene\xe9.html'
	== http://www.olé.de/reneé.html
	== http://www.xn--ol-cja.de/rene%C3%A9.html
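
A rough sketch of how that conversion might be done with the standard
library (the name encode_url is mine, the handling of the query and
fragment parts is an assumption, and ports or userinfo in the domain
part are ignored):

 >>> import urllib, urlparse
 >>> def encode_url(unicode_url):
...     # Illustrative only: split the URL, encode each component with
...     # its own mechanism, then glue it back together.
...     scheme, netloc, path, query, fragment = urlparse.urlsplit(unicode_url)
...     netloc = netloc.encode('idna')             # domain part: IDNA
...     path = urllib.quote(path.encode('utf8'))   # path part: %xx-quoted UTF-8
...     query = urllib.quote(query.encode('utf8'), safe='=&')
...     return urlparse.urlunsplit((scheme.encode('ascii'), netloc,
...                                 path, query, fragment.encode('ascii')))
...
 >>> encode_url(u'http://www.ol\xe9.de/rene\xe9.html')
'http://www.xn--ol-cja.de/rene%C3%A9.html'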

<offtopic alert>

Oh dear. From a quick scan of RFC 2718, I don't think it addresses all
the issues. In particular, it doesn't mention what normalization
mechanism (if any) to use before encoding the Unicode to UTF-8.

 >>> import urllib, unicodedata
 >>> path = u'/\N{LATIN CAPITAL LETTER C WITH CEDILLA}/page_\N{ROMAN NUMERAL ONE}.html'
 >>> path
u'/\xc7/page_\u2160.html'
 >>> for norm in ('NFC', 'NFKC', 'NFD', 'NFKD'):
...     print urllib.quote(unicodedata.normalize(norm, path).encode('utf8'))
...
/%C3%87/page_%E2%85%A0.html
/%C3%87/page_I.html
/C%CC%A7/page_%E2%85%A0.html
/C%CC%A7/page_I.html
 >>>

Hopefully, browsers will preserve the Unicode string as sent. I suspect,
however, that they will actually store it as a Unicode string rather
than keeping it as encoded ASCII, which may end up with the
normalization being changed (as I think many Unicode libraries
implicitly normalize Unicode strings so that common operations like
string comparisons work as expected). This is also an issue if people
type in a Unicode URL: I think we will get a string in a
platform-independent normalization form.

Hmm... this has a large impact on Zope3 internals. The publishing
machinery will need to normalize any UTF-8 encoded paths it receives,
and all Unicode strings that might end up being used as a path component
will need to be normalized using the same mechanism. Better yet, run
them all through a stringprep profile
(http://www.faqs.org/rfcs/rfc3454.html), similar to how IDNA handles
this.
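
Not existing Zope3 code, just a sketch of the idea (the name
prepare_segment and the choice of NFC are my own assumptions): the
publisher would decode each path segment and normalize it, so that
traversal lookups compare equal whatever form the client sent:

 >>> import urllib, unicodedata
 >>> def prepare_segment(segment):
...     # Hypothetical: decode a %xx-quoted UTF-8 segment to Unicode, then
...     # normalize so equivalent spellings become the same string.
...     if not isinstance(segment, unicode):
...         segment = urllib.unquote(segment).decode('utf8')
...     return unicodedata.normalize('NFC', segment)
...
 >>> prepare_segment('C%CC%A7') == prepare_segment(u'\xc7')
True

(A fuller version would push the segment through a stringprep profile,
e.g. the nameprep profile used by Python's idna codec, rather than a
bare normalize(), but NFC is enough to show the idea.)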

If this isn't making sense, in a nutshell: we need to remove the
ambiguity where the following Unicode strings render identically but
compare as different:

 >>> from unicodedata import normalize
 >>> c = u'\N{LATIN CAPITAL LETTER C WITH CEDILLA}'
 >>> normalize('NFC', c)
u'\xc7'
 >>> normalize('NFD', c)
u'C\u0327'
 >>> print normalize('NFC', c).encode('utf8')
Ç
 >>> print normalize('NFD', c).encode('utf8')
Ç
 >>> normalize('NFC', c) == normalize('NFD', c)
False
 >>>


-- 
Stuart Bishop <stuart at stuartbishop.net>
http://www.stuartbishop.net/



