[Checkins] SVN: zope.mimetype/trunk/src/zope/mimetype/typegetter.
Changed zope.mimetype.typegetter.charsetGetter to look at the
data and try to
Marius Gedminas
marius at pov.lt
Tue May 30 12:10:53 EDT 2006
Log message for revision 68395:
Changed zope.mimetype.typegetter.charsetGetter to look at the data and try to
recognize ASCII, UTF-8, UTF-16BE, UTF-16LE when the charset is not specified in
the content type. Changed it to actually return None (as ICharsetGetter
documents) instead of 'ascii' when the charset cannot be determined. No tests
were broken by this change, hope it's OK. ;)
Changed:
U zope.mimetype/trunk/src/zope/mimetype/typegetter.py
U zope.mimetype/trunk/src/zope/mimetype/typegetter.txt
-=-
Modified: zope.mimetype/trunk/src/zope/mimetype/typegetter.py
===================================================================
--- zope.mimetype/trunk/src/zope/mimetype/typegetter.py 2006-05-30 10:40:18 UTC (rev 68394)
+++ zope.mimetype/trunk/src/zope/mimetype/typegetter.py 2006-05-30 16:10:52 UTC (rev 68395)
@@ -16,6 +16,7 @@
# back into that, or this package could provide a replacement.
#
import mimetypes
+import codecs
from zope import interface
from zope.mimetype import interfaces
@@ -113,7 +114,22 @@
         except ValueError:
             pass
         else:
-            return params.get("charset") or "ascii"
-    return "ascii"
+            if params.get("charset"):
+                return params["charset"]
+    if data:
+        if data.startswith(codecs.BOM_UTF16_LE):
+            return 'utf-16le'
+        elif data.startswith(codecs.BOM_UTF16_BE):
+            return 'utf-16be'
+        try:
+            unicode(data, 'ascii')
+            return 'ascii'
+        except UnicodeDecodeError:
+            try:
+                unicode(data, 'utf-8')
+                return 'utf-8'
+            except UnicodeDecodeError:
+                pass
+    return None
interface.directlyProvides(charsetGetter, interfaces.ICharsetGetter)
Modified: zope.mimetype/trunk/src/zope/mimetype/typegetter.txt
===================================================================
--- zope.mimetype/trunk/src/zope/mimetype/typegetter.txt 2006-05-30 10:40:18 UTC (rev 68394)
+++ zope.mimetype/trunk/src/zope/mimetype/typegetter.txt 2006-05-30 16:10:52 UTC (rev 68395)
@@ -211,3 +211,38 @@
determined from the filename using information from the uploaded
data. The specific approach taken by the extractor is not part of the
interface, however.
+
+
+`charsetGetter()`
+~~~~~~~~~~~~~~~~~
+
+If you're interested in the character set of textual data, you can use
+the `charsetGetter` function (which can also be registered as the
+`ICharsetGetter` utility).
+
+The simplest case is when the character set is already specified in the
+content type.
+
+ >>> typegetter.charsetGetter(content_type='text/plain; charset=mambo-42')
+ 'mambo-42'
+
+If it isn't, `charsetGetter` can try to guess by looking at the actual data:
+
+ >>> typegetter.charsetGetter(content_type='text/plain', data='just text')
+ 'ascii'
+
+ >>> typegetter.charsetGetter(content_type='text/plain', data='\xe2\x98\xba')
+ 'utf-8'
+
+ >>> import codecs
+ >>> typegetter.charsetGetter(data=codecs.BOM_UTF16_BE + '\x12\x34')
+ 'utf-16be'
+
+ >>> typegetter.charsetGetter(data=codecs.BOM_UTF16_LE + '\x12\x34')
+ 'utf-16le'
+
+If the character set cannot be determined, `charsetGetter` returns None.
+
+ >>> typegetter.charsetGetter(content_type='text/plain', data='\xff')
+ >>> typegetter.charsetGetter()
+
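
For readers on Python 3, where `unicode()` no longer exists, the detection
order in the patch above can be approximated as follows. This is only a
sketch: `guess_charset` and its `declared` parameter are hypothetical names
that are not part of zope.mimetype, and content-type parsing is left out, so
the caller is assumed to pass the already-extracted charset parameter (if any).

```python
import codecs


def guess_charset(data=None, declared=None):
    """Hypothetical sketch of the detection order used in the patch.

    `declared` stands in for the charset parameter of an already parsed
    content type; `data` is expected to be a bytes object.
    """
    # An explicitly declared charset always wins.
    if declared:
        return declared
    if data:
        # A UTF-16 byte order mark identifies the encoding directly.
        if data.startswith(codecs.BOM_UTF16_LE):
            return 'utf-16le'
        if data.startswith(codecs.BOM_UTF16_BE):
            return 'utf-16be'
        # Try the stricter encoding first: pure ASCII is also valid UTF-8.
        for name in ('ascii', 'utf-8'):
            try:
                data.decode(name)
            except UnicodeDecodeError:
                continue
            return name
    # Nothing matched: report "unknown" rather than guessing.
    return None
```

ASCII is tried before UTF-8 because every ASCII byte string also decodes as
UTF-8; checking in the other order would never report 'ascii'.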