[Checkins] SVN: zope.mimetype/trunk/src/zope/mimetype/typegetter. Changed zope.mimetype.typegetter.charsetGetter to look at the data and try to

Tue May 30 12:10:53 EDT 2006

Log message for revision 68395:
  Changed zope.mimetype.typegetter.charsetGetter to look at the data and try to
  recognize ASCII, UTF-8, UTF-16BE, UTF-16LE when the charset is not specified in
  the content type.  Changed it to actually return None (as ICharsetGetter
  documents) instead of 'ascii' when the charset cannot be determined.  No tests
  were broken by this change; hope it's OK.  ;)
  
  

Changed:
  U   zope.mimetype/trunk/src/zope/mimetype/typegetter.py
  U   zope.mimetype/trunk/src/zope/mimetype/typegetter.txt

-=-
Modified: zope.mimetype/trunk/src/zope/mimetype/typegetter.py
===================================================================

--- zope.mimetype/trunk/src/zope/mimetype/typegetter.py	2006-05-30 10:40:18 UTC (rev 68394)
+++ zope.mimetype/trunk/src/zope/mimetype/typegetter.py	2006-05-30 16:10:52 UTC (rev 68395)
@@ -16,6 +16,7 @@
 # back into that, or this package could provide a replacement.
 #
 import mimetypes
+import codecs
 
 from zope import interface
 from zope.mimetype import interfaces
@@ -113,7 +114,22 @@
         except ValueError:
             pass
         else:
-            return params.get("charset") or "ascii"
-    return "ascii"
+            if params.get("charset"):
+                return params["charset"]
+    if data:
+        if data.startswith(codecs.BOM_UTF16_LE):
+            return 'utf-16le'
+        elif data.startswith(codecs.BOM_UTF16_BE):
+            return 'utf-16be'
+        try:
+            unicode(data, 'ascii')
+            return 'ascii'
+        except UnicodeDecodeError:
+            try:
+                unicode(data, 'utf-8')
+                return 'utf-8'
+            except UnicodeDecodeError:
+                pass
+    return None
 
 interface.directlyProvides(charsetGetter, interfaces.ICharsetGetter)

Modified: zope.mimetype/trunk/src/zope/mimetype/typegetter.txt
===================================================================
--- zope.mimetype/trunk/src/zope/mimetype/typegetter.txt	2006-05-30 10:40:18 UTC (rev 68394)
+++ zope.mimetype/trunk/src/zope/mimetype/typegetter.txt	2006-05-30 16:10:52 UTC (rev 68395)
@@ -211,3 +211,38 @@
 determined from the filename using information from the uploaded
 data.  The specific approach taken by the extractor is not part of the
 interface, however.
+
+
+`charsetGetter()`
+~~~~~~~~~~~~~~~~~
+
+If you're interested in the character set of textual data, you can use
+the `charsetGetter` function (which can also be registered as the
+`ICharsetGetter` utility).
+
+The simplest case is when the character set is already specified in the
+content type.
+
+  >>> typegetter.charsetGetter(content_type='text/plain; charset=mambo-42')
+  'mambo-42'
+
+If it isn't, `charsetGetter` can try to guess by looking at the actual data.
+
+  >>> typegetter.charsetGetter(content_type='text/plain', data='just text')
+  'ascii'
+
+  >>> typegetter.charsetGetter(content_type='text/plain', data='\xe2\x98\xba')
+  'utf-8'
+
+  >>> import codecs
+  >>> typegetter.charsetGetter(data=codecs.BOM_UTF16_BE + '\x12\x34')
+  'utf-16be'
+
+  >>> typegetter.charsetGetter(data=codecs.BOM_UTF16_LE + '\x12\x34')
+  'utf-16le'
+
+If the character set cannot be determined, `charsetGetter` returns None.
+
+  >>> typegetter.charsetGetter(content_type='text/plain', data='\xff')
+  >>> typegetter.charsetGetter()
+
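
For readers on Python 3, the data-sniffing branch added in this revision can be restated roughly as follows. This is only a sketch: the name `guess_charset` is hypothetical and not part of zope.mimetype, the content-type parameter handling is left out, and `bytes.decode` stands in for the Python 2 `unicode(data, ...)` calls in the diff.

```python
import codecs


def guess_charset(data):
    """Guess a charset name for a bytes object, or return None.

    Hypothetical Python 3 restatement of the data-sniffing branch in the
    diff above; it is not part of zope.mimetype.
    """
    if not data:
        return None
    # A UTF-16 byte order mark settles the question immediately.
    if data.startswith(codecs.BOM_UTF16_LE):
        return 'utf-16le'
    if data.startswith(codecs.BOM_UTF16_BE):
        return 'utf-16be'
    # Try the stricter codec first: every ASCII string is also valid
    # UTF-8, so checking UTF-8 first would never report 'ascii'.
    for charset in ('ascii', 'utf-8'):
        try:
            data.decode(charset)
        except UnicodeDecodeError:
            continue
        return charset
    return None
```

Two properties of this heuristic are worth keeping in mind. The UTF-32 little-endian byte order mark starts with the same two bytes as the UTF-16 little-endian one, so such data would be reported as 'utf-16le'. Also, decoding with 'utf-16le' or 'utf-16be' does not strip the byte order mark, so a caller that decodes with the returned name will see a leading U+FEFF character.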