Unicode Quirks in Django
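Python 2.x's Unicode implementation leaves a lot to be desired. Most problems arise when you move between the str and unicode types: a str is a sequence of 8-bit bytes, while a unicode object is a sequence of Unicode code points (stored internally as 16-bit or 32-bit units, depending on how Python was built). Calling .encode('utf-8') on a unicode object, e.g. u'\u2013t'.encode('utf-8'), works as expected, but the operation is not idempotent: calling .encode('utf-8') again on the resulting str makes Python first decode it implicitly with the ascii codec, which raises a UnicodeDecodeError as soon as it hits a non-ASCII byte.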
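To make the hidden step visible, here is a minimal sketch that spells out the ASCII decode Python 2 performs behind the scenes. The variable names are only illustrative, and the explicit .decode('ascii') call stands in for the implicit conversion; it is written so that it also runs on Python 3, where bytes objects no longer have an .encode() method at all.

```python
# -*- coding: utf-8 -*-
text = u'\u2013t'                 # unicode object: EN DASH followed by "t"
encoded = text.encode('utf-8')    # byte string: '\xe2\x80\x93t'

# Python 2 implements str.encode() by first decoding the bytes with the
# ASCII codec. Writing that step explicitly reproduces the failure.
try:
    encoded.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc                   # 0xe2 is not a valid ASCII byte

# Decoding with the codec the bytes were written in round-trips cleanly.
round_trip = encoded.decode('utf-8')
```

The point of the sketch is that the error comes from the decode Python inserts for you, not from the encode you asked for.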
Here's a great introduction to troubleshooting Unicode issues: http://collective-docs.readthedocs.org/en/latest/troubleshooting/unicode.html
There's also a great slide deck about demystifying Unicode in Python, which should be required reading. It goes into more detail about the complexities of UTF encodings, but it's worth the time: http://farmdev.com/talks/unicode/
The general rules of thumb you'll take away from this talk are 1) decode early, 2) unicode everywhere, and 3) encode late. Django follows this approach closely when writing data to the database. You usually don't need to convert your unicode objects yourself, because the conversion is handled at the database layer: assuming your SQL database is configured properly and your Django settings are correct, Django's database layer handles the unicode to UTF-8 conversion seamlessly. For example, look inside the MySQLdb Python wrapper: right before a query is executed, the entire string is encoded into the specified character set:
    # MySQLdb/cursors.py
    if isinstance(query, unicode):
        query = query.encode(charset)
    if args is not None:
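The three rules of thumb above (decode early, unicode everywhere, encode late) can be sketched in plain Python, independent of Django. The function names read_name, greet and write_out are hypothetical and exist only for this illustration; the sketch assumes the incoming bytes are UTF-8.

```python
# -*- coding: utf-8 -*-

def read_name(raw_bytes):
    """Decode early: turn incoming bytes into unicode at the boundary."""
    return raw_bytes.decode('utf-8')


def greet(name):
    """Unicode everywhere: internal code only ever sees unicode objects."""
    return u'Hello, %s' % name


def write_out(text):
    """Encode late: convert back to bytes only when leaving the program."""
    return text.encode('utf-8')


raw = b'Zo\xc3\xab'               # UTF-8 bytes for "Zoë", e.g. from a socket
output = write_out(greet(read_name(raw)))
```

The final write_out() step plays the same role as the query.encode(charset) call in the database wrapper: the encoding happens once, at the very edge.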
What if you attempt to use logging.info() on Django objects, e.g. logging.info("%s" % User.objects.all()[0])? If you search Stack Overflow, you'll see a recommendation to define a __str__(self) method on your Python classes that calls unicode() and encodes the result as UTF-8:
    def __str__(self):
        return unicode(self).encode('utf-8')
Django's base model definitions (django.db.models.base) also follow this convention:
    def __str__(self):
        if hasattr(self, '__unicode__'):
            return force_unicode(self).encode('utf-8')
        return '%s object' % self.__class__.__name__
Normally, Python picks the result type of a string interpolation automatically: the result is unicode if either the format string or a substituted value is unicode, and str otherwise. Consider these cases:
    >>> print type("%s" % a)
    <type 'str'>
    >>> print type("%s" % 'hey')
    <type 'str'>
    >>> print type("%s" % u'hey')
    <type 'unicode'>
Assuming the character set on your database is set to UTF-8, consider the following example of how Python handles string interpolation for class instances. Python normally performs unicode conversions automatically, but for an instance of a Python class, "%s" in a byte-string format always invokes str() on it.
    class A(object):
        def __init__(self):
            self.tst = u'hello'

        def __str__(self):
            if hasattr(self, '__unicode__'):
                return self.__unicode__().encode('utf-8')
            return 'hey'

        def __unicode__(self):
            return u'hey\u2013t'

    >>> a = A()
    >>> print "%s" % a
    hey–t
    >>> print "%s %s" % (a, a.tst)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 3: ordinal not in range(128)
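In this failing case, formatting a yields a str containing non-ASCII UTF-8 bytes, which is then mixed with a.tst, a unicode object. Python tries to promote the str to unicode using the ascii codec, and that is where the UnicodeDecodeError comes from. One workaround is to have __str__() return a unicode object, as in this snippet from a Python bugs mailing-list thread: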
    # http://www.gossamer-threads.com/lists/python/bugs/842076
    def __str__(self):
        return u'%s object' % self.__class__.__name__
The recommendation is also consistent with this python-dev discussion about how to implement __str__() and __unicode__() methods:
    This was added to make the transition to all Unicode in 3k easier:
    . __str__() may return a string or Unicode object.
    . __unicode__() must return a Unicode object.
    There is no restriction on the content of the Unicode string for __str__().
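Returning to the failing "%s %s" % (a, a.tst) case, another way to sidestep the error, in the spirit of "unicode everywhere", is to use a unicode format string so that the byte-string __str__() is never consulted. The following is a sketch rather than a definitive fix. The PY2 flag and the Python 3 branch are assumptions added only so the same class runs on both major versions; on Python 2 the class behaves like the A class shown earlier.

```python
# -*- coding: utf-8 -*-
import sys

PY2 = sys.version_info[0] == 2    # assumption: helper flag for this sketch


class A(object):
    def __init__(self):
        self.tst = u'hello'

    def __unicode__(self):
        return u'hey\u2013t'

    if PY2:
        # Python 2: __str__ must yield bytes, so encode the unicode form.
        def __str__(self):
            return self.__unicode__().encode('utf-8')
    else:
        # Python 3: str is already unicode, so reuse the same method.
        __str__ = __unicode__


a = A()

# A unicode format string formats `a` through its unicode representation,
# so no implicit ASCII decode of UTF-8 bytes ever takes place.
line = u"%s %s" % (a, a.tst)

# Encode late: convert to bytes only when the text leaves the program.
payload = line.encode('utf-8')
```

With this arrangement the interpolation stays in unicode from start to finish, and the only encode is the explicit one at the end.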