Unicode confusion

I agree that it is a confusing concept. I'm pretty clear on the concepts, although I still stumble over the implementations in Python. Still, that isn't going to stop me from giving a speech on the subject.

Unicode is the universal alphabet. It is defined in an ISO standard, ISO/IEC 10646. Every character in every language on Earth (theoretically) exists (or will exist) in Unicode. The Latin e-with-grave-accent character has been assigned the Unicode code point U+00E8. That statement is true worldwide, no matter what keyboard or language or font or encoding you use.
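You can poke at that in the Python 2 interpreter (Python 2 because that is what we are all using here); this is just a sketch of the idea:

    # The code point U+00E8 is an abstract number, independent of any encoding.
    import unicodedata

    e_grave = u'\u00e8'                # Unicode literal for code point U+00E8
    print ord(e_grave)                 # 232, i.e. 0xE8
    print unicodedata.name(e_grave)    # LATIN SMALL LETTER E WITH GRAVE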

The question then becomes, how do we represent the Unicode code point U+00E8 in a computer? The representations are called "encodings". The full Unicode code space doesn't fit in 16 bits, so the most direct representation stores each code point in a 32-bit unit. If you have a string of 32-bit characters where each one is a Unicode code point, you have a UCS-4 encoding.
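Just as an illustration (not how you would really do it; a real program would use a codec), here is what that 32-bit layout looks like if you hand-pack each code point into four big-endian bytes with struct:

    # Hand-rolled UCS-4: every code point becomes a 4-byte integer.
    import struct

    codepoints = [ord(c) for c in u'\u00e8tre']    # [0xE8, 0x74, 0x72, 0x65]
    ucs4_be = ''.join(struct.pack('>I', cp) for cp in codepoints)
    print repr(ucs4_be)
    # '\x00\x00\x00\xe8\x00\x00\x00t\x00\x00\x00r\x00\x00\x00e'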

That encoding tends to be rather wasteful of memory. At this point, there are still few Unicode code points outside of the first "plane" of 65,536 characters, so a 16-bit encoding seems natural. If you have a string of 16-bit characters where each is a Unicode code point truncated to 16 bits, you have a UCS-2 encoding. This is the encoding that Windows refers to as "Unicode", and it is the encoding that you get in Python when you have a variable of type unicode. (Although I believe Python can be compiled to use 4-byte strings as well.)
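If you want to know which flavor your own Python was built with, sys.maxunicode gives a rough answer:

    # Narrow builds cap unicode at 16-bit units; wide builds use 32-bit units.
    import sys

    if sys.maxunicode == 0xFFFF:
        print "narrow build: 16-bit (UCS-2-style) unicode strings"
    else:
        print "wide build: 32-bit (UCS-4) unicode strings"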

If you are working mostly with Latin characters, even this is wasteful of memory. Thus, we have the strong desire to encode strings as 8-bit characters. Here is where the encoding thing gets icky. There are a whole bunch of 8-bit encodings, but the same 8-bit values mean very different things in different encodings. The lowest 128 values usually map to ASCII (which are also the lowest 128 code points in Unicode), but the upper 128 are wild. In the popular iso-8859-1 encoding, called "Latin-1", the e-with-grave-accent is 0xE8, but it is only a coincidence that this is the same as its Unicode code point. In the Mac encoding you mentioned, apparently e-with-grave-accent is 0x8E.

The Python decode method is what converts an 8-bit string to Unicode, but to do that, it must know which encoding the 8-bit string is in. The 8-bit character 0xE8 doesn't mean anything by itself, and it cannot be converted to Unicode. However, the 8-bit character 0xE8 in the iso-8859-1 encoding CAN be converted to Unicode.
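At the Python 2 prompt, that distinction looks something like this sketch:

    # The same byte 0xE8 only acquires a meaning once we name its encoding.
    raw = '\xe8'                            # an 8-bit string: just a byte so far

    print repr(raw.decode('iso-8859-1'))    # u'\xe8' -> e-with-grave, per Latin-1's table
    # raw.decode('ascii') would raise UnicodeDecodeError: 0xE8 is not ASCII at all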

The UTF-8 encoding is an interesting beast. It can represent ANY Unicode code point (even the ones beyond 16 bits) in an 8-bit string. To do that, some characters have to take more than one byte. In UTF-8, a single character can be from 1 to 6 bytes long. The higher the Unicode code point number, the more bytes are required. The actual encoding scheme is in RFC 2279.
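A quick way to see the growth, again just a sketch in Python 2:

    # The number of UTF-8 bytes grows with the code point number.
    for ch in (u'e', u'\u00e8', u'\u20ac'):    # ASCII, Latin-1 range, Euro sign
        encoded = ch.encode('utf-8')
        print repr(ch), '->', repr(encoded), '-', len(encoded), 'byte(s)'
    # u'e'      -> 'e'             - 1 byte(s)
    # u'\xe8'   -> '\xc3\xa8'      - 2 byte(s)
    # u'\u20ac' -> '\xe2\x82\xac'  - 3 byte(s)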

The Latin e-with-grave-accent is represented in UTF-8 by the two-byte sequence \xC3\xA8. Once again, however, this two-byte sequence has no meaning at all, unless we somehow associate it with the UTF-8 encoding.
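You can watch those same two bytes change meaning depending on which encoding we claim they are in (sketch again):

    # C3 A8 is e-with-grave only if we say the bytes are UTF-8.
    pair = '\xc3\xa8'

    print repr(pair.decode('utf-8'))        # u'\xe8'      -> one character
    print repr(pair.decode('iso-8859-1'))   # u'\xc3\xa8'  -> two characters (mojibake)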

That's the basic principle. If you need to represent extended-character strings in a Python program, you have two choices (sketched below):
  * use an 8-bit encoding but SPECIFY that encoding, or
  * use Unicode (UCS-2) literal strings.
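Roughly, the two choices look like this (a sketch; the names are made up for illustration):

    # Choice 1: an 8-bit string, with its encoding stated explicitly.
    raw = 'caf\xe8'                        # bytes; meaningless until we say Latin-1
    as_unicode = raw.decode('iso-8859-1')

    # Choice 2: a Unicode literal.  (If you type the accented character itself
    # instead of the \u escape, add a source-encoding declaration such as
    # "# -*- coding: utf-8 -*-" at the top of the file.)
    literal = u'caf\u00e8'

    print as_unicode == literal            # True: both are the same Unicode string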

···

On Tue, 12 Apr 2005 11:21:51 -0400, Charles Hartman <charles.hartman@conncoll.edu> wrote:

Yes, and it works. Thanks. I'm still not very clear about why it works
(*without* the encode() line) or why the other one worked (*with* it).
I see the general principle, but I'm confused on details. In 'utf8', a
Unicode encoding, e-with-grave seems to be \xc3\xa8, while apparently
as you say "'unicode character number 0xe8' . . . happens to be
'e-grave'." Apparently it's the relation between Unicode and encodings
that I'm still boggling over.

--
- Tim Roberts, timr@probo.com
  Providenza & Boekelheide, Inc.