Hi,
After the last comments, I'm still confused.
Can somebody confirm my understanding with this
practical example? I wish to pass the following
*text*, "éléphant", to "wxPhoenix".
If I'm passing "éléphant" as
1) "éléphant", type 'str', coding cp1252, iso-8859-1,
iso-8859-15, cp850 or mac-roman. This will fail because
"éléphant"
- is not an ascii byte string
- is not a utf-8 byte string
- is not a Python 'unicode' type
2) "\xc3\xa9l\xc3\xa9phant", type 'str', utf-8.
Success because
- it is a type 'str'
- it is a utf-8 byte string
3) u"éléphant", type 'unicode' (the source coding does not matter).
Success because
- it is a Python 'unicode' type
4) "\x00\xe9\x00l\x00\xe9\x00p\x00h\x00a\x00n\x00t",
type 'str', utf-16-be
It fails because it
- is not an ascii byte string
- is not a utf-8 byte string
- is not a Python 'unicode' type
5) "\x00\x00\x00\xe9\x00\x00\x00l\x00\x00\x00\xe9\x00\x00\x00p"
"\x00\x00\x00h\x00\x00\x00a\x00\x00\x00n\x00\x00\x00t",
type 'str', utf-32-be
It fails for the same reasons as 4).
Note : an ascii byte string == a string containing only
bytes that represent valid ascii "code points" /
characters.
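To make the cases concrete, here is a minimal sketch
(Python 2); it assumes the acceptance test is nothing more
than a .decode() with the 'ascii' or 'utf-8' codecs, which
is my assumption, not necessarily what "wxPhoenix" really
does internally:

cases = [
    '\xe9l\xe9phant',          # case 1: the word as cp1252 bytes
    '\xc3\xa9l\xc3\xa9phant',  # case 2: the word as utf-8 bytes
    u'\xe9l\xe9phant',         # case 3: a Python 'unicode' object
]
for s in cases:
    if isinstance(s, unicode):
        print 'ok (already unicode):', repr(s)
        continue
    for coding in ('ascii', 'utf-8'):
        try:
            print 'ok (%s):' % coding, repr(s.decode(coding))
        except UnicodeDecodeError:
            print 'fails (%s):' % coding, repr(s)

Cases 4 and 5 behave like case 1: utf-16-be / utf-32-be byte
strings are neither valid ascii nor valid utf-8.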
=== [ not/less "wxPhoenix" related stuff ] ===============
Chris Barker
I would feel differently if (and I show my American-English-centric
ignorance here)
This is always a little bit the problem when discussing
character codings. For most "American-English-centric"
people, unicode == utf-8 == ascii. Unfortunately, this
is a wrong understanding.
The usage of the type 'unicode' in Python 2 is very
common among non "American-English-centric" users
(I belong to them). We use the 'unicode' type
intensively, not because of modernity, but because
it is the only reliable way in Python 2 to manipulate text.
The 'unicode' type is the pivot for all coding
conversions. A short example again with the word
"éléphant". If I wish to use this word in a suitable
form, I need to create a 'unicode' object first.
1) "éléphant" in an encoded source (file, database, input,
editor, gui, ...)
2) u = unicode("éléphant", "source coding")
3) u.encode("target coding") (file, database, output,
gui, ...)
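Concretely, a minimal sketch of this pivot (the source coding
cp1252 and the target coding utf-8 are just my choices for the
illustration):

s = '\xe9l\xe9phant'        # 1) the word as it lives in a cp1252 source
u = unicode(s, 'cp1252')    # 2) decode: bytes -> 'unicode' pivot
out = u.encode('utf-8')     # 3) encode: pivot -> target bytes
print repr(out)             # '\xc3\xa9l\xc3\xa9phant'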
So I suggest that wxPython attempt to do an ascii -> wxString conversion,
and raise an exception.
This may be a solution, but you may fall into the usual
annoying Python 2 trap: .encode() on a byte string first
decodes it implicitly with the ascii codec.
# logically ok
>>> 'abc'.encode('utf-8')
'abc'
# logically fails: the implicit ascii decode chokes on byte 0xe9
>>> 'abcé'.encode('utf-8')
Traceback (most recent call last):
  File "<psi last command>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in
position 3: ordinal not in range(128)
# but this is ok: decode explicitly first
>>> s = 'abcé'
>>> u = unicode(s, 'cp1252')
>>> print u.encode('utf-8')
abcé
>>> repr(u.encode('utf-8'))
'abc\xc3\xa9'
In our work here, we have an ugly mess of text files in ascii,
mac-roman, microsoft-roman (or whatever they call that), and latin-1
Yes, it may be a mess. Python really shines at handling all
these codings. The usual problem is not on the side of the
misc. codings; the problem is that people do not understand
all this coding stuff, ... when they are even aware a text
lives in an "encoded form".
huh? If you are thinking about unicode, you should be using unicode
objects anyway, rather than strings with utf-8 in them. As has been
pointed out the encoding/decoding should happen on I/O, period.
Correct. See my example above. If you wish to live in a
unicode world, you have to use unicode. And even in Python 2,
the only way to achieve that is to use 'unicode' types exclusively.
That's why I'm firmly convinced "wxPhoenix" should handle/pass
solely Python 'unicode' type strings (and not support encoded
forms).
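As a small sketch of "encoding/decoding on I/O, period" (the
file names and the utf-8 coding here are assumptions for the
illustration, not a wxPython API):

import codecs
# decode once, at the input boundary
with codecs.open('in.txt', 'r', encoding='utf-8') as f:
    text = f.read()          # type 'unicode' from here on
# ... all the work happens on 'unicode' objects, no conversions ...
# encode once, at the output boundary
with codecs.open('out.txt', 'w', encoding='utf-8') as f:
    f.write(text)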
(BTW, you are a little bit contradicting youself ...)
Ben Morgan
(I've had times in the past where utf-8 worked on GTK
but looked bad on MSW).
The coding of the characters is a domain per se and
it is completely independent from any platform. Sure,
every platform offers its "preferred encoding", but
basically the coding has nothing to do with the platform.
UTF-8 is very convenient. In my case, one of the main
libraries I use uses utf-8 throughout.
You are confusing unicode and utf-8. utf-8 is a good
encoding for streamed texts. For string manipulations,
it is a catastrophe, the worst of all existing codings.
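A two-line illustration of why, counting and indexing the
utf-8 bytes of "éléphant" (my own example):

s = '\xc3\xa9l\xc3\xa9phant'   # "éléphant" as utf-8 bytes
print len(s), repr(s[0])       # 10 '\xc3'  (10 bytes, half an "é")
u = s.decode('utf-8')
print len(u), repr(u[0])       # 8 u'\xe9'  (8 characters, one full "é")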
Python uses ucs2/ucs4 as its internal unicode coding.
As far as I know, Microsoft, Java, XeTeX are using utf-16.
gcc and the libs use ucs4 (a 4-byte wchar_t); (I'm not a C
specialist).
making it require unicode means one more format conversion
each way
No, it is the opposite. If you are working with 'unicode'
types, there is no conversion at all. You are not thinking
Unicode, but you are taking the problem from the other (wrong)
side:
"I have a byte string (utf-8), now I should create a unicode,
why?"
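A sketch of the conversion count, assuming a unicode-only API
(api_takes_unicode is an invented name for the illustration):

def api_takes_unicode(u):
    assert isinstance(u, unicode)

u = u'\xe9l\xe9phant'
api_takes_unicode(u)                    # zero conversions
s = '\xc3\xa9l\xc3\xa9phant'            # utf-8 byte string
api_takes_unicode(s.decode('utf-8'))    # one conversion, made explicit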
Matthias
Now one can argue whether it's worth making a change which breaks almost
all applications out there or if it's not worth it. That's part of the
reason why Robin posted the proposal here I think.
The holy compatibility. Face it, the unicode world is not compatible
with the byte string world (except for American people). This is
a fact and you cannot escape from it.
See the numerous discussions on the Python dev mailing
regarding Python 2 / Python 3.
-----
Personal experience from a TeX user. I dropped LaTeX in favour
of XeTeX (the new fully unicode-compliant TeX engine, incompatible
with LaTeX). I had to work in unicode, to think unicode, and I
do not regret this move.
jmf