Hi,
After the last comments, I'm still confused.
Can somebody confirm my understanding with this
practical example? I wish to pass the following
*text*, "éléphant", to "wxPhoenix".
If I'm passing "éléphant" as
1) "éléphant", type 'str', coding cp1252, iso-8859-1,
iso-8859-15, cp850 or mac-roman. This will fail because
"éléphant"
- is not an ascii byte string
- is not a utf-8 byte string
- is not a Python 'unicode' type
2) "\xc3\xa9l\xc3\xa9phant", type 'str', utf-8.
Success because
- it is a type 'str'
- it is a utf-8 byte string
3) u"éléphant", type 'unicode' (the source coding does not matter).
Success because
- it is a Python 'unicode' type
4) "\x00\xe9\x00l\x00\xe9\x00p\x00h\x00a\x00n\x00t",
type 'str', utf-16-be
It fails because it
- is not an ascii byte string
- is not a utf-8 byte string
- is not a Python 'unicode' type
5) "\x00\x00\x00\xe9\x00\x00\x00l\x00\x00\x00\xe9\x00\x00\x00p"
"\x00\x00\x00h\x00\x00\x00a\x00\x00\x00n\x00\x00\x00t",
type 'str', utf-32-be
It fails for the same reasons as 4).
Note : an ascii byte string == a string containing only
bytes that represent valid ascii "code points" /
characters.
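To make the cases concrete, here is a minimal sketch
(Python 2); it assumes the acceptance test is nothing more
than a .decode() with the 'ascii' or 'utf-8' codecs, which
is my assumption, not necessarily what "wxPhoenix" really
does internally:

cases = [
    '\xe9l\xe9phant',          # case 1: the word as cp1252 bytes
    '\xc3\xa9l\xc3\xa9phant',  # case 2: the word as utf-8 bytes
    u'\xe9l\xe9phant',         # case 3: a Python 'unicode' object
]
for s in cases:
    if isinstance(s, unicode):
        print 'ok (already unicode):', repr(s)
        continue
    for coding in ('ascii', 'utf-8'):
        try:
            print 'ok (%s):' % coding, repr(s.decode(coding))
        except UnicodeDecodeError:
            print 'fails (%s):' % coding, repr(s)

Cases 4 and 5 behave like case 1: utf-16-be / utf-32-be byte
strings are neither valid ascii nor valid utf-8.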
=== [ not/less "wxPhoenix" related stuff ] ===============
Chris Barker
I would feel differently if (and I show my American-English-centric
ignorance here)
This is always a little bit the problem when discussing
character codings. For most "American-English-centric"
people, unicode == utf-8 == ascii. Unfortunately, this
is a wrong understanding.
The usage of the type 'unicode' in Python 2 is very
common among non "American-English-centric" users
(I belong to them). We use the 'unicode' type
intensively, not because of modernity, but because
it is the only reliable way in Python 2 to manipulate text.
The 'unicode' type is the pivot for all coding
conversions. A short example again with the word
"éléphant". If I wish to use this word in a suitable
form, I need to create a 'unicode' object first.
1) "éléphant" in an encoded source (file, database, input,
editor, gui, ...)
2) u = unicode("éléphant", "source coding")
3) u.encode("target coding") (file, database, output,
gui, ...)
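Concretely, a minimal sketch of this pivot (the source coding
cp1252 and the target coding utf-8 are just my choices for the
illustration):

s = '\xe9l\xe9phant'        # 1) the word as it lives in a cp1252 source
u = unicode(s, 'cp1252')    # 2) decode: bytes -> 'unicode' pivot
out = u.encode('utf-8')     # 3) encode: pivot -> target bytes
print repr(out)             # '\xc3\xa9l\xc3\xa9phant'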
So I suggest that wxPython attempt to do an ascii -> wxString conversion,
and raise an exception.
This may be a solution, but you may fall into the usual
annoying Python 2 trap: .encode() on a byte string first
decodes it implicitly with the ascii codec.
# logically ok
>>> 'abc'.encode('utf-8')
'abc'
# logically fails: the implicit ascii decode chokes on byte 0xe9
>>> 'abcé'.encode('utf-8')
Traceback (most recent call last):
  File "<psi last command>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in
position 3: ordinal not in range(128)
# but this is ok: decode explicitly first
>>> s = 'abcé'
>>> u = unicode(s, 'cp1252')
>>> print u.encode('utf-8')
abcé
>>> repr(u.encode('utf-8'))
'abc\xc3\xa9'
In our work here, we have an ugly mess of text files in ascii,
mac-roman, microsoft-roman (or whatever they call that), and latin-1
Yes, it may be a mess. Python really shines at handling all
these codings. The usual problem is not on the side of the
misc. codings; the problem is that people do not understand
all this coding stuff, ... when they are even aware a text
lives in an "encoded form".
huh? If you are thinking about unicode, you should be using unicode
objects anyway, rather than strings with utf-8 in them. As has been
pointed out the encoding/decoding should happen on I/O, period.
Correct. See my example above. If you wish to live in a
unicode world, you have to use unicode. And even in Python 2,
the only way to achieve that is to use 'unicode' types exclusively.
That's why I'm firmly convinced "wxPhoenix" should handle/pass
solely Python 'unicode' type strings (and not support encoded
forms).
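As a small sketch of "encoding/decoding on I/O, period" (the
file names and the utf-8 coding here are assumptions for the
illustration, not a wxPython API):

import codecs
# decode once, at the input boundary
with codecs.open('in.txt', 'r', encoding='utf-8') as f:
    text = f.read()          # type 'unicode' from here on
# ... all the work happens on 'unicode' objects, no conversions ...
# encode once, at the output boundary
with codecs.open('out.txt', 'w', encoding='utf-8') as f:
    f.write(text)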
(BTW, you are a little bit contradicting youself ...)
Ben Morgan
(I've had times in the past where utf-8 worked on GTK
but looked bad on MSW).
The coding of the characters is a domain per se and
it is completely independent from any platform. Sure,
every platform offers its "preferred encoding", but
basically the coding has nothing to do with the platform.
UTF-8 is very convenient. In my case, one of the main
libraries I use uses utf-8 throughout.
You are confusing unicode and utf-8. utf-8 is a good
encoding for streamed texts. For string manipulations,
it is a catastrophe, the worst of all existing codings.
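A two-line illustration of why, counting and indexing the
utf-8 bytes of "éléphant" (my own example):

s = '\xc3\xa9l\xc3\xa9phant'   # "éléphant" as utf-8 bytes
print len(s), repr(s[0])       # 10 '\xc3'  (10 bytes, half an "é")
u = s.decode('utf-8')
print len(u), repr(u[0])       # 8 u'\xe9'  (8 characters, one full "é")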
Python uses ucs2/ucs4 as its internal unicode coding.
As far as I know, Microsoft, Java, XeTeX are using utf-16.
gcc and the libs use ucs4 (a 4-byte wchar_t); (I'm not a C
specialist).
making it require unicode means one more format conversion
each way
No, it is the opposite. If you are working with 'unicode'
types, there is no conversion at all. You are not thinking
Unicode, but you are taking the problem from the other (wrong)
side:
"I have a byte string (utf-8), now I should create a unicode,
why?"
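A sketch of the conversion count, assuming a unicode-only API
(api_takes_unicode is an invented name for the illustration):

def api_takes_unicode(u):
    assert isinstance(u, unicode)

u = u'\xe9l\xe9phant'
api_takes_unicode(u)                    # zero conversions
s = '\xc3\xa9l\xc3\xa9phant'            # utf-8 byte string
api_takes_unicode(s.decode('utf-8'))    # one conversion, made explicit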
Matthias
Now one can argue whether it's worth making a change which breaks almost
all applications out there or if it's not worth it. That's part of the
reason why Robin posted the proposal here I think.
The holy compatibility. Face it, the unicode world is not compatible
with the byte string world (except for American people). This is
a fact and you cannot escape from it.
See the numerous discussions on the Python dev mailing
regarding Python 2 / Python 3.
-----
Personal experience from a TeX user. I dropped LaTeX in favour
of XeTeX (the new fully unicode-compliant TeX engine, incompatible
with LaTeX). I had to work in unicode, to think unicode, and I
do not regret this move.
jmf