Project Phoenix CFO

Hi,

After the last comments, I'm still confused.
Can somebody confirm my understanding with this
practical example? I wish to pass the following
*text*, "éléphant", to "wxPhoenix".

If I'm passing "éléphant" as

1) "éléphant", type 'str', coding cp1252, iso-8859-1,
iso-8859-15, cp850, or mac-roman. This will fail because
"éléphant"
- is not an ascii byte string
- is not a utf-8 byte string
- is not unicode, i.e. a Python 'unicode' type

Correct.

You've actually highlighted one reason why I think that dropping auto-conversion support for the locale's default encoding is a good idea. There are so many overlaps that some strings can be compatible with multiple encodings, and some programmers may assume that if their test cases work with one then they'll work with the others. But there are also enough differences that bugs always creep in when the locale's default encoding is different from what they've tested with. By officially supporting only unicode, with auto-conversion from ascii and utf-8, I think we will eliminate a lot of potential bugs with no loss of functionality, the only cost being a slightly decreased convenience factor.
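
To make the overlap concrete, here is a small sketch (plain Python written in Python 3 terms, not wxPython code) showing one byte string that decodes cleanly under several single-byte encodings, each giving a different text, while strict utf-8 rejects it immediately:

```python
# 'éléphant' encoded with a single-byte Windows/Latin codec.
data = b"\xe9l\xe9phant"

# Several single-byte codecs happily decode it, but not all agree
# on the resulting text (mac-roman maps 0xe9 to a different letter).
for codec in ("cp1252", "iso-8859-15", "mac-roman"):
    print(codec, "->", data.decode(codec))

# Strict utf-8 decoding fails fast instead of silently guessing:
try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    print("utf-8 rejects it:", exc.reason)
```

This is exactly the failure mode that makes locale-dependent auto-conversion fragile: the test machine's codec happens to work, while a user's codec silently produces different text.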

2) "\xc3\xa9l\xc3\xa9phant", type 'str', utf-8.
Success because
- it is of type 'str'
- it is a utf-8 byte string

Correct.

3) u"éléphant", type 'unicode' (the coding does not matter)
Success because
- it is a Python 'unicode' type

Correct.

4) "\x00\xe9\x00l\x00\xe9\x00p\x00h\x00a\x00n\x00t",
type 'str', utf-16-be
It fails because it
- is not an ascii byte string
- is not a utf-8 byte string
- is not unicode, i.e. a Python 'unicode' type

Correct.

"\x00\x00\x00\xe9\x00\x00\x00l\x00\x00\x00\xe9\x00\x00\x00p\x00\x00\x00h\x00\x00\x00a\x00\x00\x00n\x00\x00\x00t",
type 'str', utf-32-be
It fails for the same reasons as 4).

Correct.

Note: an ascii byte string == a string containing only
bytes that represent valid ascii "code points" / characters.

Yes, in other words the characters represented by the lowest 7 bits.
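
A one-line sketch of that definition (Python 3 syntax; the helper name is made up for illustration):

```python
def is_ascii_bytes(data: bytes) -> bool:
    # "ascii" means every byte fits in the lowest 7 bits (values 0-127).
    return all(b < 128 for b in data)

print(is_ascii_bytes(b"elephant"))        # True
print(is_ascii_bytes(b"\xe9l\xe9phant"))  # False: 0xe9 needs the 8th bit
```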

I've been working on the new code for the wxString conversion a bit this weekend. I'll attach the unittest TestCase I'm using to verify the conversions. I used your example texts from above.

BTW, with the reduced complexity and with some newer functionality in wxString I've reduced the typemap conversion code down to about 30 lines of easily understood C++.

test_string.py (2.55 KB)

···

On 11/20/10 3:49 AM, jmfauth wrote:

--
Robin Dunn
Software Craftsman

Ok, I got it right. I'm still convinced you are taking
the wrong route. See the next message.

···

On Nov 21, 9:40 pm, Robin Dunn <ro...@alldunn.com> wrote:

On 11/20/10 3:49 AM, jmfauth wrote:

I've been working on the new code for the wxString conversion a bit this
weekend. I'll attach the unittest TestCase I'm using to verify the
conversions. I used your example texts from above.

BTW, with the reduced complexity and with some newer functionality in
wxString I've reduced the typemap conversion code down to about 30 lines
of easily understood C++.

--

To unicode or not to unicode

The problem is much simpler than it looks. People wish to make
unicode software, but they refuse to use unicode. That is the source
of all the trouble. They desperately spend their time and effort
creating solutions which look like unicode, smell like unicode, and are
finally and definitely not unicode.

"wxPhoenix" does not escape this rule. It is plainly wrong
for several obvious reasons:

- it does not enforce the usage of unicode.
- it treats "utf-8" as "unicode"; this is plainly wrong.
- it is, or may be, unsafe: what about ill-formed utf-8?
- it is not unicode: why accept utf-8 and not utf-16 or utf-32?
- it is not unicode, simply because a unicode-encoded str is not
  by nature unicode.
- it is an American solution for American users.
- compatibility? The truth: it is only compatible with the code of
  those people who are not able to write unicode at the Python level.
  Most serious and non-ascii users have been using unicode in their
  code for years, just because there is no other way to do it in a
  lot of cases.
- it amplifies the existence of bad and non-working wxPython code
  (applications using the wxPython unicode build are good examples).

Example of a correct and serious unicode library:

>>> def UpperCase(u):
...     if not isinstance(u, unicode):
...         raise TypeError('Sorry, this is a unicode library')
...     return u.upper()
...
>>> UpperCase('éléphant')
Traceback (most recent call last):
  File "<psi last command>", line 1, in <module>
  File "<psi last command>", line 3, in UpperCase
TypeError: Sorry, this is a unicode library
>>> UpperCase(u'éléphant')
ÉLÉPHANT
>>> 'éléphant'.upper()
éLéPHANT

Note: the above code is not only strictly unicode compliant, it is
also using and forcing valuable Unicode (Unicode with an
uppercase "U") features.

I do not know of, and do not see, a better way to make this code
safer, simpler and cleaner.
Oops, I'm wrong. I should use Python 3 and drop the two
lines "if not isinstance..." and "raise...".
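
For what it's worth, a sketch of that Python 3 version (the function name here is just illustrative): every str is already a sequence of code points, so the type check disappears entirely.

```python
def upper_case(u: str) -> str:
    # In Python 3 there is nothing to enforce: str *is* unicode.
    return u.upper()

print(upper_case('éléphant'))  # ÉLÉPHANT
```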

Not only is my code unicode compliant, it is readable.
I prefer to be forced to label a wx.Button with u'éléphant'
rather than to see an allowed 'éléphant' or '\xc3\xa9l\xc3\xa9phant'
- yes, check them, they are valid utf-8 encoded unicodes.
Today (soon the year 2011), asking for human-readable texts
sounds like an unacceptable idea.

My years-old code already belongs to the future. "wxPhoenix", as
presented, not even born, already belongs to the past.

"wxPhoenix" is supposed to target Python 3. I ironically hope that it
will accept bytes and bytearray types in order to be compatible with
the code of those of you who are not using, are not willing to use, or
are not able to use unicode (type 'str' in Python 3).

You should from time to time take a look at French or German fora.
Newbie: Hello, I have problems with accented characters, what should
I do?
Answer: Put a 'u' in front of every string.
Newbie: Thanks, it works.

These inexperienced teenagers are living in a unicode world, and are
not even aware that they are using unicode (even when using wxPython!).

And you, experienced users, what are you doing? You are not even able
to escape from the old byte string world, and are proposing ... see the
top of this message.

···

----

Technically, working with encoded unicodes is practically
impossible. When one wishes, or has, to use unicode, one decodes /
creates a unicode as soon as possible and one encodes it at the last
minute.

Assume I have a var in utf-8 to populate a widget, e.g. the text
of a wx.Button. This wx.Button magically accepts utf-8 (by design).
Now I wish to modify the text of this button. I can get the label of
this button; it will be a unicode. Unfortunately, this is wrong: I
logically have to touch the var, and since this var is in utf-8, the
first thing I have to do to manipulate it is to decode it, that is,
create a unicode. So, why is this var not "natively" a unicode?
You see, if one wants to work in a unicode mode, one *has to* use
unicode.

I clearly have the feeling that most of you have very little
experience when it comes to working in a "unicode mode".

----

PS

>>> sys.version
'2.7 (r27:82525, Jul 4 2010, 09:01:59) [MSC v.1500 32 bit (Intel)]'
>>> import io
>>> with io.open('a.txt', 'w', encoding='ascii') as f:
...     f.write('abc')
...
Traceback (most recent call last):
  File "<psi last command>", line 2, in <module>
TypeError: must be unicode, not str

... probably some code coming from inexperienced Python users ...

PS2

Am I allowed to use marked utf-8 strs in "wxPhoenix", as recognized
by http://unicode.org/ ? Or will it not even be 'utf-8' compliant?

jmf

I think in 2.9 it should NOT enforce it by default.

Enforcing it now will make the transition to 2.9 just harder for people who have not already switched to using only u'somestring' in their code, especially the people who don't have control of all the source code they are using.

Just my 0.02€
Werner

···

On 22/11/2010 07:55, jmfauth wrote:

To unicode or not to unicode

The problem is much simpler than it looks. People wish to make
unicode software, but they refuse to use unicode. That is the source
of all the trouble. They desperately spend their time and effort
creating solutions which look like unicode, smell like unicode, and are
finally and definitely not unicode.

"wxPhoenix" does not escape this rule. It is plainly wrong
for several obvious reasons:

- it does not enforce the usage of unicode.

Hi Werner,

Do you prefer to be forced to pass your German texts in
utf-8 encoded form, or to restrict them to ascii?

Your German *str's* are probably already incompatible with
"wxPhoenix". You will be forced to decode/encode/transcode
them into a suitable form.

That's one point.

Second point: if your code were already in a unicode mode,
it would already be compatible with wxPhoenix.

You see, you can take the problem from whichever side you wish: the
two worlds, unicode and byte string (including encoded unicode),
are by nature incompatible; *Python* developers have understood
this. To take another project, XeTeX/LuaTeX devs have also
understood this. Even the fonts are not compatible between
TeX and XeTeX; in the XeTeX unicode world you have to use
unicode-compatible fonts, and these are fonts using the OpenType
technology.

Werner, you are insisting, rightly, on compatibility.
Now, let's go a little bit into the future and see what may
happen. People will develop their third-party wxPhoenix
libraries, and as this wxPhoenix will accept encoded strings
(ascii / utf-8 encoded unicodes), a part of these developers
will simply not bother with unicode: "cool, wxPhoenix natively
swallows my ascii/unicode code" (BTW, this is what we are already
reading on this mailing list). Finally we, the non-ascii
users, will be stuck again. One more good reason
to enforce unicode.

As I see it, wxPhoenix is simply taking the wrong route.
It is simply not unicode compliant and corresponds only
to a personalized (and wrong) vision of the unicode world.
I am not aware of any core Python library working in that
way, accepting unicode and one form of encoded unicode.

Once again, to summarize: wxPhoenix is neither correctly
compliant with the unicode world, nor with Python in
spirit.

Totally in contradiction with the first message introducing
this thread: "...it clear that the future is Unicode
strings only..."

It's time to stop here. Robin asked for opinions,
I gave mine.

jmf

Hi jmf,

Hi Werner,

Do you prefer to be forced to pass your German texts in
utf-8 encoded form, or to restrict them to ascii?

Your German *str's* are probably already incompatible with
"wxPhoenix". You will be forced to decode/encode/transcode
them into a suitable form.

My Swiss German strings never make it into my soft;-) , all of it is in u'strings' with UK English texts and gets translated using poEdit/gettext into German and French ....., so my own stuff should be o.k. and if not it is a bug in my own stuff.

I was more thinking of people using a library, let's say matplotlib with its wx backend; if one goes your route then this library might not be usable until the matplotlib maintainers update things, but I think Robin's approach would allow it to work.

Anyhow as you said I think the point is made.
Werner

···

On 22/11/2010 16:50, jmfauth wrote:

Hi jmf,

To unicode or not to unicode

The problem is much simpler than it looks. People wish to make
unicode software, but they refuse to use unicode.

For me, the main problem is that using the u-prefix in Python2 is a
pain. I have to do it quite a bit in my Python code, and it is effort
I would not want to extend to people who mostly use ASCII string
literals (i.e. those writing applications with English as the base
language, there has to be a few out there).

Besides, you speak of compatibility. Well, having to use u-prefixed
strings makes it even more difficult to write code that is compatible
with both Python2 and 3.

···

On 22.11.2010 8:55, jmfauth wrote:

On 22.11.2010 17:50, jmfauth wrote:

As I see it, wxPhoenix is simply taking the wrong route.
It is simply not unicode compliant and corresponds only
to a personalized (and wrong) vision of the unicode world.
I am not aware of any core Python library working in that
way, accepting unicode and one form of encoded unicode.

I'm not sure if this counts, but I think Django does exactly
what Robin is suggesting for Project Phoenix:

Regards,
  Jaakko

I was more thinking of people using a library, let's say matplotlib with
its wx backend; if one goes your route then this library might not be
usable until the matplotlib maintainers update things, but I think
Robin's approach would allow it to work.

Python is going exactly this route. That's why I mentioned the
io module. Through its interface, the io module exchanges only
unicodes, uniquely unicodes. I'm pretty sure that in the future one
will see more and more modules working in that way. This also has the
technical advantage of keeping all the encode/decode machinery outside
the library, not inside it. Your unicode should be ready *before*
using a library. In wxPhoenix, for example, a great part of all
the encoding/decoding work is done inside the lib.
That makes the code heavier and unsafe. A great part of my libs
are unicode ready and no one has to handle this encoding stuff.

On the wx-dev mailing list there were even discussions about dropping
libs that do not work this way, like the codecs module (at least
a part of it).

Quickly, on matplotlib: it is up to the lib to be Python compliant,
not the other way around. wxPhoenix is not Python compliant.

···

----

Besides, you speak of compatibility. Well, having to use u-prefixed
strings makes it even more difficult to write code that is compatible
with both Python2 and 3.

In Python 3, str is what unicode is in Python 2. The 2to3 tool
uses exactly that correspondence to do the conversion. As soon as you
have a plain str or an encoded unicode, any tool is lost and cannot
achieve this translation.

A unicode in Python reflects exactly what a unicode is in the sense
of the Unicode Consortium: a chain of code points, a virtual object
which has no real materiality. As soon as you materialize it (into bits,
by encoding), you get into trouble and you cannot use it.

*This is the key point of Unicode, in the sense of the Unicode
Consortium, and this is what most of you are not understanding.*

Exercise: add a € at the end of a unicode and at the
end of any encoded unicode, let's say utf-32-be, for fun.
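
A sketch of that exercise in Python 3 terms: at the code-point level the concatenation is trivial, while at the encoded level the suffix must be encoded with exactly the same codec or the data is corrupted.

```python
text = 'éléphant'

# Code-point level: just concatenate.
print(text + '€')

# Encoded level: the '€' must itself be utf-32-be, four bytes wide;
# appending its utf-8 bytes here would corrupt the data.
encoded = text.encode('utf-32-be')
encoded += '€'.encode('utf-32-be')
print(encoded.decode('utf-32-be'))  # éléphant€
```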

jmf

jmf,

Thanks. It's good to know that there is a precedent for this idea.

···

On 11/22/10 8:59 AM, Jaakko Salli wrote:

I'm not sure if this counts, but I think Django does exactly
what Robin is suggesting for Project Phoenix:

Unicode data | Django documentation | Django

--
Robin Dunn
Software Craftsman

I assume you are talking about the BOM here? If so then yes. I'm using Python's C APIs for converting a string object to a Unicode object, and the utf-8 codec does understand and deal with the BOM.
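
As a point of reference, here is how the codecs behave at the Python level (a sketch; note that the plain utf-8 codec decodes the BOM to the code point U+FEFF, while the utf-8-sig variant strips it):

```python
# b'\xef\xbb\xbf' is the BOM in its utf-8 encoded form.
marked = b'\xef\xbb\xbf' + 'éléphant'.encode('utf-8')

print(repr(marked.decode('utf-8')))      # '\ufefféléphant' (BOM kept as U+FEFF)
print(repr(marked.decode('utf-8-sig')))  # 'éléphant' (BOM stripped)
```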

···

On 11/21/10 10:55 PM, jmfauth wrote:

PS2

Am I allowed to use marked utf-8 strs in "wxPhoenix", as recognized
by http://unicode.org/ ? Or will it not even be 'utf-8' compliant?

--
Robin Dunn
Software Craftsman

Ok, here is my decision:

1. The wxString typemap will accept Unicode objects or String objects for input parameters, and the string objects will be auto-converted to Unicode using the utf-8 codec, raising an exception if there is a decode error.

2. wxString return values or output parameters will always be converted to Python Unicode objects.

3. I'm still undecided on future directions for this, but I am leaning towards these plans:

3a. In some future release raise a deprecation warning when strings are auto-converted.

3b. In some future release after that only allow Unicode objects.

3c. In builds for Python 3.x only allow Unicode objects to be converted to wxStrings, and reserve our use of bytes objects for when we need a raw data buffer or similar. This is in line with the implied standard paradigms for the Python3 language.

I really tried to understand jmf's side of things on this subject but I just can't see the doom and gloom that he seems to be fearful of. We are still using all Unicode internally, but also allowing the convenience of ascii string literals and also the very common utf-8 if that is what the developer prefers and has an editor that supports it. And if the developer prefers working entirely with utf-8 then that is possible too. (This is actually very common on unix-like platforms, at least for C/C++ software. For example the GTK API is all utf-8 byte-strings. And if I understand correctly the CFString class used in the Mac APIs stores utf-8 internally.)

We can encourage people to only deal with non-unicode values at the "I/O Points" as mentioned by Cody, and to use unicode objects at all other places in the application. The Style Guide on the wiki might be a good place for this. But I don't think we'll be able to fully wean developers off of wanting to use string objects, at least not with Python 2.x, and I really don't like to impose my will on other developers by placing artificial roadblocks in places that don't really need it. I much prefer to allow the developers to make their own choices, even if that choice is to be lazy and take the easy route for using strings.
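
The input-side rule in point 1 can be sketched at the Python level like this (a hypothetical helper written with Python 3 types, not the actual C++ typemap):

```python
def to_wx_string(value):
    # Rule 1: Unicode objects pass through untouched.
    if isinstance(value, str):
        return value
    # Byte strings are auto-converted with the strict utf-8 codec;
    # a decode error propagates as an exception, as decided above.
    if isinstance(value, bytes):
        return value.decode('utf-8')
    raise TypeError('expected a string, got %r' % type(value).__name__)

print(to_wx_string('éléphant'))
print(to_wx_string(b'\xc3\xa9l\xc3\xa9phant'))  # utf-8 bytes -> éléphant
```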

···

--
Robin Dunn
Software Craftsman
http://wxPython.org

Hi,
Sorry to come so late in the discussion.

Ok, here is my decision:

I agree with everything, except for the default utf-8 encoding, for two reasons:
- wxWidgets itself proposes transparent conversion from bytes to
unicode, using the default locale, see
http://docs.wxwidgets.org/trunk/overview_unicode.html#overview_unicode_supportin
- utf-8 is almost never used anywhere on Windows.

wxWidgets already has its own idea of a default encoding. Why use
another convention?

Of course this whole discussion will be pointless once wxPython runs
on Python 3.2; IMO this could happen before the "Phoenix" version!

···

2010/11/23 Robin Dunn <robin@alldunn.com>:

--
Amaury Forgeot d'Arc

Long live "wxPhoenix"

···

=================================

On Nov 23, 8:34 am, "Amaury Forgeot d'Arc" <amaur...@gmail.com> wrote:

I agree with everything, except for the default utf-8 encoding, for two reasons:

- utf-8 is almost never used anywhere on Windows.

[Must be taken outside "wxPhoenix"]

Yes, this is the fundamental problem in the coding of characters.
There are only two universal codings, ascii and unicode [*].
As soon as one leaves the ascii world and sticks to the
byte string world, one should consider *all* the codings
as valid, or all of them as invalid.

This is what Python(2) is doing.

[*] unicode is not a coding; I'm just (wrongly) using this term for
convenience.

jmf

Robin Dunn wrote:

Ok, here is my decision:

And to think I'd written most of a long and detailed note about this that I didn't get a chance to send!

Anyway, it all looks good, though I can't help myself:

1. The wxString typemap will accept Unicode objects or String objects for input parameters, and the string objects will be auto-converted to Unicode using the utf-8 codec, raising an exception if there is a decode error.

It's still not clear to me how likely it is that one could have a byte string in another common codec (latin-1, etc.) that would decode successfully as utf-8 but not as the user expected. If that is anything but extremely unlikely, I think ascii is the only way to go -- raising an exception early in the game is much better than errors later on.

I do think accepting ascii is a convenience worth having for the huge amounts of code written for python2, some before the unicode object even existed.

3a. In some future release raise a deprecation warning when strings are auto-converted.

3b. In some future release after that only allow Unicode objects.

3c. In builds for Python 3.x only allow Unicode objects to be converted to wxStrings,

All good.

I really tried to understand jmf's side of things on this subject but I just can't see the doom and gloom that he seems to be fearful of.

It all comes down to my question above.

Ironically, I'm pretty sure JMF continued to be a user and advocate for the ANSI builds, even once the unicode builds became available! Am I remembering that wrong, JMF?

> The very common utf-8 if

that is what the developer prefers and has an editor that supports it.

Is it really that common to use utf-8 in your source code, and thus for string literals? I sure don't (even though my editor does support it) -- I always stick with ascii strings, and the escape sequences for non-ascii characters. It just seems too fragile -- too many editors don't have good unicode support, and there is no standard for specifying the encoding in text files.

Are there really a bunch of wxPythonistas doing this?

Amaury Forgeot d'Arc wrote:

- wxWidgets itself proposes transparent conversion from bytes to
unicode, using the default locale,

I think Robin addressed this -- while the locale idea sounds great, it's really a nightmare if you need to move code and apps between different developers and platforms. Every place I can think of that relies on a default locale has bitten me or someone I work with in the butt.

- utf-8 is almost never used anywhere on Windows.

I'm not sure it matters where it is or isn't used in the OS (filenames, etc). What matters is how it is used or not in wxPython code.

utf-8 is compatible with ASCII -- I think supporting ASCII really is critical, so utf-8 is the only other option.

Of course this whole discussion will be pointless once wxPython runs
on Python 3.2,

yup.

-Chris

···

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

Chris.Barker@noaa.gov

Folks,

It struck me that a key data point here is: what encodings are people using for existing python code? I assumed that non-ascii encoding was pretty rare, but then I saw a post on another python list about something completely different, and there at the top of some sample code was:

# -*- coding: utf-8 -*-

which made me think about it a bit. I also found this PEP:

http://www.python.org/dev/peps/pep-3120/

which proposes making utf-8 the standard for python source code. I can't tell what the status of that is, though.

PEP 8 (the style guide), however, says:

"Code in the core Python distribution should always use the ASCII or
Latin-1 encoding (a.k.a. ISO-8859-1). For Python 3.0 and beyond,
UTF-8 is preferred over Latin-1, see PEP 3120."

which gives a bit of a preference for accepting latin-1, rather than utf-8.

So I thought I'd do a little google experiment to see what is really used:

"-*- coding: utf-8 -*-" py : 3,540,000 results

"-*- coding: latin-1 -*-" py : 41,200 results

"-*- coding: iso-8859-15 -*-" py : 40,300 results

are there other ones to search for?

Anyway, not complete by any means, and the fact that some 2.* interpreters assume latin-1 for the default makes it not entirely fair, but it does give some evidence that utf-8 is the one to support.

-Chris

···

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

Chris.Barker@noaa.gov

- I firmly believe most of you do not understand what we
commonly (and badly) call "The Coding of Characters": the soul
and heart of the codings and, consequently, the usage of these codings.

- You do not realize deeply enough how much better Python 2
reflects this reality than Python 3.

- Chris Barker

Ironically, I'm pretty sure JMF continued to be a user and advocate for
the ANSI builds, even once the unicode builds became available! Am I
remembering that wrong, JMF?

You missed the point. I am deliberately living and sticking in the
clean, coherent and limited cp1252 world (also known as windows-1252).

There is no way today to use wxPython-unicode in a clean manner,
100% coherently. Do you know why? Because of the StyledTextCtrl. And
do you know why the StyledTextCtrl is the bottleneck? Because it
internally uses the utf-8 coding.

- I think I have advocated the usage of Unicode and unicode more than
anybody in this discussion.

- I'm sorry I can't help more and do not know how to help more. In
this kind of discussion, we always fall back on the same problem: the
non-understanding of the codings of characters.

jmf

Last minute, to Chris

The coding of the Python source code has nothing to do
with the codings of the strings used and defined in
this source code.

# -*- coding: rot13 -*-
cevag 'main'
h = h'Guvf vf n havpbqr'
f = 'This is a string'
cevag h
cevag f

Robin Dunn wrote:

Ok, here is my decision:

And to think I'd written most of a long and detailed note about this
that I didn't get a chance to send!

Anyway, it all looks good, though I can't help myself:

1. The wxString typemap will accept Unicode objects or String objects
for input parameters, and the string objects will be auto-converted to
Unicode using the utf-8 codec, raising an exception if there is a
decode error.

It's still not clear to me how likely it is that one could have a byte
string in another common codec (latin-1, etc), that would decode
successfully as utf-8 but not as the user expected.

If I understand correctly it would be fairly rare. For single bytes < chr(128) latin-1 and similar also match ascii values. But multi-byte utf-8 sequences follow a set of rules that help identify invalid sequences. I'm sure it's possible to end up with successful utf-8 decodes of latin-1 text, but my gut feel is that there would be a decode error almost every time. On the other hand, going the other way (utf-8 text --> decoded using latin1) can easily successfully create garbage.
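
The asymmetry described here can be sketched directly (in Python 3 terms):

```python
latin1_bytes = 'éléphant'.encode('latin-1')  # b'\xe9l\xe9phant'
utf8_bytes = 'éléphant'.encode('utf-8')      # b'\xc3\xa9l\xc3\xa9phant'

# Accented latin-1 text is usually *invalid* utf-8, so the strict
# utf-8 decode fails early instead of producing wrong text:
try:
    latin1_bytes.decode('utf-8')
except UnicodeDecodeError:
    print('utf-8 decode rejected the latin-1 bytes')

# The reverse direction always "succeeds", silently making garbage:
print(utf8_bytes.decode('latin-1'))  # Ã©lÃ©phant
```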

If that is anything
but extremely unlikely, I think ascii is the only way to go --raising
and exception early in the game is much better than errors later on.

Here's my take on it. I think that there are always going to be at least some people who are going to want to have a way to use more than just ascii in character string objects, but who are going to be resistant to using Unicode objects for whatever reason. Defaulting to the locale's default encoding was problematic, and telling them to "just use Unicode" will likely cause friction. By allowing auto-converts of utf-8 we are then able to tell them, "Use utf-8, it's an industry standard that is growing in popularity, works in any locale and is able to represent any code point you would ever want to use."

Or for another perspective, it's another check-box on a list of features that can be marked when comparing toolkits. :wink:

Ironically, I'm pretty sure JMF continued to be a user and advocate for
the ANSI builds, even once the unicode builds became available! Am I
remembering that wrong, JMF?

That was my recollection as well, although I think it had more to do with wxSTC using utf-8 internally in the Unicode build causing problems when trying to access the document buffer by byte offsets.

Is it really that common to use utf-8 in your source code, and thus for
string literals? I sure don't (even though my editor does support it --
I always stick with ascii strings, and the escape sequences for
non-ascii characters. It just seems too fragile -- too many editors
don't have good unicode support, and there is no standard for specifying
the encoding in text files.

Using the BOM at the beginning of a file is often used to indicate that it is utf-8 (or another utf encoding). I think I remember seeing something about this being supported even in Windows Notepad since Vista.

···

On 11/23/10 2:49 PM, Christopher Barker wrote:

--
Robin Dunn
Software Craftsman

Folks,

It struck me that a key data point here is: what encodings are people
using for existing python code? I assumed that non-ascii encoding was
pretty rare, but then I saw a post on another python list about
something completely different, and there at the top of some sample code
was:

# -*- coding: utf-8 -*-

which made me think about it a bit. I also found this PEP:

PEP 3120 – Using UTF-8 as the default source encoding | peps.python.org

which proposes making utf-8 the standard for python source code. I can't
tell what the status of that is, though.

PEP 8 (the style guide), however, says:

"Code in the core Python distribution should always use the ASCII or
Latin-1 encoding (a.k.a. ISO-8859-1). For Python 3.0 and beyond,
UTF-8 is preferred over Latin-1, see PEP 3120."

which gives a bit of a preference for accepting latin-1, rather than utf-8.

So I thought I'd do a little google experiment to see what is
really used:

"-*- coding: utf-8 -*-" py : 3,540,000 results

"-*- coding: latin-1 -*-" py : 41,200 results

"-*- coding: iso-8859-15 -*-" py : 40,300 results

are there other ones to search for?

iso-8859-1 is equivalent to latin-1, and it has about 168,000 results. Much bigger, but it still pales in comparison to utf-8.

···

On 11/24/10 10:16 AM, Christopher Barker wrote:

Anyway, not complete by any means, and the fact that some 2.*
interpreters assume latin-1 for the default makes it not entirely fair,
but it does give some evidence that utf-8 is the one to support.

--
Robin Dunn
Software Craftsman

- I firmly believe most of you do not understand what we
commonly (and badly) call "The Coding of Characters": the soul
and heart of the codings and, consequently, the usage of these codings.

I'm sorry, but I don't really think misunderstanding is the issue here -- certainly not in Robin's case. It's a matter of priorities.

I understand your frustration with those of us that can do most of our work with ASCII -- there are a lot of people that put their fingers in their ears when the unicode discussion comes up: "I can't hear you!", they just don't want to deal with it.

But the participants in this discussion do want to deal with it, but also want to make the transition as painless as possible.

You missed the point. I am deliberately living and sticking in the
clean, coherent and limited cp1252 world (also known as windows-1252).

Isn't that a bit painful when you want to do stuff across platforms? (Though it is very close to latin-1, as I understand it.)

There is no way today to use wxPython-unicode in a clean manner,
100% coherently. Do you know why? Because of the StyledTextCtrl. And
do you know why the StyledTextCtrl is the bottleneck? Because it
internally uses the utf-8 coding.

Which was a really poor choice. I have done a bit with it, and yes, it is a nightmare -- not only because it uses utf-8, but because it doesn't provide the methods required to work with it properly. A good unicode aware API would make it transparent to the user what encoding was used internally -- it would only matter to performance (speed and memory). Unicode in the STC is clearly an afterthought.

But that really has little to do with this discussion -- it's a topic for the Scintilla developers.

- I think I have advocated more than everybody in this discussion the
usage of
Unicode and unicode.

certainly -- which is why I was a bit confused -- I was pretty sure you were an ANSI-build user.

The coding of the Python source code has nothing to do
with the codings of the strings used and defined in
this source code.

It doesn't? I thought that's how we'd end up with non-ascii string literals. How about this "realistic" example:

# -*- coding: utf-8 -*-
print 'main'
s = "°"
print s
print len(s)
print [ord(c) for c in s]

And when run (If the utf-8 symbol came through email right)

main
°
2
[194, 176]

so -- that string literal is the two bytes that utf-8 uses for that degree symbol.
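
The arithmetic behind those two bytes can be checked directly (a Python 3 sketch, where the encoding is explicit instead of happening via the source file's coding declaration):

```python
degree = '\u00b0'                 # DEGREE SIGN as a code point
encoded = degree.encode('utf-8')  # the bytes a utf-8 source literal holds

print(len(encoded))   # 2
print(list(encoded))  # [194, 176], i.e. b'\xc2\xb0'
```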

Following your example:

for ascii:

# -*- coding: ascii -*-
print 'main'
s = "c"
print s
print len(s)
print [ord(c) for c in s]

which results in (of course):

main
c
1
[99]

Now rot13:

# -*- encoding: rot13 -*-
cevag 'znva'
f = "p"
cevag f
cevag yra(f)
cevag [beq(p) sbe p va f]

and running it:

znva
p
1
[112]

112 is the rot13 value for "c", so it is storing the encoded value in the byte string.

If I'm getting this right, this all means that if people write their source files in utf-8 (or any other encoding), and don't use unicode objects for literals, they will get utf-8 encoded strings that will get passed into wx.

Literals are the only reason I think it is reasonable to accept raw strings into wx at all.

-Chris

···

On 11/24/10 10:59 AM, jmfauth wrote:

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

Chris.Barker@noaa.gov

Period.

Correct. I removed the "wx" part because this is Python, not wx.

The coding directive at the top of a Python source file is only there
to inform the interpreter which "language" we are speaking [*].

What makes you think a user would be so foolish as to type,
and work with, encoded code points when he can work directly
with code points?

To enter a "degree symbol", enter this u'\N{DEGREE SIGN}'
or this u'\u00b0' and certainly not this '°' or this
'\xc2\xb0' (representations in iso-8859-1 and in ascii of
the utf-8 encoded form of the u'\u00b0' code point).

Anyway, the first thing you have to do before working with the
two latter forms is to decode them and create a unicode code
point.

Python is just reflecting the nature of Unicode. In unicode,
you should think "code points", not "encoded code points".
(Technically it is just impossible; try a regular expression
with a 'unicode' and an encoded unicode.)

In XeTeX, if I wish to enter the degree symbol in my source,
I just enter the code point with \char"00B0 or with a named
command (I don't remember the name).

It is working like Python! Strange, isn't it?

jmf

[*] Even if the source contains a single comment.

This fails
# éléphant

This works
# -*- coding: cp1252 -*-
# éléphant

···

On Nov 24, 9:09 pm, Christopher Barker <Chris.Bar...@noaa.gov> wrote:

If I'm getting this right, this all means that if people write their
source files in utf-8 (or any other encoding), and don't use unicode
objects for literals, they will get utf-8 encoded strings.