ANSI vs. Unicode

Just a quick question about these two versions of wxPython. If I don't purposely take advantage of the Unicode features of the Unicode build of wxPython, does that mean that my program will run the same with either version?

I guess another way to ask it is, if I build a program with the ANSI version of wxPython, does that hinder it in any way even if I'm not trying to do anything special with Unicode characters?

In other words, to put it yet another way, is Unicode something you must use explicitly in order to get any use out of the Unicode build, otherwise they are the same?

I'm just trying to figure out which version I really need. I figure since I even have to ask the question, I could probably settle for ANSI, but I want to make sure that this doesn't handcuff me later, such as when I might want to switch to the Unicode version.

Thanks,
John

Just a quick question about these two versions of wxPython. If I don't
purposely take advantage of the Unicode features of the Unicode build of
wxPython, does that mean that my program will run the same with either
version?

Not necessarily. It doesn't take too much work to inadvertently paste a
unicode character into some control from some other unicode-enabled
application (like your web browser, email client, etc.). On a unicode
build, that character will look as you expect it to, and will work just
fine. That is, until you try to save the content of that control to
disk. Then you get to have fun with the wonderful world of encodings.

Using an ANSI build, you'll probably either not be able to paste that
bit of unicode text, or when you do, any non-ascii characters will
probably be displayed as a garbage character of some kind (I've noticed
boxes generally). On the upside, you also won't need to bother with
encodings when trying to save data.

Either way, you may need to deal with encodings when *reading* data from
disk, if that data can come from disparate sources with/without
encodings, etc.
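A minimal sketch of what "dealing with encodings" on save and load can look like. The helper names (save_text, load_text) and the fallback order (utf-8, then latin-1) are assumptions made for illustration, not something from wxPython or from this thread:

```python
import io
import os
import tempfile


def save_text(path, text, encoding="utf-8"):
    """Write text to disk using an explicit encoding."""
    with io.open(path, "w", encoding=encoding) as f:
        f.write(text)


def load_text(path, encodings=("utf-8", "latin-1")):
    """Try each candidate encoding in order; return (text, encoding used)."""
    with io.open(path, "rb") as f:
        raw = f.read()
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode %r" % path)


# Round trip a string containing a non-ASCII character.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
save_text(demo_path, u"caf\u00e9")
loaded_text, used_encoding = load_text(demo_path)
```

Note that latin-1 can decode any byte sequence, so putting it last makes it a catch-all; whether that is acceptable depends on where your files come from.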

I guess another way to ask it is, if I build a program with the ANSI
version of wxPython, does that hinder it in any way even if I'm not
trying to do anything special with Unicode characters?

It depends.

In other words, to put it yet another way, is Unicode something you must
use explicitly in order to get any use out of the Unicode build,
otherwise they are the same?

If your software is open source, there are good odds that some
non-english-speaking user is going to pick it up and try to use it.
They will be putting in characters from their language, and if/when it
doesn't work, you will get a "please add unicode support" request.

I'm just trying to figure out which version I really need. I figure
since I even have to ask the question, I could probably settle for ANSI,
but I want to make sure that this doesn't handcuff me later, such as
when I might want to switch to the Unicode version.

As long as you are explicitly handling saving/loading in an
encoding-aware way, ANSI may be sufficient (don't load non-ascii files).
Earlier versions of PyPE didn't support unicode, or saving and loading
with a particular encoding. I eventually got a feature request, and added
the necessary support, which only runs in Unicode builds. The current
version includes detection of encoding for Python coding: directives,
XML encoding declarations, and BOMs. If you download the source version,
it can be seen in pype.py:PythonSTC.SetText and GetText.
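For readers who want a feel for that kind of detection, here is a rough sketch. It is not PyPE's actual code; the function name guess_encoding, the regular expressions, and the "ascii" default are all assumptions chosen for illustration:

```python
import codecs
import re

# Check the longer BOMs first: the UTF-32 LE BOM begins with the
# same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]
# PEP 263 style "coding: name" or "coding=name" inside a comment.
_CODING_RE = re.compile(br"^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)")
# XML declaration such as <?xml version="1.0" encoding="name"?>
_XML_RE = re.compile(br"""^<\?xml[^>]*encoding=["']([-\w.]+)["']""")


def guess_encoding(raw, default="ascii"):
    """Guess the encoding of a byte string from a BOM or a declaration."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    m = _XML_RE.match(raw)
    if m:
        return m.group(1).decode("ascii")
    # PEP 263 only considers the first two lines of a source file.
    for line in raw.splitlines()[:2]:
        m = _CODING_RE.match(line)
        if m:
            return m.group(1).decode("ascii")
    return default
```

The returned name can be passed straight to bytes.decode(); the "utf-8-sig", "utf-16" and "utf-32" codecs consume the BOM themselves.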

Depending on what you plan to do with the content, and/or/if you plan on
having any sort of persistence, you may need to deal with unicode and
encodings.

- Josiah

···

John Salerno <johnjsal@NOSPAMgmail.com> wrote:

Josiah Carlson wrote:

Depending on what you plan to do with the content, and/or/if you plan on
having any sort of persistence, you may need to deal with unicode and
encodings.

Thanks very much. One more question: is it ok to use the Unicode version, even if I don't deal with Unicode (just to be "safe")? Does the Unicode build cause any extra overhead or anything else that ANSI doesn't do, even if I don't use Unicode with it?

I think that we will have no choice in the near future, as IIUC Robin is planning to release only unicode from 2.7 on. However, Josiah has made a good recommendation: if you plan to release your work to non-ansi users (:-D), you will probably have to deal with unicode things. Maybe "saving" is not a feature you plan to use, but I always find that, in some way, I have to "load" something that the user specifies ;-)

Andrea.

Generally a slight memory increase during runtime, because the
unicode dll is slightly larger, and because every native control and
unicode string will generally be representing every character internally
as 2 bytes rather than 1.

If you know what you are getting yourself into, I would suggest just
using the unicode version and figuring out what kind of persistence you
are going to need between runs (preferences, etc.), and making sure that
it is at least unicode agnostic (for preference saving, the miniconf
module works reasonably well:
http://cheeseshop.python.org/pypi?:action=display&name=miniconf)

- Josiah

Gavana, Andrea wrote:

I think that we will have no choice in the near future, as IIUC Robin
is planning to release only unicode from 2.7 on.

That's not definite yet, but I am certainly toying with the idea.

···

--
Robin Dunn
Software Craftsman
http://wxPython.org Java give you jitters? Relax with wxPython!

Robin Dunn wrote:

That's not definite yet, but I am certainly toying with the idea.

I think you should do it, that way I don't have to worry about choosing. ;)

and because every native control and unicode string will generally be representing every character internally
as 2 bytes rather than 1.

This depends on whether you choose utf-8 or utf-16. utf-8 uses only 1 byte to encode ascii chars, whereas utf-16 uses two. The advantage of utf-16 is that it uses fewer bytes for most CJK text (two per character where utf-8 typically needs three). The advantages of utf-8 are that ascii converts directly (same hex values) at 1 byte per character, though CJK characters can take up to four bytes each.
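The byte counts are easy to check for yourself. A small sketch (the sample characters and the helper name encoded_sizes are picked here purely for illustration):

```python
samples = {
    "ASCII 'a'": u"a",
    "Latin e-acute": u"\u00e9",
    "CJK U+4E2D": u"\u4e2d",
    "CJK Ext. B U+20000": u"\U00020000",
}


def encoded_sizes(ch):
    """Return (utf-8 length, utf-16 length) in bytes for one character."""
    # The "-le" codec is used so that no BOM is counted in the length.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for label, ch in samples.items():
    utf8_len, utf16_len = encoded_sizes(ch)
    print("%-20s utf-8: %d  utf-16: %d" % (label, utf8_len, utf16_len))
```

Characters outside the Basic Multilingual Plane, such as U+20000, take four bytes in both encodings.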

Rune,

reporting from the wonderful world of encoding.

Note that I said "native control" and "unicode string". Not "encoded
unicode string". Unless one goes to extraordinary measures, Python is
compiled with a 2-byte per code point representation (UCS-2), Windows
uses 2-byte unicode characters (also UCS-2), and the underlying native
controls (in Windows and I believe wxGTK) also use 2-byte characters (in
UCS-2).

For writing to disk, you can certainly use utf-8 as an encoding to get
1-byte characters for many European code points, but that wasn't what I
was pointing out.

- Josiah

Yup, that is true :)

- Rune

Josiah Carlson wrote:

Unless one goes to extraordinary measures, Python is
compiled with a 2-byte per code point representation (UCS-2), Windows
uses 2-byte unicode characters (also UCS-2),

It depends. Python can be built such that a Unicode character is either 2 bytes or 4 bytes. Most Pythons distributed with *nix distros will use the 4-byte option, although if you build Python yourself you will end up with the 2-byte option by default. Windows and OSX builds use the 2-byte option. You can tell what you have by looking at the sys.maxunicode value. If it is 65535 then your unicode chars are 2 bytes each. If it's something like 1114111 then they are 4 bytes.
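A quick way to run that check, as a sketch (the function name unicode_build_kind is made up for this example). One caveat that postdates this thread: from Python 3.3 onward sys.maxunicode is always 1114111 and strings use a flexible internal representation, so the narrow/wide distinction only applies to older interpreters.

```python
import sys


def unicode_build_kind():
    """Return "narrow" for a 2-byte build, "wide" otherwise."""
    return "narrow" if sys.maxunicode == 0xFFFF else "wide"


print("sys.maxunicode = %d (0x%X)" % (sys.maxunicode, sys.maxunicode))
print("unicode build: %s" % unicode_build_kind())
```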

and the underlying native
controls (in Windows and I believe wxGTK) also use 2-byte characters (in
UCS-2).

GTK uses utf-8 for everything.

In a Unicode build of wxWidgets/wxPython the wxString class will hold whatever the compiler's wchar_t type evaluates to. This can vary from platform to platform, and even from compiler to compiler. In practice though that's not a big deal for wxPython because there are functions in the Python C API that convert to/from wchar_t and whatever type Python is using for a Unicode char type, and if they happen to be the same then the functions are essentially a nop and have little overhead. So I use those functions when converting to/from wxString and Python Unicode objects and all is well.
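If you are curious how wide wchar_t is on your own platform, the standard ctypes module can report it. This is only an illustrative sketch from the Python side; it is not how wxPython performs its conversions, which happen in C through the Python C API as described above.

```python
import ctypes

# ctypes.c_wchar corresponds to the C compiler's wchar_t type.
wchar_size = ctypes.sizeof(ctypes.c_wchar)
print("wchar_t is %d bytes on this platform" % wchar_size)

# Round-trip a Python unicode string through a wchar_t buffer.
buf = ctypes.create_unicode_buffer(u"wx\u00e9")
round_tripped = buf.value
```

Typically this prints 2 on Windows and 4 on Linux and OS X, though as noted it can vary with the compiler.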

···

--
Robin Dunn
Software Craftsman
http://wxPython.org Java give you jitters? Relax with wxPython!