The alternative is to have "encode/decode" all over the place.
Hi,
No, when working with a Unicode application you should only do
conversions at input/output points and use Unicode everywhere
internally.
Agreed.
Either way one still has to pay attention to the encoding and think about
it.
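In modern Python 3 terms, the "convert only at the boundaries" pattern described above might look like the following sketch (the function names and the choice of UTF-8 are assumptions for illustration, not code from this thread):

```python
# Decode bytes to text at the input boundary, work with text (str)
# internally, and encode back to bytes only at the output boundary.

def read_names(raw: bytes) -> list[str]:
    # Input boundary: decode once, assuming the source is UTF-8.
    text = raw.decode('utf-8')
    return [line.strip() for line in text.splitlines() if line.strip()]

def write_names(names: list[str]) -> bytes:
    # Output boundary: encode once, back to UTF-8 bytes.
    return '\n'.join(names).encode('utf-8')

raw = 'Andr\xe9\nBj\xf6rk\n'.encode('utf-8')
names = read_names(raw)   # everything in between sees only text
print(names)              # ['André', 'Björk']
```

Everything between those two boundaries never has to think about encodings, which is exactly the point being made above.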
I have used it in my application for a few years now (since I switched to the
Unicode build of wxPython, defined all my character data as unicode,
and put "# -*- coding: utf-8 -*-" in all my script files). It hasn't
caused me any problems, and it has reduced the number of places where I
needed to use encode/decode.
The '#-*- coding...' line has nothing to do with how strings and bytes
are interpreted within your application. That line is only to tell the
interpreter how to handle the text in your script.
That is confusing to me. I thought it defines the encoding of the .py file, i.e. of any string, constant, or comment. IIRC one also has to ensure that the editor one is using saves the file in the same encoding, otherwise things can get confusing.
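Robin's distinction (the cookie only tells the interpreter how to decode the source text, and says nothing about the resulting string objects) can be demonstrated in Python 3 by compiling source bytes under a coding declaration; this is a sketch for illustration, not code from the thread:

```python
# Source *bytes* containing the raw byte 0xe9 after 'abc'.
# The latin-1 coding cookie tells the tokenizer to read 0xe9 as 'é'.
src = b"# -*- coding: latin-1 -*-\ns = 'abc\xe9'\n"
ns = {}
exec(compile(src, '<demo>', 'exec'), ns)
print(ns['s'])            # abcé

# Without a cookie the interpreter falls back to its default
# (UTF-8 in Python 3, ASCII in Python 2), and the same byte makes
# the *source itself* undecodable -- a compile-time failure.
try:
    compile(b"s = 'abc\xe9'\n", '<demo>', 'exec')
except (SyntaxError, ValueError) as exc:
    print('source could not be decoded:', type(exc).__name__)
```

So the cookie matters only while the file is being read; once `s` exists, it carries no memory of it.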
What is "ListCtrlPrinter"?
It is a wrapper of wx.Printout in ObjectListView.
I don't think that this is a standard
wxPython class. But my guess would be that it may be passing raw Unicode
bytes to whatever is being used to create the PDF, where that code is
expecting an encoded string.
The modified code I had posted worked for me on Windows / Python 2.6 / wxPython 2.8: the accented characters were correct on the monitor, in the preview, on the printed output, and when viewing the PDF. The OP still had a problem, but he is on *nix.
Yes, but I think it's just to specify the encoding of unicode string literals. IOW, it determines how the foo in u"foo" is converted to a unicode value at compile time, so that the value embedded in the byte-code for that literal will be a unicode object. See section 2.2.3, "Source Code Encoding", in the Python tutorial chapter "Using the Python Interpreter".
The value returned by sys.getdefaultencoding() is what is used by default to convert to/from string and unicode values when coerced by type specific code. For example
str(unicode_value)
s = "converted to string: %s" % unicode_value
The same applies to calls out to extension-module functions that use APIs like PyArg_ParseTuple and specify that either a string or a unicode type is expected.
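For context, the implicit str/unicode coercion described here is specific to Python 2; Python 3 removed it, so the equivalent mixing fails loudly instead of silently going through the default encoding. A small sketch of the Python 3 behaviour (not of the code discussed in the thread):

```python
unicode_value = 'abc\xe9'            # text: 'abcé'

# Explicit conversion always works, because the encoding is named:
encoded = unicode_value.encode('utf-8')
print(encoded)                       # b'abc\xc3\xa9'

# Implicit mixing of text and bytes, which Python 2 quietly resolved
# through sys.getdefaultencoding(), is now a TypeError:
try:
    unicode_value + b'!'
except TypeError as exc:
    print('no implicit coercion:', exc)
```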
···
On 10/25/10 2:10 PM, werner wrote:
The '#-*- coding...' line has nothing to do with how strings and bytes
are interpreted within your application. That line is only to tell the
interpreter how to handle the text in your script.
That is confusing to me. I thought it defines the encoding of the .py
file, i.e. any string/constant/remark and IIRC one also has to ensure
that the editor one is using uses the same encoding otherwise it can get
confusing.
The value returned by sys.getdefaultencoding() is what is used by
default to convert to/from string and unicode values when coerced by
type specific code. For example
str(unicode_value)
s = "converted to string: %s" % unicode_value
And probably in absurd cases like this:
>>> 'abcé'.encode('utf-8')
Traceback (most recent call last):
File "<psi last command>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in
position 3: ordinal not in range(128)
but this logically works:
>>> 'abc'.encode('utf-8')
'abc'
jmf
···
On Oct 25, 11:36 pm, Robin Dunn <ro...@alldunn.com> wrote:
You need to declare the encoding in the module (let it be written on the first line -- or on the second line if #! /usr/bin/env python is declared as well, which needs to be at the very top) by # coding=utf-8 or # -*- coding: utf-8 -*-
Then just add a 'u' in front of the string literal, like this: u'abcé'
What exactly makes you think this is absurd? It seems quite
logical?
Karsten
···
On Tue, Oct 26, 2010 at 06:59:47AM -0700, jmfauth wrote:
And probably in absurd cases like this:
>>> 'abcé'.encode('utf-8')
Traceback (most recent call last):
File "<psi last command>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in
position 3: ordinal not in range(128)
You need to declare the encoding in the module (let it be written on the
first line -- or on the second line if #! /usr/bin/env python is declared
as well, which needs to be at the very top) by # coding=utf-8 or # -*-
coding: utf-8 -*-
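As a concrete illustration of that placement rule (file contents assumed for the example, not taken from the thread), a script using both declarations would start like this, with the shebang on line one and the coding cookie on line two:

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-

# With the cookie in place the interpreter decodes this source file as
# UTF-8, so the literal below is read correctly.  (In Python 2 the 'u'
# prefix makes it a unicode object; in Python 3 all literals are text.)
greeting = u'abcé'
print(greeting)
```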
No. Robin Dunn has already replied to such a comment here
Because 'abcé' is a string of type <str>. It *has* a coding,
it *is* in a coding format, but it can not *be encoded*.
Only strings of type <unicode> can be encoded.
jmf
···
On Oct 26, 4:27 pm, Karsten Hilbert <Karsten.Hilb...@gmx.net> wrote:
On Tue, Oct 26, 2010 at 06:59:47AM -0700, jmfauth wrote:
> And probably in absurd cases like this:
> >>> 'abcé'.encode('utf-8')
> Traceback (most recent call last):
> File "<psi last command>", line 1, in <module>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in
> position 3: ordinal not in range(128)
What exactly makes you think this is absurd? It seems quite
logical?
Because 'abcé' is a string of type <str>. It *has* a coding,
Namely either sys.getdefaultencoding() or the encoding
that was put at the top of the file in the coding
directive. That's the knack.
it *is* in a coding format, but it can not *be encoded*.
Aha, I see.
Does the file you see this in have a coding directive ? I
would assume it doesn't. If that's true the following
happens:
- python sees the string in the file
- python sees the request for "turning" it into utf8
- python searches for the *current* encoding of the string
- python does not find anything at the top of the file
- python looks at sys.getdefaultencoding
- python finds "ascii"
- python tries to (internally)
  - turn the (supposedly) "ascii"-encoded string into unicode by
    doing 'abc-strange_e'.decode('ascii') (which, of course, fails)
  - because it needs the unicode version thereof to turn
    *that* into utf8
Does that make sense ?
Karsten
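Karsten's step list can be reproduced explicitly in Python 3, where the implicit decode no longer happens but the failing step can be performed by hand. A sketch, with the byte string standing in for the latin-1 encoded 'abcé' that the Python 2 str literal held:

```python
# 'abcé' as latin-1 bytes -- what the Python 2 str held in a file
# without a (correct) coding directive.
data = b'abc\xe9'

# Python 2's str.encode first decoded the bytes with the default
# codec ('ascii'); done by hand, that step fails the same way:
try:
    data.decode('ascii')
except UnicodeDecodeError as exc:
    print(exc)   # 'ascii' codec can't decode byte 0xe9 in position 3 ...

# Naming the real encoding makes the decode-then-encode chain work:
print(data.decode('latin-1').encode('utf-8'))   # b'abc\xc3\xa9'
```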
···
On Tue, Oct 26, 2010 at 07:45:16AM -0700, jmfauth wrote:
On Oct 26, 4:27 pm, Karsten Hilbert <Karsten.Hilb...@gmx.net> wrote:
> On Tue, Oct 26, 2010 at 06:59:47AM -0700, jmfauth wrote:
Does the file you see this in have a coding directive ? I
would assume it doesn't. If that's true the following
happens:
- python sees the string in the file
- python sees the request for "turning" it into utf8
- python searches for the *current* encoding of the string
- python does not find anything at the top of the file
If my understanding is correct then those two steps do not actually happen. The coding specification at the top of the file is used at compile time, not run time, and string objects do not have a "current encoding"; they are just a series of bytes, and Python does not keep track of any encoding information about them. The only time Python knows what non-default encoding to use for converting a string object to a unicode object is when you specify it by passing the name to decode(); otherwise it uses the default.
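Robin's last point, that Python only knows a non-default encoding when you hand it to decode(), can be shown directly (a Python 3 sketch, with the byte value chosen for illustration):

```python
raw = b'abc\xe9'   # bytes carry no attached encoding information

# The object itself has no "current encoding"; the caller decides
# how the bytes are to be interpreted, and different choices give
# different text for the very same byte:
print(raw.decode('latin-1'))   # abcé
print(raw.decode('cp437'))     # same byte, a different character

# With the wrong (or default 'ascii') codec the bytes fail to decode:
try:
    raw.decode('ascii')
except UnicodeDecodeError:
    print('ascii cannot decode 0xe9')
```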
···
On 10/26/10 8:13 AM, Karsten Hilbert wrote:
- python looks at sys.getdefaultencoding
- python finds "ascii"
- python tries to (internally)
- turn the (supposedly) "ascii"-encoded string into unicode by
doing 'abc-strange_e'.decode('ascii') (which, of course, fails)
- because it needs the unicode-version thereof to turn
*that* into utf8
> - python sees the string in the file
> - python sees the request for "turning" it into utf8
> - python searches for the *current* encoding of the string
> - python does not find anything at the top of the file
If my understanding is correct then those two steps do not actually
happen. The coding specification at the top of the file is used at
compile time, not run time, and string objects do not have a "current
encoding",
I should have written:
- python searches for the *current* encoding of the string
- python does not find anything at the top of the file
- python looks at sys.getdefaultencoding
which is more correct but still wrong
they are just a series of bytes and Python does not keep
track of any encoding information about them.
The only time Python
knows what non-default encoding to use for converting a string object
to a unicode object is when you specify it by passing the name to
decode(), otherwise it uses the default.
That explains it, IMO, at any rate.
Karsten
···
On Tue, Oct 26, 2010 at 12:30:08PM -0700, Robin Dunn wrote:
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346