Unicode hell.

Hi All,

I'm trying to write a function to 'clean up' text pasted from, say, MS Word so that curly quotation marks, em dashes and so on are replaced with web-safe characters (ASCII and HTML entity codes).

I've set the encoding as latin-1 in as many places as I can think of.

1. I added the following to the top of the file:

#!/usr/bin/python
# -*- coding: latin-1 -*-
# coding=latin-1

2. In the MainWindow class, I set:

wx.SetDefaultPyEncoding('iso-8859-1')

3. In wx.Font() for the main TextCtrl, I set:

encoding = wx.FONTENCODING_ISO8859_1

Here's what has me befuddled: if I paste an angled open quotation mark into IDLE (with the encoding also set to latin-1), it converts the mark to '\x93'.

However, when I do the same in my wxPython app, it converts the mark to '\u201c' and raises the following UnicodeEncodeError:

'latin-1' codec can't encode character '\u201c' in position 0: ordinal not in range(256)

What on earth am I doing wrong?

Thanks in advance for any help.

Regards,
Ryan

Hello,

> 1. I added the following to the top of the file:
>
> #!/usr/bin/python
> # -*- coding: latin-1 -*-
> # coding=latin-1

This only sets the encoding that Python uses to read your source code. It has nothing to do with text that moves in and out of your program at run time.
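As a minimal sketch of that distinction (written for Python 3, where `str` is what Python 2 calls `unicode`; the sample bytes are made up for illustration and are not from the original app): the coding line only governs how the .py file itself is read, while data arriving at run time has to be decoded explicitly.

```python
# -*- coding: latin-1 -*-
# The declaration above only tells the interpreter how to decode the bytes
# of this .py file. It does not touch data handled at run time.

# Hypothetical run-time input: UTF-8 bytes, e.g. read from a file or socket.
raw = b"\xe2\x80\x9chello\xe2\x80\x9d"

# The program has to name the encoding itself when it decodes the data.
text = raw.decode("utf-8")

# Byte count and character count differ, whatever the source encoding says.
print(len(raw), len(text))
```

Changing the coding line at the top to something else would not change what this prints, because every literal in the file is plain ASCII.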

> Here's what has me befuddled: if I paste an angled open quotation mark into IDLE (with the encoding also set to latin-1), it converts the mark to '\x93'.
>
> However, when I do the same in my wxpython app, it converts the mark to '\u201c' and returns the following UnicodeEncodeError:
>
> 'latin-1' codec can't encode character '\u201c' in position 0: ordinal not in range(256)
>
> What on earth am I doing wrong?

This error message suggests that you are doing some operation that mixes a byte string with a unicode object, so the unicode text is being encoded to latin-1 implicitly. My first suggestion would be to make sure you are actually working with unicode objects throughout (something like 'print type(var)' will show you what is being used).
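A rough sketch of that kind of check, in Python 3 syntax (there, `str` is text and `bytes` is what Python 2 called a plain string). The helper name `describe` and the sample text are only illustrative:

```python
def describe(value):
    """Return the type name and repr of a value, for debugging."""
    return "%s: %r" % (type(value).__name__, value)

pasted = "\u201cHello\u201d"        # text: str in Python 3, unicode in Python 2
encoded = pasted.encode("utf-8")    # bytes: Python 2's plain str

print(describe(pasted))
print(describe(encoded))

# The error from this thread shows up at the point where text is turned into
# latin-1 bytes, whether that is done explicitly or behind the scenes.
try:
    pasted.encode("latin-1")
except UnicodeEncodeError as exc:
    error_message = str(exc)
    print(error_message)
```

If the type check shows text everywhere, the next thing to look for is the call that forces a conversion to latin-1.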

What build of wxPython are you using (Unicode or ANSI)?

What type of control are you pasting into? Is the error in the actual paste or when you are trying to process the input?

If you have some sample code that can reproduce your problem it will be easier to give you more detailed advice.

Cody

On Thu, Feb 19, 2009 at 2:48 PM, Ryan McGreal editor@raisethehammer.org wrote:

A couple more notes:

> I've set the encoding as latin-1 in as many places as I can think of.

latin-1 is NOT unicode -- it is an 8-bit encoding that is a superset of ascii. Fine for your code files, but...

> I'm trying to write a function to 'clean up' text pasted from, say, MS Word

I'm pretty sure Word exports unicode. I think how it works is that when you paste, the app being pasted INTO (wx in this case) asks for a unicode version of the string. So if you use a unicode build of wx, things should "just work".

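And if Word isn't giving you unicode, it's probably not giving you latin-1 either, but rather MS's own Latin encoding (Windows-1252, a.k.a. cp1252), which is slightly different: it puts printable characters such as the curly quotes in the 0x80-0x9F range, where latin-1 has only control codes.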
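The difference between the two encodings can be checked directly. This is a small sketch in Python 3 syntax; it probably also explains the '\x93' seen in IDLE, on the assumption that IDLE was really handing back Windows-1252 bytes:

```python
OPEN_QUOTE = "\u201c"  # the curly opening double quote

# Windows-1252 (cp1252) has a slot for the curly quote...
cp1252_bytes = OPEN_QUOTE.encode("cp1252")
print(repr(cp1252_bytes))

# ...while ISO-8859-1 (latin-1) does not, so encoding fails.
try:
    OPEN_QUOTE.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)

# Decoding the same byte as latin-1 "succeeds", but yields a C1 control
# character rather than a quotation mark.
misread = cp1252_bytes.decode("latin-1")
print(repr(misread))
```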

> 2. In the MainWindow class, I set:
> wx.SetDefaultPyEncoding('iso-8859-1')

> 3. In wx.Font() for the main TextCtrl, I set:
>
> encoding = wx.FONTENCODING_ISO8859_1

I don't think you want that either -- just use unicode everywhere except IO.

> Here's what has me befuddled: if I paste an angled open quotation mark into IDLE (with the encoding also set to latin-1), it converts the mark to '\x93'.
>
> However, when I do the same in my wxpython app, it converts the mark to '\u201c' and returns the following UnicodeEncodeError:

Anyone know what the unicode code point is for that open quote?
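One way to look that up is the standard library's `unicodedata` module. A quick sketch (Python 3 syntax), using the character from the error message:

```python
import unicodedata

ch = "\u201c"                  # the character named in the error message
name = unicodedata.name(ch)    # official Unicode character name

print("U+%04X %s" % (ord(ch), name))
# prints: U+201C LEFT DOUBLE QUOTATION MARK
```

So '\u201c' is itself the code point for the opening curly quote, which is consistent with wx handing over unicode text.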

> 'latin-1' codec can't encode character '\u201c' in position 0: ordinal not in range(256)
>
> What on earth am I doing wrong?

Trying to use latin-1. If you want your html to be in latin-1, do everything else first (converting to html escapes, etc.), then convert to latin-1 as the last step.
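A minimal sketch of that ordering, in Python 3 syntax. The names `REPLACEMENTS`, `clean_text` and `to_latin1_html` are hypothetical, and the mapping is only a starting point, not a complete list of what Word can produce:

```python
# Hypothetical mapping; extend it with whatever characters turn up in practice.
REPLACEMENTS = {
    "\u2018": "'",         # left single quotation mark
    "\u2019": "'",         # right single quotation mark
    "\u201c": '"',         # left double quotation mark
    "\u201d": '"',         # right double quotation mark
    "\u2013": "&ndash;",   # en dash
    "\u2014": "&mdash;",   # em dash
    "\u2026": "...",       # horizontal ellipsis
}

def clean_text(text):
    """Step 1: work on unicode text, swapping fancy punctuation for safe forms."""
    for fancy, plain in REPLACEMENTS.items():
        text = text.replace(fancy, plain)
    return text

def to_latin1_html(text):
    """Step 2 (last): encode to latin-1 only once the text is cleaned."""
    cleaned = clean_text(text)
    # Anything latin-1 still cannot hold becomes a numeric entity (&#NNNN;).
    return cleaned.encode("latin-1", "xmlcharrefreplace")

sample = "\u201cCaf\u00e9\u201d \u2014 \u20ac5"
result = to_latin1_html(sample)
print(result)
```

The `xmlcharrefreplace` error handler is a standard codec option, so characters missing from the mapping (the euro sign in the sample) do not raise an error at the final encoding step.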

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

Chris.Barker@noaa.gov