Unicode hell.

Tim_Roberts · July 12, 2019, 6:29pm

I'm trying to write a function to 'clean up' text pasted from, say, MS
Word so that angled quotation marks, emdashes and so on are replaced
with web-safe characters (ASCII and HTML entity codes).

I've set the encoding as latin-1 in as many places as I can think of.

How are you getting the string, exactly?

1. I added the following to the top of the file:

#!/usr/bin/python
# -*- coding: latin-1 -*-
# coding=latin-1

2. In the MainWindow class, I set:

wx.SetDefaultPyEncoding('iso-9959-1')

You know that should be 8859, not 9959, right?

Here's what has me befuddled: if I paste an angled open quotation mark
into IDLE (with the encoding also set to latin-1), it converts the
mark to '\x93'.

However, when I do the same in my wxpython app, it converts the mark
to '\u201c' and returns the following UnicodeEncodeError:

"'latin-1' codec can't encode character '\u201c' in position 0:
ordinal not in range(256)"

If you have a Unicode string containing that, it would work. That is,
u"\u201c" is a left double quotation mark, which happens to be the same
as "\x93" in the Windows 1252 encoding. That symbol is NOT present in
Latin-1 (which is just another name for ISO-8859-1), so the errors are
correct.
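
To make the difference concrete, here is a minimal sketch. It assumes
Python 3, where every str is Unicode (the original code is Python 2
era), and the variable names are only illustrative. It encodes the same
character with both codecs and records what happens:

```python
# Sketch, assuming Python 3: one character, two codecs.
left_quote = "\u201c"  # LEFT DOUBLE QUOTATION MARK

# Windows-1252 has this character; it lives at byte 0x93.
cp1252_bytes = left_quote.encode("cp1252")

# Latin-1 (ISO-8859-1) has no such character, so encoding raises.
try:
    left_quote.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# Going the other way: byte 0x93 decoded as Latin-1 is the C1 control
# character U+0093, not a quotation mark.
control_char = b"\x93".decode("latin-1")

print(cp1252_bytes, latin1_ok, repr(control_char))
```

So the '\x93' seen in IDLE is most likely the Windows-1252 byte for the
mark, not a Latin-1 character.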

What on earth am I doing wrong?

Well, what do you want? Where are you trying to save this? You can
work with the Unicode string internally, but when you save it somewhere,
you'll have to decide how to encode it.
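
One possible way to do the clean-up at that point, as a rough sketch
only: keep the text as Unicode, map the common Word punctuation to ASCII
stand-ins, and let the codec's "xmlcharrefreplace" error handler turn
whatever is left into numeric HTML entities. This assumes Python 3. The
REPLACEMENTS table and the clean_text name are hypothetical, and the
choice of stand-ins (for example "--" for an em dash) is a guess at
what is wanted:

```python
# Hypothetical mapping from Word-style punctuation to ASCII stand-ins.
# Extend or change it to suit; these choices are assumptions.
REPLACEMENTS = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # horizontal ellipsis
}

# str.maketrans accepts a dict of single characters to strings.
_TABLE = str.maketrans(REPLACEMENTS)


def clean_text(text):
    """Return an ASCII-only version of a Unicode string."""
    text = text.translate(_TABLE)
    # Anything still outside ASCII becomes a numeric entity like &#233;
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(clean_text("\u201cHi\u201d \u2014 caf\u00e9"))
```

The point of encoding last is that the decision about the output
encoding is made once, at the boundary, instead of in every place the
string is handled.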

On Thu, 19 Feb 2009 14:48:35 -0600, "Ryan McGreal" <editor@raisethehammer.org> wrote:

--
Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.