A lot of my programming involves working with text that contains accented characters (specifically, French text). Sometimes the text comes from a TextCtrl or other wx widget, sometimes it comes from a file. It’s all the same to me, but I find that some of it is unicode and some is not, and this causes me endless problems when I want to test whether two texts are the same or to manipulate text. I know I can convert one type to the other, e.g., unicode to an encoded string with .encode() and vice versa with .decode(), but this is an ad hoc mess that I’m continually having to patch up when problems arise.
A simple example: To convert a text x to lower case, where x might be either an encoded string or unicode, I ended up writing:
try:
    y = x.lower()
except:
    y = x.lower().encode('cp1252')
The try clause succeeds with some texts and the except clause succeeds with the others; neither clause succeeds with every text.
Does someone know a better, more general, way to deal with unicode vs encoded strings? Is the problem that wx is unicode and Python 2 isn’t?
(I’m using wxPython 2.8.11.0, wxMSW, unicode, with Python 2.7. The first line in my Python programs is “# -*- coding: cp1252 -*-”.)
Then in your program use only unicode, no more byte strings! I.e. every literal should be u'something' and not 'something'.
All input should be decoded to unicode as it comes in, and all output encoded (to utf-8 or whatever the destination needs) as it goes out.
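A minimal sketch of that decode-on-input, encode-on-output idea, assuming the byte strings coming from files are cp1252. The helper names (to_text, same_text, read_text, write_text) are made up for illustration and are not part of Python or wx; the code is written so it should run on Python 2.7 as well as Python 3.

```python
# -*- coding: utf-8 -*-
import io


def to_text(value, encoding='cp1252'):
    """Return value as unicode text, decoding it first if it is a byte string.

    Hypothetical helper; the cp1252 default is an assumption about the data.
    """
    if isinstance(value, bytes):  # on Python 2, bytes is another name for str
        return value.decode(encoding)
    return value


def same_text(a, b, encoding='cp1252'):
    """Case-insensitive comparison done entirely on unicode text."""
    return to_text(a, encoding).lower() == to_text(b, encoding).lower()


def read_text(path, encoding='cp1252'):
    """Read a whole file as unicode; io.open decodes while reading."""
    with io.open(path, 'r', encoding=encoding) as f:
        return f.read()


def write_text(path, text, encoding='utf-8'):
    """Write unicode text; io.open encodes while writing."""
    with io.open(path, 'w', encoding=encoding) as f:
        f.write(text)
```

With something like this, text that is already unicode (for example from a unicode-build TextCtrl) passes through to_text unchanged, and the lower-casing example from the question becomes y = to_text(x).lower() with no try/except.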
In my app I also do this in my app_base.py (inherits from wx.App and InspectionMixin):
import sys

# people say one should leave this alone and use decode/encode, or define this
# in sitecustomize.py
# neither of them really works for me, so as long as the following does,
# this is what I will do until I switch to Py 3.x
if hasattr(sys, "frozen"):  # py2exe does not run site.py
    sys.setdefaultencoding('utf-8')
    del sys.setdefaultencoding
else:  # the Python interpreter needs to reload sys to get the function back
    reload(sys)
    sys.setdefaultencoding('utf-8')
    del sys.setdefaultencoding
As mentioned in the comment above, when googling I have seen posts that strongly recommend against doing this, but I found it easier, and it got rid of some of the errors I ran into when first switching to Unicode (some years ago, so I don't remember the exact errors, sorry).
Werner
···
On 10/20/2011 04:44 AM, Patrick Maher wrote:
[snip: original post, quoted in full above]
Belated thanks to Werner for his excellent advice, which I’ve tried to follow. But I’ve run into one problem: It appears that, although Pickle accepts unicode keys, Shelve doesn’t. This is demonstrated in the attached little program. Does anyone know a way to use Shelve with unicode keys?
At first I had the same issue, but modified as follows it works:
# -*- coding: utf-8 -*-
# NB: the "coding" line MUST be either the 1st or the 2nd line of the file
import pickle
import shelve

# Define test data
testdic = {}
# testdic[u'1'] = 'This is item 1'
testdic['bêtà'] = 'This is item 1 àçéèö'

# Save it with pickle (binary mode, so Windows does not translate newlines)
f = open('test.pkl', 'wb')
pickle.dump(testdic, f)
f.close()
print "pickle completed"

# Save it with shelve (works here because the key is a byte str, not unicode)
db = shelve.open('test.shv')
for key in testdic:
    db[key] = testdic[key]
db.close()
print "shelve completed"
JY
···
On Wed, 21 Dec 2011 12:29:19 -0800 (PST) Patrick Maher <patrick@maher1.net> wrote:
Shelve simply does not work with unicode keys. Although this
has been fixed in Python 3, there will be no fix for
Python 2. Workaround: encode the unicode keys as bytes
(<type 'str'>), in utf-8 or cp1252. If I remember correctly,
you are using Windows.
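One way that workaround might look, as a sketch only: keys are encoded to utf-8 bytes just before they go into the shelf and decoded again when read back. The function names (shelf_key, text_key, save_shelf, load_shelf) are invented for this example. The version check is there because Python 3's shelve takes text keys directly, so the encoding step is applied on Python 2 only.

```python
# -*- coding: utf-8 -*-
import shelve
import sys

PY2 = sys.version_info[0] == 2


def shelf_key(key, encoding='utf-8'):
    """Return a key shelve accepts: encoded bytes on Python 2, text on Python 3."""
    if PY2 and not isinstance(key, bytes):
        return key.encode(encoding)
    return key


def text_key(key, encoding='utf-8'):
    """Turn a key read back from the shelf into unicode text again."""
    if isinstance(key, bytes):
        return key.decode(encoding)
    return key


def save_shelf(path, data):
    """Store every item of data (a dict with unicode keys) in a shelf file."""
    db = shelve.open(path)
    try:
        for key, value in data.items():
            db[shelf_key(key)] = value
    finally:
        db.close()


def load_shelf(path):
    """Read the shelf back into a plain dict with unicode keys."""
    db = shelve.open(path)
    try:
        return dict((text_key(k), db[k]) for k in db.keys())
    finally:
        db.close()
```

Usage would be along the lines of save_shelf('test.shv', {u'bêtà': u'This is item 1'}) followed by load_shelf('test.shv'). Keeping the encode and decode steps in two small functions means the rest of the program can stay unicode-only.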
The default encoding (as returned by sys.getdefaultencoding())
should be 'ascii' for Python 2 and 'utf-8' for Python 3. If not,
Python may not work properly (esp. true for hashing and for
dicts with unicode keys).
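A quick way to check this at startup could look like the following. It is only a sketch, and both function names are hypothetical; sys.getdefaultencoding() is the real call that reports the current value.

```python
import sys


def expected_default_encoding():
    """The stock default: 'ascii' on Python 2, 'utf-8' on Python 3."""
    return 'ascii' if sys.version_info[0] == 2 else 'utf-8'


def default_encoding_untouched():
    """True if nothing has changed the interpreter's default encoding."""
    return sys.getdefaultencoding() == expected_default_encoding()
```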
The coding directive on top of a .py script is only here
to inform the Python parser. It has no impact on the intrinsic
coding of a <type 'str'> you may use in your code.
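A small illustration of why byte strings are awkward to compare, offered as a sketch rather than anything specific to the programs discussed here: the same text gives different bytes under cp1252 and utf-8, and a byte string keeps no record of which encoding produced it.

```python
# -*- coding: utf-8 -*-
text = u'\xe9t\xe9'                  # the word "été" as unicode text

as_cp1252 = text.encode('cp1252')    # 3 bytes: one per character
as_utf8 = text.encode('utf-8')       # 5 bytes: each "é" takes two

# Same text, different bytes, so comparing the raw byte strings fails...
bytes_match = (as_cp1252 == as_utf8)

# ...while decoding each with its own encoding recovers identical text.
text_match = (as_cp1252.decode('cp1252') == as_utf8.decode('utf-8'))
```

Here bytes_match is False and text_match is True, which is why comparing after decoding to unicode is the safer habit.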
jmf
···
On 21 Dec, 21:29, Patrick Maher <patr...@maher1.net> wrote:
[snip: message quoted in full above]