unicode handling

Thomas5 · August 3, 2006, 10:38pm


On Thu, 3 Aug 2006 17:04:08 +1200, "Thomas Thomas" <thomas@mindz-i.co.nz> wr
>Then you need to figure out what encoding the file is using.

How do I do this?


> > It will be as straightforward as copying and pasting the content below
> > into Notepad
> ------------------------
> string MetaDataPrompt = "Discovery No";
> string MetaDataFieldName = "Discovery No";
> string MetaDataType = "string";
> string MetaDataValue = "£500";
> string MetaDataPrompt = "comments";
> string MetaDataFieldName = "Comments";
> string MetaDataType = "string";
> string MetaDataValue = "Energy Scope £500";
> -----------------------------------------------------
> > and try reading it from that file..
>If you do that, you will have an 8-bit file encoded with whatever your
>system's default encoding is.

>>> sys.getdefaultencoding()
'ascii'
>>> sys.getfilesystemencoding()
'mbcs'
>>>
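
A side note on the session above: as far as I know, sys.getdefaultencoding() only governs Python's implicit str/unicode conversions, not what an editor such as Notepad writes to disk. A minimal sketch, assuming the standard locale module, of asking for the encoding the operating system prefers for text data. The names printed depend on the machine, so no expected output is shown.

```python
import codecs
import locale
import sys

# Encoding Python uses for implicit conversions ('ascii' in the session above).
implicit = sys.getdefaultencoding()

# Encoding the operating system prefers for text data. On a Western European
# Windows install this is typically a code page such as cp1252 (an assumption,
# not something the thread confirms).
preferred = locale.getpreferredencoding()

# codecs.lookup() raises LookupError if Python has no codec of that name.
preferred_codec = codecs.lookup(preferred)

print(implicit)
print(preferred)
print(preferred_codec.name)
```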

>>> d=u'ENERGY SCOPE \xa3500'
>>> c = 'ENERGY SCOPE \xa3500'
>>> c.decode('latin-1') == d
True
>>>

Thanks Josiah, this will work for me.


>Because of that, Python, by default, does not assume an encoding. When
>it encounters a byte outside of the standard ASCII range (0-127), it pukes.
>It is quite likely that your file is iso-8859-1. Try:
> inifile = codecs.open(filename, 'r', encoding='iso-8859-1')

Thanks Tim, this works fine. But I still can't understand why, because

the system says the default encoding is ascii and

inifile = codecs.open(filename, 'r', encoding='latin-1')

works fine as well, giving me the desired results:

text = inifile.read().split('\n')

>>> text
[u'string MetaDataPrompt = "Discovery No";\r', u'\r', u'string MetaDataFieldName = "Discovery No";\r', u'\r', u'string MetaDataType = "string";\r', u'\r', u'string MetaDataValue = "\xa3500";\r', u'\r', u'string MetaDataPrompt = "comments";\r', u'\r', u'string MetaDataFieldName = "Comments";\r', u'\r', u'string MetaDataType = "string";\r', u'\r', u'string MetaDataValue = "Energy Scope \xa3500";\r', u'']
>>>

I thought u'' stands for unicode, so how come Python gives me a list of unicode strings when I use 'iso-8859-1' or 'latin-1'? Don't we have to use utf-8 or utf-16 for that?

Or, to put it more simply, how do both of these work?

>>> a='string MetaDataValue = "Energy Scope \xa3500";\r'
>>> b='string MetaDataValue = "Energy Scope \xa3500";\r'
>>> a==b
True
>>> b=u'string MetaDataValue = "Energy Scope \xa3500";\r'
>>> a==b
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa3 in position 37: ordinal not in range(128)
>>> a.decode('latin-1')==b
True
>>> a.decode('iso-8859-1')==b
True
>>>

Finally, I want to know the best way to proceed in cases like this. I think I should find the encoding of the file and open the file using that.

How do I do that?

Thank you very much

···

Thomas Thomas
thomas@mindz-i.co.nz
Phone. +64 7 855 8478
Fax. +64 7 855 8871

Josiah_Carlson · August 3, 2006, 11:13pm

Indeed, u'' is a unicode string, but that is just an internal Python
representation, which happens to be UCS2 on narrow builds (similar, if
not identical, to UTF-16) and UCS4 on wide builds. When you write to
disk, you need to encode the 2/4-byte characters into an underlying
representation that is byte-oriented in nature.
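
To make that concrete, here is a minimal sketch (an illustration added here, with made-up variable names) that encodes the same unicode string with three codecs. It is written so that it also runs on Python 3, where u'' literals are still accepted. The unicode object is the same in each case; only the bytes that would go to disk differ.

```python
# The same text as in the file, as a unicode string (\xa3 is the pound sign).
text = u'Energy Scope \xa3500'

# Each codec maps the same characters to a different byte sequence.
latin1_bytes = text.encode('latin-1')     # one byte per character
utf8_bytes = text.encode('utf-8')         # the pound sign takes two bytes
utf16_bytes = text.encode('utf-16-le')    # two bytes per character here

print(len(text), len(latin1_bytes), len(utf8_bytes), len(utf16_bytes))

# Decoding with the matching codec gives back the same unicode string.
assert latin1_bytes.decode('latin-1') == text
assert utf8_bytes.decode('utf-8') == text
assert utf16_bytes.decode('utf-16-le') == text
```

This is why opening the file with 'latin-1' still yields u'' strings: the codec names the byte format on disk, not the type of the result.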

Note that iso-8859-1 is the same codec as latin-1.
u'...'.encode('latin-1') will succeed only if every character in the
unicode string has a code point in the range 0...255; anything beyond
that will raise a UnicodeEncodeError.
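
A short sketch of that limit, using a pound sign (inside the range) and a euro sign (outside it). The variable names are hypothetical and the snippet is written to run on Python 3 as well.

```python
pound = u'\xa3500'     # U+00A3, inside the 0..255 range
euro = u'\u20ac500'    # U+20AC, outside it

# The pound sign fits in a single latin-1 byte.
pound_bytes = pound.encode('latin-1')

# The euro sign has no latin-1 byte, so encoding raises UnicodeEncodeError.
try:
    euro.encode('latin-1')
    euro_failed = False
except UnicodeEncodeError:
    euro_failed = True

# UTF-8 can represent any code point, so the same string encodes fine there.
euro_utf8 = euro.encode('utf-8')

print(repr(pound_bytes), euro_failed, repr(euro_utf8))
```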

Ultimately, I believe you would be better off discovering the original
codec used to write to disk, and always using that. Likely one of the
codecs listed in the standard encodings package is the correct one:

http://docs.python.org/lib/standard-encodings.html
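
If the original codec is not documented anywhere, one rough approach is to try a short list of candidates and keep the first one that decodes without error. The sketch below is only a heuristic under stated assumptions: guess_encoding and its candidate list are hypothetical, not part of the standard library, and a successful decode does not prove the guess is right. latin-1 accepts every byte sequence, so it has to come last.

```python
def guess_encoding(raw, candidates=('ascii', 'utf-8', 'cp1252', 'latin-1')):
    """Return the first candidate codec that decodes raw without error.

    Hypothetical helper for illustration. Order matters: strict codecs go
    first, and latin-1 goes last because it never fails.
    """
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Bytes as they might appear in the file from this thread (0xa3 is the
# pound sign in both cp1252 and latin-1).
sample = b'string MetaDataValue = "Energy Scope \xa3500";\r\n'
guess = guess_encoding(sample)
print(guess)
print(repr(sample.decode(guess)))
```

Once a plausible codec is found, it is probably best to hard-code it when opening the file rather than guess on every read.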

Also, I don't know if others have been experiencing the same issue, but
your mail client seems to have issues producing plain text emails
(line endings, etc., seem to be garbled).

- Josiah

···

"Thomas Thomas" <thomas@mindz-i.co.nz> wrote:

Thanks Tim, this works fine. But I still can't understand why, because
the system says the default encoding is ascii and
inifile = codecs.open(filename, 'r', encoding='latin-1')
works fine as well, giving me the desired results:
text = inifile.read().split('\n')
>>> text
[u'string MetaDataPrompt = "Discovery No";\r', u'\r', u'string MetaDataFieldName = "Discovery No";\r', u'\r', u'string MetaDataType = "string";\r', u'\r', u'string MetaDataValue = "\xa3500";\r', u'\r', u'string MetaDataPrompt = "comments";\r', u'\r', u'string MetaDataFieldName = "Comments";\r', u'\r', u'string MetaDataType = "string";\r', u'\r', u'string MetaDataValue = "Energy Scope \xa3500";\r', u'']
>>>

I thought u'' stands for unicode, so how come Python gives me a list of
unicode strings when I use 'iso-8859-1' or 'latin-1'? Don't we have to
use utf-8 or utf-16 for that?

Henning_Hraban_Ramm · August 4, 2006, 6:08pm

PLEASE read "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" (Joel on Software)
or something similar.

Greetlings from Lake Constance!
Hraban

···

On 2006-08-04 at 00:38, Thomas Thomas wrote:

>Then you need to figure out what encoding the file is using.
How do I do this?
> inifile = codecs.open(filename, 'r', encoding='iso-8859-1')
Thanks Tim, this works fine. But I still can't understand why, because

---
http://www.fiee.net
http://www.cacert.org (I'm an assurer)