Unicode files with StyledTextCtrl

letij · June 20, 2008, 10:49am

Hello!

How can I display Unicode-formatted files properly in a StyledTextCtrl?

self.SetCodePage( wx.stc.STC_CP_UTF8 )
self.LoadFile( path )

this displays the characters "ï»¿" (UTF-8 BOM) at the beginning.
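
For reference, here is one way those three characters can arise. This is a minimal Python 3 sketch (the snippets in this thread are Python 2 era), and it assumes the BOM bytes end up being interpreted one byte per character, as in Latin-1, instead of as UTF-8. The variable names are only illustrative.

```python
import codecs

# On disk, the UTF-8 BOM is the three bytes EF BB BF.
bom_bytes = codecs.BOM_UTF8

# Decoded as UTF-8, those bytes form one invisible character, U+FEFF.
as_utf8 = bom_bytes.decode("utf-8")

# Decoded as Latin-1 (one byte per character), they become three visible characters.
as_latin1 = bom_bytes.decode("latin-1")

print(len(bom_bytes), repr(as_utf8), as_latin1)
```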

file = open( path, 'r' )
self.SetTextUTF8( file.read() )
file.close()

this displays something like an ' as the very first character, which causes
the file to be empty on self.SaveFile( path ).

letij · June 25, 2008, 12:52pm

BTW, is the file saved as Unicode or is it utf-8? They are not the same
thing and if you try to get the STC to load unicode as if it was utf-8
then you'll have lots of problems.

It's UTF-8.
If I remove the BOM manually and save the file via self.SaveFile( path ),
Notepad++ tells me that the file is ANSI.
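
One possible explanation for the "ANSI" label, offered as an assumption about how editors guess encodings and not as a statement about Notepad++ internals: text that contains only ASCII characters produces exactly the same bytes in ASCII and in UTF-8, so without a BOM there is nothing in the file that marks it as UTF-8. A small Python 3 sketch with made-up sample strings:

```python
import codecs

ascii_only = "plain text"       # sample data, not from the thread
non_ascii = "M\u00fcller"       # contains one non-ASCII character

# ASCII-only text: the UTF-8 bytes and the ASCII bytes are identical.
utf8_bytes = ascii_only.encode("utf-8")
ascii_bytes = ascii_only.encode("ascii")

# Only the BOM prefix distinguishes the "UTF-8 with BOM" version.
utf8_with_bom = codecs.BOM_UTF8 + utf8_bytes

# Non-ASCII text does differ: the umlaut takes two bytes in UTF-8.
non_ascii_bytes = non_ascii.encode("utf-8")
```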

Josiah_Carlson1 · June 25, 2008, 1:34pm

You have to manually strip the BOM from the file during load and
re-add it on save.

- Josiah
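
A rough sketch of that strip-on-load, re-add-on-save idea, working on the raw bytes of the file. It is written for Python 3 and only covers the UTF-8 BOM; `strip_utf8_bom` and `add_utf8_bom` are hypothetical helper names, and the decoded text is what would be handed to the control's SetText().

```python
import codecs

def strip_utf8_bom(data):
    """Return (payload, had_bom) for raw bytes read from a file."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):], True
    return data, False

def add_utf8_bom(data, had_bom=True):
    """Prepend the BOM again if the original file had one."""
    return codecs.BOM_UTF8 + data if had_bom else data

# Simulated file content: BOM followed by UTF-8 encoded text.
raw = codecs.BOM_UTF8 + "gr\u00fc\u00df dich".encode("utf-8")

payload, had_bom = strip_utf8_bom(raw)
text = payload.decode("utf-8")                        # would go to SetText()
saved = add_utf8_bom(text.encode("utf-8"), had_bom)   # bytes to write back
```

Remembering `had_bom` keeps files that never had a BOM from gaining one on save.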

letij · June 25, 2008, 1:41pm

Thanks for the help! It appears that the standard load and save methods of
StyledTextCtrl are buggy.

This is my solution:

import codecs

...

def Load( self, path ):
  f = codecs.open( path, 'r', 'UTF-8' )
  first = f.read( 1 )
  # Drop a leading BOM if present.
  if first == unicode( codecs.BOM_UTF8, 'utf8' ):
    first = u''
  # SetText replaces whatever was in the control before.
  self.SetText( first + f.read() )
  f.close()

def Save( self, path ):
  f = codecs.open( path, 'w', 'UTF-8' )
  # Write the BOM first so editors recognise the file as UTF-8.
  f.write( unicode( codecs.BOM_UTF8, 'utf8' ) )
  f.write( self.GetText() )
  f.close()

see also:
http://evanjones.ca/python-utf8.html
http://www.daniweb.com/forums/thread110242.html
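
An alternative to handling the BOM by hand, sketched in Python 3 and assuming the files are always UTF-8: Python's `utf-8-sig` codec skips a leading BOM when reading and writes one when writing. `load_text` and `save_text` are hypothetical helper names; the string returned by `load_text` would be passed to SetText(), and `save_text` would receive the result of GetText().

```python
import os
import tempfile

def load_text(path):
    # "utf-8-sig" drops a leading BOM if there is one.
    with open(path, "r", encoding="utf-8-sig") as f:
        return f.read()

def save_text(path, text):
    # "utf-8-sig" writes the BOM before the first character.
    with open(path, "w", encoding="utf-8-sig") as f:
        f.write(text)

# Round trip through a temporary file.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    save_text(path, "caf\u00e9")
    with open(path, "rb") as f:
        raw = f.read()
    loaded = load_text(path)
finally:
    os.remove(path)
```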

Josiah_Carlson1 · June 25, 2008, 1:49pm

The standard load/save method is not buggy, it works great; I've been
using it to load/save unicode text in utf-8, utf-16 (and on platforms
that support it, utf-32) for years, never mind the various latin-*
encodings. It's just a matter of understanding what's going on.

In the case of Python, one character does not a utf-8 BOM make. It's
3 characters for utf-8. And if you really want to support unicode,
you should check for all BOM markers, decode your data based on what
is what, and use the SetText() with your decoded unicode (you don't
need to pass utf-8 encoded data to the STC, giving it an instance of
the actual Python unicode type works just fine).

- Josiah
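
A minimal sketch of the "check for all BOM markers" approach, in Python 3. The helper names and the UTF-8 fallback for files without a BOM are assumptions, not something from this thread. Note that the UTF-8 BOM is three bytes on disk but a single character (U+FEFF) once decoded, so the check here is done on the raw bytes before decoding.

```python
import codecs

# Longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM bytes.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def decode_with_bom(data, default="utf-8"):
    """Return (text, encoding, bom) for raw bytes; bom is b"" if none was found."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding), encoding, bom
    # No BOM: fall back to the assumed default encoding.
    return data.decode(default), default, b""

def encode_with_bom(text, encoding, bom):
    """Re-create the file bytes with the same encoding and BOM as on load."""
    return bom + text.encode(encoding)

raw = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
text, encoding, bom = decode_with_bom(raw)      # text would go to SetText()
round_trip = encode_with_bom(text, encoding, bom)
```

Keeping the detected encoding and BOM around lets the save path write the file back in the form it was loaded in.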
