Unicode confusion

By sheer hit-or-miss I found this code:

         self.wd = self.wd.encode('utf8')
         unicodeVowels = u"[ae\xc3\xa8iouy]+"
         uniConsonants = u"[^ae\xc3\xa8iouy]+"
         uVowel = sre.compile(unicodeVowels)
         uCons = sre.compile(uniConsonants) # (gets used later on)
         firstvowel = uVowel.search(self.wd).start()
         for v in uVowel.finditer(self.wd):
             lastvowel = v.end() # replaced for each group, last sticks
  . . .

self.wd is a Unicode string, because it's been parsed out of a string returned by StyledTextCtrl.GetLine() in the Unicode version of wxPython.

As far as I can tell (I haven't tried it on Windows or tested it extensively) this works; it identifies both 'i' and 'è' as vowels in 'twinèd'. But damned if I can understand *why* it works.

I don't quite understand why I need the first line -- and I'm not sure whether I may be messing something else up elsewhere (should I assign the encoded version to a new variable, not reuse self.wd?).
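A minimal sketch of why the match probably succeeds, written in Python 3, so it emulates the Python 2 behaviour rather than reproducing it. The assumption is that Python 2's regex engine compares character codes, so each byte of the encoded string is treated as the code point with the same value. Decoding as Latin-1 mimics that. The variable names are illustrative only.

```python
import re

word = "twin\u00e8d"                 # 'twinèd' as a Unicode string
encoded = word.encode("utf-8")       # è (U+00E8) becomes the two bytes C3 A8

# The character class from the post holds U+00C3 and U+00A8, not U+00E8.
pattern = re.compile("[ae\xc3\xa8iouy]+")

# Assumption: emulate Python 2 matching a unicode pattern against a byte
# string by mapping every byte to the code point with the same value.
byte_view = encoded.decode("latin-1")

# Offsets here count bytes of the encoded form, not characters.
spans = [(m.start(), m.end()) for m in pattern.finditer(byte_view)]
print(spans)

# The same pattern does not see è in the original Unicode string.
unicode_spans = [(m.start(), m.end()) for m in pattern.finditer(word)]
print(unicode_spans)
```

If that assumption holds, the two escapes in the pattern happen to equal the two UTF-8 bytes of 'è', which is why the encode() call is needed for the match to work at all. It would also mean firstvowel and lastvowel are byte offsets into the encoded string, so they drift away from character positions after any non-ASCII letter. That is one reason not to overwrite self.wd with the encoded form.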

I also don't understand this disparity: in the WingIDE Python interpreter I can do this

  >>> unicodeVowels = u"[aeèiouy]+"
  >>> unicodeVowels

but if I try that first line in my *code* rather than in the interpreter, the debugger shows the value of 'unicodeVowels' as


Obviously I'm still very confused.

Charles Hartman
Professor of English, Poet in Residence
*the Scandroid* is available at: http://cherry.conncoll.edu/cohar/Programs

What if you replace those two lines (the encode() call and the unicodeVowels assignment) with:
  unicodeVowels = u"[ae\xe8iouy]+"
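A sketch of that suggestion in Python 3 syntax, where every str is already Unicode and the u prefix is optional. It assumes the word stays a Unicode string and is never encoded. The sample word is the one from the original post, and `re` stands in for the older `sre` name.

```python
import re

word = "twin\u00e8d"                          # 'twinèd', never encoded

# \xe8 is the single code point U+00E8 (è), so the class matches it directly.
vowels = re.compile("[ae\xe8iouy]+")
consonants = re.compile("[^ae\xe8iouy]+")     # complement class, as in the post

vowel_groups = [m.group() for m in vowels.finditer(word)]
consonant_groups = consonants.findall(word)

# Offsets are now character indices into the Unicode string itself.
first_vowel = vowels.search(word).start()
last_vowel = [m.end() for m in vowels.finditer(word)][-1]
```

With the pattern and the subject both in Unicode, the offsets line up with character positions in the word, so no separate encoded copy is needed.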

By the way, why do you test only for "\xe8" and not for other non-ASCII
vowels?
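One possible way to cover other accented vowels without listing each one, sketched in Python 3: decompose each character with `unicodedata.normalize("NFD", ...)` and test its base letter. The helper names `is_vowel` and `vowel_positions` and the choice of base vowels are assumptions for illustration, not part of the code in this thread.

```python
import unicodedata

BASE_VOWELS = set("aeiouy")

def is_vowel(ch):
    """Return True if the single character ch is a base vowel, accented or not."""
    # NFD splits 'è' into 'e' plus a combining grave accent (U+0300).
    decomposed = unicodedata.normalize("NFD", ch)
    return decomposed[0].lower() in BASE_VOWELS

def vowel_positions(word):
    """Return the character indices of all vowels in word."""
    return [i for i, ch in enumerate(word) if is_vowel(ch)]

print(vowel_positions("twin\u00e8d"))
```

Letters that have no canonical decomposition, such as 'ø', would not be recognised this way and would still need to be listed explicitly.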


On Tuesday 12 April 2005 at 09:21 -0400, Charles Hartman wrote:

By sheer hit-or-miss I found this code:

         self.wd = self.wd.encode('utf8')
         unicodeVowels = u"[ae\xc3\xa8iouy]+"