By sheer trial and error I arrived at this code:
self.wd = self.wd.encode('utf8')
unicodeVowels = u"[ae\xc3\xa8iouy]+"
uniConsonants = u"[^ae\xc3\xa8iouy]+"
uVowel = sre.compile(unicodeVowels)
uCons = sre.compile(uniConsonants) # (gets used later on)
firstvowel = uVowel.search(self.wd).start()
for v in uVowel.finditer(self.wd):
    lastvowel = v.end() # replaced for each group, last sticks
. . .
self.wd is a Unicode string, because it's been parsed out of a string returned by StyledTextCtrl.GetLine() in the Unicode version of wxPython.
As far as I can tell (I haven't tried it on Windows or tested it extensively) this works; it identifies both 'i' and 'è' as vowels in 'twinèd'. But damned if I can understand *why* it works.
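Here is a stripped-down version of what I *think* is going on. This is my own reconstruction, not the real code, run under plain Python 2 with re instead of sre (as far as I can tell they behave the same here):

import re

word = u'twin\xe8d'              # u'\xe8' is the real Unicode è
encoded = word.encode('utf8')    # in UTF-8, è becomes the two bytes \xc3 \xa8
pattern = re.compile(u"[ae\xc3\xa8iouy]+")  # the class my source file ends up with

for m in pattern.finditer(encoded):
    print m.start(), m.end(), repr(m.group())

which prints

2 3 'i'
4 6 '\xc3\xa8'

So the 'i' and the two UTF-8 bytes of the è each get matched as a "vowel" group, which would explain why the word seems to scan correctly -- though if I read it right, the positions are byte offsets into the encoded string, not character positions in the original Unicode word.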
I don't quite understand why I need that first line, and I'm not sure whether reusing self.wd for the encoded version is quietly breaking something else elsewhere -- should I assign the encoded string to a new variable instead of overwriting self.wd?
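What I have in mind is something along these lines (again just a sketch, with a made-up word standing in for the real self.wd):

import re

wd = u'twin\xe8d'                        # stands in for self.wd
uVowel = re.compile(u"[ae\xc3\xa8iouy]+")

wd_bytes = wd.encode('utf8')             # throwaway encoded copy, used only for matching
firstvowel = uVowel.search(wd_bytes).start()
# wd itself stays a Unicode string for everything else

-- but I don't know whether that's the right habit or just papering over the real problem.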
I also don't understand this disparity: in the WingIDE Python interpreter I can do this
>>> unicodeVowels = u"[aeèiouy]+"
>>> unicodeVowels
u'[ae\xc3\xa8iouy]+'
>>>
but if I put that same assignment into my *code* (the .py file) rather than typing it at the interpreter prompt, the debugger shows the value of 'unicodeVowels' as
u"[ae\x8fiouy]+"
Obviously I'm still very confused.
Charles Hartman
Professor of English, Poet in Residence
*the Scandroid* is available at: http://cherry.conncoll.edu/cohar/Programs
http://villex.blogspot.com