StyledTextCtrl SetStyling character count

When I use the StartStyling() or SetStyling() functions, the result is wrong for characters other than basic Latin (a.k.a. ASCII). In the following output, produced by the example code below, the first line is OK (each word has a separate style); in the second line I replaced the second word with a random Cyrillic word, and the styles got messed up:

[Screenshot: output of the example code, three lines of styled text]

By the look of it, SetStyling()'s length parameter behaves as if it were the number of bytes, not characters, of a UTF-8 string: for Latin characters the two are equal, for others they are not. If I use the number of bytes instead of the number of characters, it starts working (see the third line of the output above), but that doesn’t feel like a proper solution. Is there a better way to handle this, for example by somehow telling the control to treat UTF-8 as UTF-8 everywhere? (StartStyling() is only called with position 0 in the example, but its position argument behaves the same way.)

import wx
import wx.stc as stc

class MainWindow(wx.Frame):
    def __init__(self, parent, id_, title, size):
        wx.Frame.__init__(self, parent, id_, title, size=size)

        self.content = stc.StyledTextCtrl(self)

        self.sizer = wx.BoxSizer(wx.VERTICAL)
        self.sizer.Add(self.content, 1, wx.EXPAND)
        self.SetSizer(self.sizer)

        self.content.StyleSetSpec(stc.STC_STYLE_DEFAULT, "size:14,fore:#0000FF")
        self.content.StyleClearAll()

        self.content.StyleSetSpec(1, "size:14,bold,fore:#777777")
        self.content.StyleSetSpec(2, "size:11,italic,fore:#478F0B")
        self.content.StyleSetSpec(3, "size:14,bold,fore:#AA0000")
        self.content.StyleSetSpec(4, "size:14,italic,bold,fore:#0088FF")
        self.content.StyleSetSpec(5, "size:14,italic,bold,fore:#000000")
        
        text1 = "Lorem ipsum dolor sit amet consectetur adipiscing elit"
        text2 = "Lorem земля dolor sit amet consectetur adipiscing elit"

        # depending on the line used, the text is styled correctly or not
        text = text2
        self.content.SetText(text)

        self.content.StartStyling(0)
        
        for i, word in enumerate(text.split()):
            # number of characters + trailing space
            length = len(word) + 1

            # the workaround:
            # number of bytes + trailing space
            #length = len(word.encode()) + 1
            
            self.content.SetStyling(length, i % 6)

if __name__ == '__main__':
    app = wx.App(False)
    frame = MainWindow(None, -1, "Example", size=wx.Size(900, 500))
    frame.Show(True)
    app.MainLoop()

I can confirm I see the same problem. It’s not immediately obvious what the cause is, because Scintilla (which wxStyledTextCtrl uses) appears to be configured to use UTF-8.
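A quick check outside of wx shows the size of the mismatch for the words in the example. This is only a sketch, and it assumes (as the original post suspects) that the control counts UTF-8 bytes; the name counts is just for illustration:

```python
# Compare character counts with UTF-8 byte counts for two words of the example.
words = ("ipsum", "земля")

# Maps each word to (number of characters, number of UTF-8 bytes).
counts = {w: (len(w), len(w.encode("utf-8"))) for w in words}

for w, (n_chars, n_bytes) in counts.items():
    print(f"{w}: {n_chars} characters, {n_bytes} bytes")
```

The two numbers agree for the Latin word and differ for the Cyrillic one, which matches the point at which the styles drift in the second line of the output.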

There was a somewhat similar thread last year which involved selecting, rather than styling, Unicode text in an STC. See: How can I select unicode text range in stc.StyledTextCtrl widgets?

I have reviewed the thread from your link and saw your suggestion of using the PositionAfter() method, but I didn’t quite understand it. If I use that, do I have to call it once for every character in each word I want to style (i.e. iterate over every single character to reach the desired final position)?

While checking the documentation on that, I found a PositionRelative() method that works for me. The final loop in the __init__() method of my code above should then look something like this:

position = 0
for i, word in enumerate(text.split()):
    position2 = self.content.PositionRelative(position, len(word) + 1)
    length = position2 - position
    
    self.content.SetStyling(length, i % 6)
    position = position2

If StyledTextCtrl really does distinguish between UTF-8 characters and their byte representation, and that distinction has to be taken into account, this is probably the best solution I can hope for.
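For comparison, the same lengths can be computed in plain Python without asking the control for positions. This is a minimal sketch, assuming the control stores the text as UTF-8 and counts bytes, and that words are separated by single spaces; style_runs is a hypothetical helper name, not part of wx.stc:

```python
def style_runs(text, n_styles=6):
    """Return (byte_length, style) pairs, one per space-separated word.

    Assumes UTF-8 byte counting and single spaces between words.
    """
    words = text.split()
    runs = []
    for i, word in enumerate(words):
        length = len(word.encode("utf-8"))
        # Every word except the last is followed by one space (one byte).
        if i < len(words) - 1:
            length += 1
        runs.append((length, i % n_styles))
    return runs


runs = style_runs("Lorem земля dolor")
print(runs)
```

Each pair could then be passed to SetStyling(length, style) in order, in place of the PositionRelative() calls. The PositionRelative() version has the advantage of not relying on any assumption about how the control encodes the text.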

Yes, in that suggestion I was trying to create a method that could select Unicode text between two arbitrary positions (as would be returned by a search operation, for example).

In your example, because you are iterating over the whole text, you can use the end point of the previous step as the start point of the next step, which is more efficient than my suggested code.
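For the arbitrary-position case, one way to get from Python character indices to the control's positions is to encode the relevant slices. This is only a sketch under the same assumption that positions are UTF-8 byte offsets; char_range_to_bytes is a hypothetical helper, not an STC method:

```python
def char_range_to_bytes(text, start, end):
    """Convert character indices [start, end) to (byte_start, byte_length).

    Assumes the text is stored as UTF-8.
    """
    byte_start = len(text[:start].encode("utf-8"))
    byte_length = len(text[start:end].encode("utf-8"))
    return byte_start, byte_length


text = "Lorem земля dolor"
start = text.index("dolor")  # character index, as a search would return
print(char_range_to_bytes(text, start, start + len("dolor")))
```

The two values would then go to StartStyling() and SetStyling() respectively. Note that this re-encodes the text before the start index on every call, so for styling a whole document in order, carrying the previous end position forward is still cheaper.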