How can I select a Unicode text range in a stc.StyledTextCtrl widget?

znsoooo · March 24, 2024, 3:52am

My environment is Python 3.8 and wxPython 4.2.1, but other versions behave the same way.

Here is the demo code; the comments describe the problem:

import wx
import wx.stc as stc


class MyTextCtrl(stc.StyledTextCtrl):
    def __init__(self, parent):
        stc.StyledTextCtrl.__init__(self, parent)

    def SetUnicodeSelection(self, p1, p2):
        # User-defined helper: convert a range of Unicode character indices
        # into the byte range that SetSelection() expects.
        text = self.GetValue()
        p1, p2 = (len(text[:p].encode()) for p in (p1, p2))  # character index -> byte index
        self.SetSelection(p1, p2)


if __name__ == '__main__':
    app = wx.App()

    frame = wx.Frame(None, -1, 'Test Unicode Selection')

    text = MyTextCtrl(frame)

    # Initialize with some Unicode text
    text.SetValue(
        '有日月朝暮悬，有鬼神掌着生死权。\n'
        '天地也，只合把清浊分辨，可怎生糊突了盗蹠颜渊？\n'
        '为善的受贫穷更命短，造恶的享富贵又寿延！\n'
        '天地也，做得个怕硬欺软，却原来也这般顺水推船。\n'
        '地也，你不分好歹何为地？天也，你错勘贤愚枉做天！\n'
        '哎，只落得两泪涟涟。\n'
    )

    # The text "，有鬼神掌着生死权" spans characters 6 to 15 of the demo text.
    # If I select the range 6 to 15 directly, the wrong text is selected.
    # How can I select this text with a built-in function instead of my user-defined one?

    # text.SetSelection(6, 15)       # <- this selects the wrong range
    text.SetUnicodeSelection(6, 15)  # <- the user-defined function above selects the correct range

    frame.Center()
    frame.Show()

    app.MainLoop()

znsoooo · April 4, 2024, 11:42pm

The method I’m using sets the correct selection range, but it can cause performance problems in some cases.
For example, when the text is very long (e.g. 100K) and many selections (e.g. 3000) need to be set, I have to encode a slice of the long text for every position and take the length of the result, which takes a very long time.
Therefore, I’m wondering whether there is a better built-in function for this.
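
One way to avoid re-encoding the text for every selection is to build a table of byte offsets once and look positions up in it. The sketch below is only an illustration: it assumes the control stores its text as UTF-8 (which the conversion in the demo above also relies on), and byte_offsets is a hypothetical helper name, not a wx.stc function.

```python
from itertools import accumulate


def byte_offsets(text):
    """Return a list whose entry i is the UTF-8 byte offset of character i.

    The list has len(text) + 1 entries, so the end of the text can be
    looked up as well.
    """
    sizes = (len(ch.encode('utf-8')) for ch in text)
    return [0] + list(accumulate(sizes))


text = '有日月朝暮悬，有鬼神掌着生死权。\n'
offsets = byte_offsets(text)          # built once: O(n)
start, end = offsets[6], offsets[15]  # each lookup afterwards: O(1)
print(start, end)  # 18 45
```

The resulting pair could then be passed to SetSelection(start, end). The table has to be rebuilt whenever the text changes.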

znsoooo · April 19, 2024, 10:13pm

Can anyone help me?

RichardT · April 20, 2024, 7:55am

In your real application, what is the process that determines which characters should be selected?

For example, does it search for a particular substring?

Or does it actually select the characters based on their apparent position in the STC (as in your example)?

znsoooo · April 21, 2024, 1:56am

In my real application, I enter a regular expression in a text box, and the application highlights all matches in the result text.

If I use my method to calculate the selection ranges, the string has to be sliced and encoded again for every match position, which takes a lot of time; the complexity of the algorithm is O(n^2).

So I was wondering whether there is a built-in function that does this job more directly.

Here is demo code that can be run directly:

import re
import wx
import wx.stc as stc

class MyTextCtrl(stc.StyledTextCtrl):
    def __init__(self, parent):
        stc.StyledTextCtrl.__init__(self, parent)
        self.StyleSetSpec(1, 'back:#FFFF00')

    def SetUnicodeHighlights(self, spans):
        text = self.GetValue()
        self.StartStyling(0)
        self.SetStyling(len(text.encode()), 0)
        for p1, p2 in spans:
            p1, p2 = (len(text[:p].encode()) for p in (p1, p2))
            self.StartStyling(p1)
            self.SetStyling(p2 - p1, 1)

    def StartStyling(self, start):
        try:
            super().StartStyling(start)
        except TypeError:  # older wxPython versions require a mask argument
            super().StartStyling(start, 0xFFFF)

class MyPanel(wx.Panel):
    def __init__(self, parent):
        wx.Panel.__init__(self, parent)

        self.tc1 = wx.TextCtrl(self, -1, '玻璃')
        self.tc2 = MyTextCtrl(self)
        self.tc2.SetValue('我可以吞下玻璃而不伤身体\n' * 16000)

        box = wx.BoxSizer(wx.VERTICAL)
        box.Add(self.tc1, 0, wx.EXPAND | wx.ALL, 3)
        box.Add(self.tc2, 1, wx.EXPAND | wx.ALL, 3)
        self.SetSizer(box)

        self.tc1.Bind(wx.EVT_TEXT, self.OnText)

        wx.CallLater(300, self.OnText, -1)

    def OnText(self, evt):
        find = self.tc1.GetValue()
        if find:
            text = self.tc2.GetValue()
            spans = [m.span() for m in re.finditer(find, text)]
            self.tc2.SetUnicodeHighlights(spans)

if __name__ == '__main__':
    app = wx.App()
    frame = wx.Frame(None, -1, 'MyTextCtrl', size=(400, 800))
    MyPanel(frame)
    frame.Center()
    frame.Show()
    app.MainLoop()

I initialized self.tc2 with a very long text of 8000 lines (you can increase or decrease this number depending on your computer’s performance).

Then enter the search text in self.tc1 (e.g. "玻璃" or "身体"), and the application will highlight all matches.

In my case there are 8000 matches, and highlighting them takes about 1 second, which is too long.

If you increase the number of lines, the running time grows with its square, because the complexity of the algorithm is O(n^2).

RichardT · April 21, 2024, 7:33am

Have you tried using the FindText() method?

The simple example below appears to select the correct unicode characters:

import wx
import wx.stc as stc


if __name__ == '__main__':

    app = wx.App()
    frame = wx.Frame(None, -1, 'Test Unicode Selection')
    text = stc.StyledTextCtrl(frame)

    # Initialize with some Unicode text
    text.SetValue(
        '有日月朝暮悬，有鬼神掌着生死权。\n'
        '天地也，只合把清浊分辨，可怎生糊突了盗蹠颜渊？\n'
        '为善的受贫穷更命短，造恶的享富贵又寿延！\n'
        '天地也，做得个怕硬欺软，却原来也这般顺水推船。\n'
        '地也，你不分好歹何为地？天也，你错勘贤愚枉做天！\n'
        '哎，只落得两泪涟涟。\n'
    )

    last = text.GetLastPosition()
    start, end = text.FindText(0, last, "，有鬼神掌着生死权")
    text.SetSelection(start, end)

    frame.Center()
    frame.Show()

    app.MainLoop()

To search for a regular expression you would need to pass flags=stc.STC_FIND_REGEXP to FindText().

To search for all the matches in the text you would need to loop around the FindText() method, each time setting the minPos parameter to the end value from the previous call.

However, I don’t know if this would actually be quicker than what you are currently doing.

RichardT · April 21, 2024, 10:01am

Here is your second example, modified to use FindText():

import wx
import wx.stc as stc
from time import time

class MyTextCtrl(stc.StyledTextCtrl):
    def __init__(self, parent):
        stc.StyledTextCtrl.__init__(self, parent)
        self.StyleSetSpec(1, 'back:#FFFF00')

    def SetUnicodeHighlights(self, text):
        first = 0
        last = self.GetLastPosition()
        self.StartStyling(first)
        self.SetStyling(last, 0)

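        while first < last:
            start, end = self.FindText(first, last, text, flags=stc.STC_FIND_REGEXP)
            if start == -1 or end == -1:
                break
            self.StartStyling(start)
            self.SetStyling(end - start, 1)
            # Step past an empty match so the search always moves forward.
            first = end if end > start else self.PositionAfter(end)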

    def StartStyling(self, start):
        try:
            super().StartStyling(start)
        except TypeError:  # older wxPython versions require a mask argument
            super().StartStyling(start, 0xFFFF)

class MyPanel(wx.Panel):
    def __init__(self, parent):
        wx.Panel.__init__(self, parent)

        self.tc1 = wx.TextCtrl(self, -1, '玻璃')
        self.tc2 = MyTextCtrl(self)
        self.tc2.SetValue('我可以吞下玻璃而不伤身体\n' * 16000)

        box = wx.BoxSizer(wx.VERTICAL)
        box.Add(self.tc1, 0, wx.EXPAND | wx.ALL, 3)
        box.Add(self.tc2, 1, wx.EXPAND | wx.ALL, 3)
        self.SetSizer(box)

        self.tc1.Bind(wx.EVT_TEXT, self.OnText)

        wx.CallLater(300, self.OnText, -1)

    def OnText(self, evt):
        find = self.tc1.GetValue()
        if find:
            t1 = time()
            self.tc2.SetUnicodeHighlights(find)
            t2 = time()
            print(t2-t1)

if __name__ == '__main__':
    app = wx.App()
    frame = wx.Frame(None, -1, 'MyTextCtrl', size=(400, 800))
    MyPanel(frame)
    frame.Center()
    frame.Show()
    app.MainLoop()

Check that it highlights the correct Unicode characters.

On my old Linux PC, your second example took 6.9 seconds to highlight the search text.
This version using FindText() took 0.1 seconds.
However, I have not tested it with an actual regular expression.

znsoooo · April 21, 2024, 12:47pm

Interesting solution! I learned something new, thank you very much!

However, in my tests FindText() supports only a subset of regular expression syntax, not full regular expressions (e.g. “\w{3}” does not work).

So it still doesn’t work for my application. I prefer to calculate the Unicode string ranges in my own function and then highlight them in wx.stc.StyledTextCtrl.

komoto48g · April 21, 2024, 11:56am

You can generate the byte positions of a pattern’s matches within TextRaw as follows:

    def grep(self, pattern, flags=re.M):
        # Search the raw UTF-8 bytes, so the match positions are byte positions.
        yield from re.finditer(pattern.encode(), self.TextRaw, flags)

RichardT · April 21, 2024, 12:29pm

The only other idea I had was to use the PositionAfter() method which takes unicode characters into account:

import re
import wx
import wx.stc as stc
from time import time

class MyTextCtrl(stc.StyledTextCtrl):
    def __init__(self, parent):
        stc.StyledTextCtrl.__init__(self, parent)
        self.StyleSetSpec(1, 'back:#FFFF00')

    def SetUnicodeHighlights(self, spans):
        last = self.GetLastPosition()
        self.StartStyling(0)
        self.SetStyling(last, 0)  # clear any previous highlighting
        if not spans:
            return  # no matches, so nothing to highlight

        num_spans = len(spans)
        s = 0  # index of the span currently being located
        b = 0  # byte position of character index i
        start = 0
        end = 0
        p1, p2 = spans[s]

        for i in range(last+1):
            if i == p1:
                start = b
            elif i == p2:
                end = b
                self.StartStyling(start)
                self.SetStyling(end - start, 1)
                s += 1
                if s >= num_spans:
                    break
                p1, p2 = spans[s]
            b = self.PositionAfter(b)

    def StartStyling(self, start):
        try:
            super().StartStyling(start)
        except TypeError:  # older wxPython versions require a mask argument
            super().StartStyling(start, 0xFFFF)

class MyPanel(wx.Panel):
    def __init__(self, parent):
        wx.Panel.__init__(self, parent)

        self.tc1 = wx.TextCtrl(self, -1, '玻璃')
        self.tc2 = MyTextCtrl(self)
        self.tc2.SetValue('我可以吞下玻璃而不伤身体\n' * 16000)

        box = wx.BoxSizer(wx.VERTICAL)
        box.Add(self.tc1, 0, wx.EXPAND | wx.ALL, 3)
        box.Add(self.tc2, 1, wx.EXPAND | wx.ALL, 3)
        self.SetSizer(box)

        self.tc1.Bind(wx.EVT_TEXT, self.OnText)

        wx.CallLater(300, self.OnText, -1)

    def OnText(self, evt):
        find = self.tc1.GetValue()
        if find:
            text = self.tc2.GetValue()
            spans = [m.span() for m in re.finditer(find, text)]
            t1 = time()
            self.tc2.SetUnicodeHighlights(spans)
            t2 = time()
            print(t2-t1)

if __name__ == '__main__':
    app = wx.App()
    frame = wx.Frame(None, -1, 'MyTextCtrl', size=(400, 800))
    MyPanel(frame)
    frame.Center()
    frame.Show()
    app.MainLoop()

The SetUnicodeHighlights() call in this version takes 0.16 seconds on my Linux PC.

znsoooo · April 21, 2024, 12:44pm

If the regex pattern and the string are both converted to bytes, the results are not equivalent to matching on str:

>>> re.findall('我..', '我可以吞下玻璃而不伤身体')[0]
'我可以'  # <- This is expected
>>> '我可以'.encode()
b'\xe6\x88\x91\xe5\x8f\xaf\xe4\xbb\xa5'  # <- This is expected
>>> re.findall('我..'.encode(), '我可以吞下玻璃而不伤身体'.encode())[0]
b'\xe6\x88\x91\xe5\x8f'  # <- This is not expected
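
A small sketch of the difference, and of mapping a str match to byte positions afterwards. It only uses Python’s re module and assumes UTF-8 encoding; the variable names are illustrative.

```python
import re

text = '我可以吞下玻璃而不伤身体'
raw = text.encode('utf-8')

# In a bytes pattern, '.' matches a single byte, so the match covers
# 3 + 1 + 1 bytes and ends in the middle of the second character.
bytes_match = re.search('我..'.encode('utf-8'), raw)
print(bytes_match.end() - bytes_match.start())  # 5

# Matching on str and converting the span afterwards keeps whole characters.
str_match = re.search('我..', text)
start, end = (len(text[:p].encode('utf-8')) for p in str_match.span())
print(raw[start:end].decode('utf-8'))  # 我可以
```

Patterns that contain no character-counting constructs such as '.', '\w' or '{n}' behave the same either way, which is why the bytes search works for plain substrings.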

znsoooo · April 21, 2024, 2:13pm

Thank you! It works, and I found that it is faster without the self.PositionAfter call.

I tested on a string of 200000 lines; the times were 1.23 seconds vs. 0.92 seconds.

class MyTextCtrl(stc.StyledTextCtrl):
    ...

    def SetUnicodeHighlights(self, spans):
        ...

        for i, c in enumerate(self.GetValue()):  # <- change here
            if i == p1:
                start = b
            elif i == p2:
                end = b
                self.StartStyling(start)
                self.SetStyling(end - start, 1)
                s += 1
                if s >= num_spans:
                    break
                p1, p2 = spans[s]
            b += len(c.encode())  # <- change here

There is still a small bug. I’m not asking you to fix it for me; I’m explaining why I’m hoping for a built-in function rather than a user-defined one.

Sometimes I need to highlight multiple groups in one match, so the spans are not always in increasing order:

import re
pattern = r'((\w+)@((\w+)\.(\w+)))'
string = 'znsoooo@example.com'
spans = [m.regs for m in re.finditer(pattern, string)]
print(spans)  # [((0, 19), (0, 19), (0, 7), (8, 19), (8, 15), (16, 19))]

The method above walks through the Unicode characters one by one to work out which span has been reached. If the spans are not in increasing order, the calculation goes wrong.
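
One possible way around the ordering problem, sketched under the assumption that the text is UTF-8 encoded: collect every position that occurs in any span, sort them once, and convert them in a single pass. to_byte_spans is a hypothetical helper, not part of wx.stc, and the sample string is made up for illustration.

```python
import re


def to_byte_spans(text, spans):
    """Convert (start, end) character spans to UTF-8 byte spans.

    The spans may overlap or come in any order. Every position must be a
    valid index into text (no (-1, -1) spans from unmatched groups).
    """
    positions = sorted({p for span in spans for p in span})
    byte_pos = {}
    prev_char = prev_byte = 0
    for p in positions:
        # Encode only the slice since the previous position, so the text
        # is encoded at most once in total.
        prev_byte += len(text[prev_char:p].encode('utf-8'))
        prev_char = p
        byte_pos[p] = prev_byte
    return [(byte_pos[a], byte_pos[b]) for a, b in spans]


string = '邮箱：znsoooo@example.com'
pattern = r'(\w+)@((\w+)\.(\w+))'
# Spans of group 0 and every capture group, in group order (not sorted).
spans = [m.span(g)
         for m in re.finditer(pattern, string)
         for g in range(m.re.groups + 1)]
print(to_byte_spans(string, spans))
```

Each returned pair could then be passed to StartStyling() and SetStyling() as in the examples above.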

komoto48g · April 24, 2024, 2:54pm

You’re right. I overlooked that case.
How about creating a list every time you search for text:

>>> ls = sorted(set(self.PositionBefore(i) for i in range(self.TextLength)))

which maps Python string indices to byte positions?

znsoooo · April 26, 2024, 12:52pm

The code works, but it takes 3 times as long as the previous example.

Thank you anyway.

komoto48g · April 26, 2024, 1:38pm

Thank you for the benchmark!