String with unicode 'digit-4' code

Jake_Larrimore · July 12, 2019, 6:46pm

Hey Group–
I am being passed a string that may contain Unicode in its 4-digit escape format, e.g. u'\u4eb0'. I’m asked to be able to display said string in a Richtext and need to decode the Unicode to display as well. Is there a simple way to decode a string that may (or may not) contain Unicode so that when I pass it into richtext.WriteText(text), it knows to display the Unicode, without parsing the string to do it manually?

I attached an example of what I’m dealing with. I can get it to display the Unicode if I pass it in alone:

richtext.WriteText(u'\u4eb0')

but if it is in the middle of a string it doesn’t recognize it as Unicode. I’d rather not have to parse the whole thing to search for Unicode; I know there must be a simpler way.

Thanks,
Jake

Environment:
Python 2.7/ 32 bit
wxPython 2.8 (unicode version)

richTextCtrl.py (2.48 KB)

Werner2 · July 12, 2019, 8:38pm

Hi Jake,

Is the string really coming in as you show in the code? It would be a bit odd to have "u'\u4eb0'" in the middle of some string you get.

If it is defined as u'' it works for me, e.g.:

     testScript = u" Hello world! \u4eb0. Hello, again \n" # works
     panel.ed.WriteText(testScript)
     t2 = " Hello world! u'\u4eb0'. Hello, again \n" # does not work
     panel.WriteText(t2)
     panel.WriteText( u'\u4eb0') # works

Werner

···

On 9/19/2014 18:51, Jake Larrimore wrote:

Tim_Roberts · July 12, 2019, 8:38pm

The difference here is not between "passing it in alone" and "in the
middle". The difference is in the datatypes. If you change your
example from this:
testScript = " Hello world! u'\u4eb0'. Hello, again \n"
to this:
testScript = u" Hello world! \u4eb0. Hello, again \n"
it works fine. Can you see the difference?

The u"xxx" and u'xxx' forms are part of Python syntax for string
constants, and are handled at compile time (for some value of
"compile"). It's not part of run-time string handling. The first case
is creating an 8-bit string. In an 8-bit string, the interpreter looks
for things like \x00 and converts it to a single byte. The second case
is creating a Unicode string. In a Unicode string, the interpreter
looks for things like \u4eb0 and converts it to a single character.
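
A small sketch of the distinction described above. It is written so that it should behave the same on Python 2.7 and Python 3. The `b"..."` literal with a doubled backslash stands in for what a plain `"..."` literal containing `\u4eb0` produces on Python 2, and the variable names are made up for illustration.

```python
# In a unicode literal, the parser turns \u4eb0 into ONE character.
as_unicode = u"Hello \u4eb0"

# In an 8-bit (byte) string, \u is not an escape, so the backslash, the "u"
# and the four hex digits are stored as six separate bytes. The doubled
# backslash spells that out explicitly and portably.
as_bytes = b"Hello \\u4eb0"

print(len(as_unicode))  # 7: "Hello " is 6 characters, plus 1 for U+4EB0
print(len(as_bytes))    # 12: "Hello " is 6 bytes, plus 6 for \ u 4 e b 0
```

Both objects come from near-identical source text, but only the first one actually contains the character that the rich text control can display.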

···

--
Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.

Jake_Larrimore · July 12, 2019, 8:38pm

Tim and Werner:
I see now. Thanks for the quick reply. Tim, your explanation was spot on and very helpful. I completely understand now.

Best regards,
Jake

Jake_Larrimore · July 12, 2019, 8:38pm

Sorry all, it looks like I spoke too soon…
I’m getting the following:
testScript = u"Hello world! \u4eb0. Hello, again \n"  # works fine
However:
testScript = "Hello world! \u4eb0. Hello, again \n"
testUnicode = unicode(testScript, "utf-8")
richText.WriteText(testUnicode)  # doesn't work?

Even though type(testUnicode) = <type 'unicode'>

What am I missing here?
Thanks again,
Jake

richTextCtrl.py (1.67 KB)

Tim_Roberts · July 12, 2019, 8:38pm

It’s because the \u syntax is not handled by the string module. It
is handled by the interpreter when it parses a Unicode string
CONSTANT, not when it converts to Unicode. When I do this:

    sss = "abc\n\x33"

I am creating a string that contains 5 characters, with the values
0x61 0x62 0x63 0x0A 0x33. The string contains no backslashes, nor
does it contain the letter "x". Similarly, when I do this:

    uuu = u"abc\n\u4eb0"

I am creating a string that contains 5 characters: 0x0061 0x0062
0x0063 0x000A 0x4EB0. Again, the string contains no backslashes,
nor does it contain the letter "u".

But when I say this:

    sss = "abc\n\u4eb0"

I have created a 10-character string. It starts with 61 62 63 0A,
but then it actually contains a backslash and a "u". Those are
valid ASCII characters, and they are valid Unicode characters. So,
when you convert that to Unicode, it happily converts the backslash
and the "u4eb0" to their Unicode equivalents.

If you are receiving 8-bit strings that contain these Unicode
escapes, then you are going to have to parse it by hand after you
convert it to Unicode. If you need to embed Unicode code points in
an 8-bit string, then you need to check into using UTF-8. The UTF-8
for U+4EB0 is E4 BA B0. So, you could say this:

    testScript = "Hello world! \xe4\xba\xb0. Hello, again \n"
    testUnicode = testScript.decode('utf-8')
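
The counts above can be checked with a short sketch. It is meant to run on both Python 2.7 and Python 3, so the 8-bit strings are written as `b"..."` literals with the backslash doubled where it has to survive as a literal character. The variable names are illustrative only.

```python
# The escape was never interpreted: 4 ordinary bytes plus \ u 4 e b 0.
escaped = b"abc\n\\u4eb0"
print(len(escaped))  # 10

# Decoding as ASCII (or UTF-8) just carries the backslash and "u" across,
# which is why the converted string still shows the raw escape.
still_escaped = escaped.decode("ascii")
print(len(still_escaped))  # 10

# With real UTF-8 bytes, the three bytes E4 BA B0 become one character.
utf8_bytes = b"Hello world! \xe4\xba\xb0. Hello, again \n"
decoded = utf8_bytes.decode("utf-8")
print(decoded[13] == u"\u4eb0")  # True
```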

-- Tim Roberts, Providenza & Boekelheide, Inc.

timr@probo.com

Nathan_McCorkle · July 12, 2019, 8:38pm

This works for me… don’t try printing it to the console though; it doesn’t seem to know how to print that character (it shows up in the RichText box as an Asian-looking character).

testScript = "Hello world! \u4eb0. Hello, again \n"
testUnicode = unicode(testScript, "unicode-escape")
print type(testUnicode)
panel.WriteText(testUnicode)
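
For reference, the same idea in a form that should run unchanged on Python 2.7 and Python 3: calling `.decode("unicode_escape")` on the byte string instead of the Python 2-only `unicode()` constructor. The name `raw` is just a stand-in for whatever string arrives from outside. One caveat: this codec treats any non-ASCII bytes in the input as Latin-1, so it is only a safe choice when the incoming data is plain ASCII plus escapes.

```python
# Bytes as they might arrive from outside: the backslash and "u" are literal.
raw = b"Hello world! \\u4eb0. Hello, again \n"

# The unicode_escape codec interprets \uXXXX (and \n, \xNN, ...) at run time,
# doing what the parser would have done for a u"..." literal.
text = raw.decode("unicode_escape")

print(text == u"Hello world! \u4eb0. Hello, again \n")  # True
```

The resulting `text` is what would then be handed to `WriteText`.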

Jake_Larrimore · July 12, 2019, 8:38pm

Tim–
Thanks for the explanation. That makes sense.

Nathan–
That worked for me also. This is what I ended up using. Though Tim’s explanation was also very helpful as to WHY it wasn’t working. I’m glad I don’t have to parse through by hand.

Thanks again guys,
Jake

Chris_Barker1 · July 12, 2019, 8:38pm

I’ve lost track a bit as to what the OP really needs, but a note:

···

On Fri, Sep 19, 2014 at 12:37 PM, Tim Roberts timr@probo.com wrote:

If you are receiving 8-bit strings that contain these Unicode
escapes, then you are going to have to parse it by hand after you
convert it to Unicode.

or use eval(), which is much easier, though always potentially dangerous:

In [10]: print rs
this is a 'raw' string that has a unicode escape in it: \u00B0

In [11]: eval('u"%s"' % rs)
Out[11]: u"this is a 'raw' string that has a unicode escape in it: \xb0"

In this case, the eval() creates an actual unicode object.

There may be a way to invoke Python’s string parsing without the general-purpose eval; I haven’t looked.
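
One possible way to avoid eval() is a regular expression that expands only `\uXXXX` sequences and leaves everything else alone. This is a rough sketch rather than a drop-in answer: `expand_u_escapes` is a hypothetical helper name, the `unichr` shim is only there so the same code runs on Python 2.7 and Python 3, and it deliberately ignores other escapes as well as the case where the backslash itself is escaped.

```python
import re

# Matches a literal backslash, "u", and exactly four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")

try:
    unichr  # exists on Python 2
except NameError:
    unichr = chr  # Python 3: chr() already returns a unicode character


def expand_u_escapes(text):
    """Replace each literal \\uXXXX in an already-decoded string with its character."""
    return _U_ESCAPE.sub(lambda m: unichr(int(m.group(1), 16)), text)


rs = u"this is a 'raw' string that has a unicode escape in it: \\u00B0"
print(expand_u_escapes(rs) == u"this is a 'raw' string that has a unicode escape in it: \u00b0")  # True
```

Because nothing is evaluated as code, quotes or other expressions inside the incoming string cannot do any harm.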

-Chris

If you need to embed Unicode code points in
an 8-bit string, then you need to check into using UTF-8. The UTF-8
for U+4EB0 is E4 BA B0. So, you could say this:

    testScript = "Hello world!  \xe4\xba\xb0. Hello, again \n"

    testUnicode = testScript.decode('utf-8')

In that case, I’d just use unicode, then encode it to UTF-8, just like the normal old way to do it. Why write what is essentially a UTF-8 encoder when Python gives you one?
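
A minimal sketch of that round trip, assuming the text starts life as a unicode object (names are illustrative, and the code should run on Python 2.7 and Python 3):

```python
# Start from a unicode object and let Python produce the UTF-8 bytes.
text = u"Hello world! \u4eb0. Hello, again \n"
utf8_bytes = text.encode("utf-8")

# The character U+4EB0 comes out as the three bytes E4 BA B0 quoted above.
print(utf8_bytes[13:16] == b"\xe4\xba\xb0")  # True

# Decoding with the same codec gives back the original unicode string.
print(utf8_bytes.decode("utf-8") == text)  # True
```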

-CHB

Christopher Barker, Ph.D.

Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

Chris.Barker@noaa.gov