Sorting Unicode filenames

Bostjan_Mejak1 · January 26, 2011, 7:52pm

I have a question about sorting Unicode filenames of a list created by glob.glob('C:\*'). I have some Unicode filenames in my C:\ directory which are not sorted with respect to Unicode letters. How can I achieve such sorting? Please note that the built-in list methods sorted() and sort() do not achieve this goal.

Robin · January 26, 2011, 8:02pm

Are they Unicode objects, or strings encoded with the system locale's default filesystem encoding? If the latter then you'll probably want to convert them to Unicode objects first and then see http://lmgtfy.com/?q=python+sort+unicode

On 1/26/11 11:52 AM, Boštjan Mejak wrote:

I have a question about sorting Unicode filenames of a list created by
glob.glob('C:\*'). I have some Unicode filenames in my C:\ directory
which are not sorted with respect to Unicode letters. How can I achieve
such sorting? Please note that the built-in list methods sorted() and
sort() do not achieve this goal.

--
Robin Dunn
Software Craftsman

Bostjan_Mejak1 · January 26, 2011, 10:08pm

They are Unicode objects.

Anders_J_Munch1 · January 27, 2011, 10:34am

Boštjan Mejak wrote:

They are Unicode objects.

Are you sure? If you pass a non-unicode string to glob.glob or os.listdir, you will not get Python unicode objects. Could you show us the output from

    sorted(glob.glob(u'c:\\*'))

(Note the u and the doubled backslashes.)

If the default unicode sort isn't satisfactory, you may need to set the locale and use a locale-aware comparison function:

    sorted(glob.glob(u'c:\\*'), cmp=locale.strcoll)

regards, Anders

Bostjan_Mejak1 · January 27, 2011, 11:16am

sorted(glob.glob(u'c:\\*')) gives me this:

[u'c:\\$Recycle.Bin', u'c:\\Documentation', u'c:\\Documents and Settings',
u'c:\\Intel', u'c:\\My Music', u'c:\\PerfLogs', u'c:\\Program Files',
u'c:\\Program Files (x86)', u'c:\\ProgramData', u'c:\\Python27',
u'c:\\Python31', u'c:\\System Volume Information', u'c:\\Temp',
u'c:\\Update', u'c:\\Users', u'c:\\VAIO Entertainment', u'c:\\VC_RED.MSI',
u'c:\\VC_RED.cab', u'c:\\Windows', u'c:\\WirelessDiagLog.csv',
u'c:\\_FS_SWRINFO', u'c:\\eula.1028.txt', u'c:\\eula.1031.txt',
u'c:\\eula.1033.txt', u'c:\\eula.1036.txt', u'c:\\eula.1040.txt',
u'c:\\eula.1041.txt', u'c:\\eula.1042.txt', u'c:\\eula.2052.txt',
u'c:\\eula.3082.txt', u'c:\\globdata.ini', u'c:\\hiberfil.sys',
u'c:\\install.exe', u'c:\\install.ini', u'c:\\install.res.1028.dll',
u'c:\\install.res.1031.dll', u'c:\\install.res.1033.dll',
u'c:\\install.res.1036.dll', u'c:\\install.res.1040.dll',
u'c:\\install.res.1041.dll', u'c:\\install.res.1042.dll',
u'c:\\install.res.2052.dll', u'c:\\install.res.3082.dll', u'c:\\lotus',
u'c:\\pagefile.sys', u'c:\\pushover', u'c:\\sql2ksp3', u'c:\\test.xml',
u'c:\\vcredist.bmp', u'c:\\vcredist_x86.log', u'c:\\\u010dtest.txt',
u'c:\\\u017etest.txt']

Please note the last two files in the list. They are supposed to be čtest.txt and žtest.txt, respectively.

Here is my handler function:

def OnButtonUnderWindows(self, event):
    """
    Count folders and files of the C:\ partition under Microsoft Windows
    operating system.
    """
    wx.BeginBusyCursor()
    self.Disable()
    wx.Sleep(2)
    wx.SafeYield(self)
    folders = 0
    files = 0
    directory = u'C:\\*'
    items = sorted(glob(directory),
                   cmp=locale.strcoll)
    newLine = u'\n'
    joinedItems = newLine.join(items)
    self.textField.WriteText(joinedItems)
    for item in items:
        if os.path.isdir(item):
            folders += 1
        elif os.path.isfile(item):
            files += 1
    self.statusBar.SetStatusText(text='Folders found: {number}'.format(number=folders),
                                 number=1)
    self.statusBar.SetStatusText(text='Files found: {number}'.format(number=files),
                                 number=2)
    self.button.Disable()
    self.Enable()
    wx.EndBusyCursor()

This function does not sort the Unicode filenames at all. It just places the Unicode filenames at the end. Please assist me.

Anders_J_Munch1 · January 27, 2011, 12:40pm

Boštjan Mejak wrote:

Please note the last two files in the list. They are supposed to be

čtest.txt and žtest.txt, respectively.

Here is my handler function:

Boštjan,

You need to set the locale before locale.strcoll makes a difference. Read up on locales and the locale module.

Anders
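For readers wondering what "setting the locale" looks like in practice, here is a minimal sketch using only the standard locale module. It is written in Python 3 syntax rather than the Python 2 used in this thread. The empty-string argument asks for whatever locale the user's environment is configured with, which is an assumption about what the poster wanted, and the sample strings are made up.

```python
import locale

try:
    # An empty string selects the locale configured for the current user.
    active = locale.setlocale(locale.LC_ALL, '')
except locale.Error:
    # No usable locale is configured; query the one that is still in effect.
    active = locale.setlocale(locale.LC_ALL)

# strcoll returns a negative number, zero, or a positive number.
before = locale.strcoll('a', 'b')
same = locale.strcoll('a', 'a')

# How this pair compares depends on the active locale (\u010d is "č").
accented = locale.strcoll('\u010d', 'd')

print(active, before, same, accented)
```

Without the setlocale call the process stays in the default "C" locale, where strcoll generally behaves like the plain comparison operators.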

Bostjan_Mejak1 · January 27, 2011, 12:56pm

How would you suggest I set the locale?

Anders_J_Munch1 · January 27, 2011, 1:35pm

Boštjan Mejak wrote:

How would you suggest I set the locale?

I already told you: I would suggest you read up on the subject.

Note that this is not actually a wxPython issue; the locale module is a standard Python module, so comp.lang.python is more appropriate.

Anders

Bostjan_Mejak1 · January 27, 2011, 2:17pm

I have managed to make it work. It works on Windows and on Linux as well. Thanks for your help.

But I have one additional question about how the Python console displays Unicode characters under Windows OS.

>>> glob.glob('C:\*')
['C:\\\xe8test.txt', 'C:\\\x9atest.txt', 'C:\\\x9etest.txt']

Why is this not displayed as ['C:\\čtest.txt', 'C:\\štest.txt', 'C:\\žtest.txt']? In Python 3.x this issue is resolved. How can I make my Python 2.x console display Unicode characters?

Tim_Roberts · January 27, 2011, 7:20pm

Boštjan Mejak wrote:

sorted(glob.glob(u'c:\\*')) gives me this:

[u'c:\\$Recycle.Bin', u'c:\\Documentation', u'c:\\Documents and Settings',
…
u'c:\\pagefile.sys', u'c:\\pushover', u'c:\\sql2ksp3', u'c:\\test.xml',
u'c:\\vcredist.bmp', u'c:\\vcredist_x86.log', u'c:\\\u010dtest.txt',
u'c:\\\u017etest.txt']

Please note the last two files in the list. They are supposed to be
čtest.txt and žtest.txt, respectively.

Yes, they are. If you had used print, you'd see that's what \u010d and \u017e are.

This function does not sort the Unicode filenames at all. It just places
the Unicode filenames at the end. Please assist me.

No, they are being sorted, but they are sorted in strict numerical order of their character codes, not in the alphabetical order a person would expect. That's why, for example, "P" sorts before "e". The normal string comparison operators don't understand where č should be ordered. I thought there was a comparison function that knew how to do this, but I haven't been able to find it in 5 minutes of Googling. There must be some folks on this list who know.

Also note that ALL of those are "Unicode filenames". The difference in those last two is that they contain characters beyond the first 128.
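The code-point ordering described above can be seen directly. This is a small sketch in Python 3 syntax (where every str is Unicode); the names are a made-up subset of the listing in the thread.

```python
# A made-up subset of the directory listing; \u010d is "č" and \u017e is "ž".
names = ['eula.txt', 'Program Files', '\u010dtest.txt', 'Windows', '\u017etest.txt']

# The default comparison works code point by code point.
ordered = sorted(names)

for name in ordered:
    # The first character decides the order for these particular names.
    # ascii() keeps the output printable on any console.
    print('U+%04X  %s' % (ord(name[0]), ascii(name)))
```

The uppercase names come first (P is U+0050, W is U+0057), then the lowercase "e" (U+0065), and only then č (U+010D) and ž (U+017E), which matches the listing shown earlier.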

-- Tim Roberts, Providenza & Boekelheide, Inc.

timr@probo.com

Tim_Roberts · January 27, 2011, 7:27pm

Boštjan Mejak wrote:

I have managed to make it work. It works on Windows and on Linux as
well. Thanks for your help.

But I have one additional question about how the Python console
displays Unicode characters under Windows OS.

>>> glob.glob('C:\*')
['C:\\\xe8test.txt', 'C:\\\x9atest.txt', 'C:\\\x9etest.txt']

Why is this not displayed as ['C:\\čtest.txt', 'C:\\štest.txt',
'C:\\žtest.txt']? In Python 3.x this issue is resolved. How can I make
my Python 2.x console display Unicode characters?

Part of this is the difference between typing a string at the
interpreter's prompt, and using the "print" statement. Those use two
different methods. Note:

  C:\tmp>python
  Python 2.6.2 (r262:71605, Apr 14 2009, 22:40:02) [MSC v.1500 32 bit
(Intel)] on win32
  Type "help", "copyright", "credits" or "license" for more information.
  >>> x = '\xe8'
  >>> x
  '\xe8'
  >>> print x
  Φ
  >>> x = '\xa9'
  >>> x
  '\xa9'
  >>> print x
  ⌐
  >>>

When you just type a string, the interpreter uses the __repr__ function
to display it. __repr__ converts everything outside of the visible ASCII
range (0x20 to 0x7E) to hex escape codes, because the other characters
cannot be consistently printed. "print" uses __str__, which doesn't do
that translation.

The other issue is your code page. For you, \xE8 translated to "lower
case c with caron", because that's your code page. For me, \xE8
translates to "Greek upper-case phi". That's why __repr__ prints the
hex value, because it is portable.

If your code page is set up properly, then

    print glob.glob('c:\\*')

should do what you expect. Note the doubled backslash; you got away with
a single one because \* is not an escape sequence, but you need to get
into the right habit.
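The code page effect can be reproduced without a Windows console by decoding the same bytes two ways. This is a sketch in Python 3 syntax. It assumes, as a guess from the characters reported in the thread, that the two code pages involved are Windows-1250 (Central European) and code page 437.

```python
# The three bytes that appear in the 8-bit file names above.
raw = b'\xe8\x9a\x9e'

# Assumed code pages: Windows-1250 on one machine, code page 437 on the other.
as_cp1250 = raw.decode('cp1250')
as_cp437 = raw.decode('cp437')

# ascii() shows which characters came out without depending on the console.
print(ascii(as_cp1250))
print(ascii(as_cp437))
```

Under Windows-1250 the bytes decode to č, š and ž; under code page 437 the first byte decodes to the Greek capital phi seen in the session above.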

--
Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.

Bostjan_Mejak1 · January 27, 2011, 11:26pm

I get the same result. The Python interpreter displays the output of print glob.glob('C:\*') as […, 'C:\\\xe8test.txt', 'C:\\\x9atest.txt', 'C:\\\x9etest.txt']. What is wrong now?

Tim_Roberts · January 28, 2011, 12:28am

Boštjan Mejak wrote:

I get the same result of Python interpreter displaying print
glob.glob('C:\*') as […, 'C:\\\xe8test.txt',
'C:\\\x9atest.txt', 'C:\\\x9etest.txt']. What is it wrong now?

You're quite right, that was foolish of me. It's a list, and the list elements get converted to strings using repr anyway. Try this instead:

    for k in glob.glob('c:\\*'):
        print k
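The point about lists can be checked in a few lines. This sketch uses Python 3 syntax with made-up paths. Python 3's repr no longer escapes non-ASCII letters, so ascii() stands in here for the escaped display that Python 2 produced.

```python
# Made-up sample paths; \u010d is "č".
items = ['c:\\temp', 'c:\\\u010dtest.txt']

# Converting the list itself applies repr() to every element,
# so quotes are included and backslashes appear doubled.
as_list = str(items)

# Converting one element at a time gives the plain text.
as_lines = [str(item) for item in items]

# ascii() is the closest Python 3 analogue of the escaped Python 2 display.
escaped = ascii(items[1])

print(escaped)
```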

-- Tim Roberts, Providenza & Boekelheide, Inc.

timr@probo.com

Bostjan_Mejak1 · January 28, 2011, 1:36am

    for k in glob.glob(u'c:\\*'):
        print k

You forgot the u.

Tim_Roberts · January 28, 2011, 1:49am

Boštjan Mejak wrote:

    for k in glob.glob(u'c:\\*'):
        print k

You forgot the u.

Your example didn't have the "u" either. I assumed you wanted to see the 8-bit versions of the strings.

-- Tim Roberts, Providenza & Boekelheide, Inc.

timr@probo.com

Bostjan_Mejak1 · January 28, 2011, 3:15am

What is the difference between Python 2.x’s sorted() built-in function which has a cmp argument, and the Python 3.x’s sorted() built-in function which has a key argument? What is the difference between these two arguments? Do they do the same thing?

Bostjan_Mejak1 · January 28, 2011, 12:17pm

Python 2.x:
sorted(glob.glob(u'C:\\*'), cmp=locale.strcoll)

Python 3.x:
sorted(glob.glob(u'C:\\*'), key=locale.strcoll)

What is the difference here?

Tim_Roberts · January 28, 2011, 6:32pm

Boštjan Mejak wrote:

Python 2.x:
sorted(glob.glob(u'C:\\*'), cmp=locale.strcoll)

Python 3.x:
sorted(glob.glob(u'C:\\*'), key=locale.strcoll)

What is the difference here?

Well, one difference is that the first one will work, while the second one will not.

The "cmp" function takes two entries and returns a negative number, zero, or a positive number (conventionally -1, 0, or 1) based on how the two items compare to each other. It is called once for every comparison in a sort, and there can be a LOT of comparisons in a sorting operation.

The "key" function is a different approach. It takes one entry and returns a "key" for that entry which can then be sorted with the ordinary comparison operators. It basically converts each record into a form that can be sorted in order simply. The "key" function only has to be called once for each record. So, if "c with caron" is supposed to sort as equal to "c", then the "key" function might change one to the other.
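As an illustration of a key function that "changes one to the other", here is a minimal sketch in Python 3 syntax. The helper name fold_key and the sample names are hypothetical. Treating č as equal to c is only the example given above, not necessarily the right rule for any particular language.

```python
import unicodedata


def fold_key(name):
    """Hypothetical key: drop combining marks and case so that č sorts like c."""
    # NFD splits "č" into "c" followed by a combining caron.
    decomposed = unicodedata.normalize('NFD', name)
    stripped = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


# Made-up sample names; \u010d is "č" and \u017e is "ž".
names = ['dog.txt', '\u010dtest.txt', 'Cat.txt', '\u017etest.txt', 'zebra.txt']

# fold_key is called once per name; the resulting keys are compared normally.
ordered = sorted(names, key=fold_key)

print([ascii(name) for name in ordered])
```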

"locale.strcoll" is a cmp-style function. Here is a mailing list message that shows a sneaky way to wrap a "cmp" function so that it can be used as a "key" function. It even uses "locale.strcoll" as its example:

http://mail.python.org/pipermail/python-list/2010-January/1234158.html
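One way such a wrapper can work is sketched below in Python 3 syntax. This is not the code from the linked message; the names cmp_to_key_sketch and compare_ignoring_case are hypothetical. The standard library offers the same idea as functools.cmp_to_key (Python 2.7 and 3.2 onward), which the last line uses for comparison.

```python
import functools


def cmp_to_key_sketch(cmp_func):
    """Wrap a two-argument comparison so it can serve as a key function."""

    class Key:
        def __init__(self, value):
            self.value = value

        def __lt__(self, other):
            # sorted() only needs "<", so delegate it to the comparison function.
            return cmp_func(self.value, other.value) < 0

    return Key


def compare_ignoring_case(a, b):
    """Hypothetical cmp-style function: returns -1, 0 or 1."""
    a, b = a.lower(), b.lower()
    return (a > b) - (a < b)


names = ['pagefile.sys', 'Windows', 'eula.txt', 'Program Files']

ordered = sorted(names, key=cmp_to_key_sketch(compare_ignoring_case))
builtin = sorted(names, key=functools.cmp_to_key(compare_ignoring_case))

print(ordered)
```

The wrapper keeps the cost that the cmp style has: the comparison function still runs once per comparison, not once per item.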

-- Tim Roberts, Providenza & Boekelheide, Inc.

timr@probo.com

Jean-Michel_Fauth1 · January 28, 2011, 7:04pm

By encoding the 'unicode' with the coding of the console
which will receive it.
Eg:
<unicode type>.encode(sys.stdout.encoding, 'replace')
or
<unicode type>.encode(a_coding, 'replace')
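A minimal sketch of that encode call, in Python 3 syntax where the result is a bytes object. The sample name is made up, and the fallback to ASCII when the stream reports no encoding is an assumption of this sketch, not part of the advice above.

```python
import sys

text = '\u010dtest.txt'  # "čtest.txt"

# The stream may report no encoding (for example when redirected),
# so fall back to ASCII in that case.
console_encoding = getattr(sys.stdout, 'encoding', None) or 'ascii'

# 'replace' substitutes "?" for anything the target encoding cannot express.
for_console = text.encode(console_encoding, 'replace')
for_ascii = text.encode('ascii', 'replace')
for_cp1250 = text.encode('cp1250', 'replace')

print(repr(for_console), repr(for_ascii), repr(for_cp1250))
```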

On Jan 27, 3:17 pm, Boštjan Mejak <bostjan.me...@gmail.com> wrote:

... How can I make my
Python 2.x console display Unicode characters?

---

[In short]

A 'unicode' is an encoding-less representation, and it
should always be encoded for the environment
which will use it, whether that is a file, a GUI, a console,
a db, a printer, ...

The encoding of characters is a domain of its own. It is
independent of the platforms, the apps and the hardware,
even though all of them have to use it.

jmf

Robin · January 28, 2011, 7:13pm

Python 3 dropped support for the cmp arg (along with cmp() and __cmp__), so instead you need to use the key arg to specify a function that returns a value to be used for the comparison. I don't think that locale.strcoll fits that requirement, so you'll probably need to use something else there. BTW, the sorted key parameter has been available since at least Python 2.5, so you could be using it now. See the docs for sorted().
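One candidate for that "something else" is locale.strxfrm, the standard library's key-style counterpart to locale.strcoll. The sketch below is in Python 3 syntax with made-up names. It assumes the user's configured locale is the one whose ordering is wanted, and the resulting order depends on that locale.

```python
import locale

try:
    # Use the collation rules of the user's configured locale.
    locale.setlocale(locale.LC_COLLATE, '')
except locale.Error:
    # Nothing usable is configured; the default "C" locale stays active.
    pass

# Made-up sample names; \u010d is "č" and \u017e is "ž".
names = ['zebra.txt', '\u017etest.txt', 'apple.txt', '\u010dtest.txt']

# strxfrm maps each string to a value whose ordinary comparison
# follows the locale's collation order, so it fits the key argument.
ordered = sorted(names, key=locale.strxfrm)

print([ascii(name) for name in ordered])
```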

On 1/28/11 4:17 AM, Boštjan Mejak wrote:

Python 2.x:
sorted(glob.glob(u'C:\\*'), cmp=locale.strcoll)

Python 3.x:
sorted(glob.glob(u'C:\\*'), key=locale.strcoll)

What is the difference here?

--
Robin Dunn
Software Craftsman