wx.lib.pdfviewer - special characters issue

Hi,

I just noticed that the viewer has problems with special characters such as �� etc etc.

The problem can be seen when one opens the attached .pdf in the viewer; if I view it with e.g. Adobe Reader, the characters are fine.

Is this to do with the way PythonReports defines/uses the fonts?

Any hint on how to make this display correctly?

This is with 2.9.5 preview (classic).

Werner

Cellarbook wine list - portrait.pdf (70 KB)

I just noticed that the viewer has problems with special characters such
as �� etc etc.

It is probably for much the same reason as why they don’t appear correctly here either :wink:

I am just catching up with things after a few days away, but I will investigate as soon as I can

David

···

On Wednesday, May 22, 2013 2:39:36 PM UTC+1, werner wrote:

Hi David,

I just noticed that the viewer has problems with special characters such
as �� etc etc.

How I do love all this encoding stuff.

In my “Sent” folder in Thunderbird it shows as it should, “a accent” and “e accent”, but the message which came in via the Google group is showing garbage.

Let's see whether it works all the way if I reply to this on the Google group.
áé

I am just catching up with things after a few days away, but I will investigate as soon as I can

I am catching up too; I was off for a few days.

Thanks for adding it to your list:)
Werner

···

On Wednesday, 22 May 2013 16:29:29 UTC+2, David Hughes wrote:

On Wednesday, May 22, 2013 2:39:36 PM UTC+1, werner wrote:

Now, hold on a minute. The last two characters here did show up as
“a accent” and “e accent”, but in my Thunderbird, the two characters
in your original mail were the Hebrew letters “tet” and “alef”
(U+05D8 and U+05D0). What did you actually type?

-- Tim Roberts, Providenza & Boekelheide, Inc.

timr@probo.com

Same for me in gmail web client...

Isn't this fun!

OT: Anyone know what the encoding story is with email? I'm sure the
original spec was ASCII only (probably 7 bit...), but are you now free
to use any (hopefully specified) encoding, or is it always UTF-8 or???

Just curious, really...

-CHB

···

On Wed, May 22, 2013 at 9:35 AM, Tim Roberts <timr@probo.com> wrote:

--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

Chris.Barker@noaa.gov

Chris Barker - NOAA Federal wrote:

OT: Anyone know what the encoding story is with email? I'm sure the
original spec was ASCII only (probably 7 bit...), but are you now free
to use any (hopefully specified) encoding, or is it always UTF-8 or???

Just curious, really...

Many mail clients (or MUAs, "Mail User Agents") will let you choose from a large list of encodings, and the way the text is put into the "payload" of the email message as ASCII is well defined by various RFCs. You can also dig into the stock email package docs and code in the Python standard library and get all kinds of juicy details about it.

http://docs.python.org/2/library/email.html
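
A minimal sketch of what that looks like with the standard library's email package. It is written for Python 3 (the link above points at the Python 2 docs), and the sample text and variable names are made up for illustration. It shows non-ASCII text being packed into an ASCII-only message with the charset declared in a header, and a receiver reversing the process.

```python
from email import message_from_string
from email.mime.text import MIMEText

# Build a message whose body contains non-ASCII text, declared as UTF-8.
msg = MIMEText("áé", "plain", "utf-8")
raw = msg.as_string()

# The serialized form is pure ASCII: the charset is named in the
# Content-Type header and the body is transfer-encoded.
print(msg["Content-Type"])
print(msg["Content-Transfer-Encoding"])

# A receiving client reverses both steps, using the declared charset.
parsed = message_from_string(raw)
charset = parsed.get_content_charset()
text = parsed.get_payload(decode=True).decode(charset)
print(text)
```

If the declared charset does not match the bytes actually sent, the last step is where the garbage appears.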

--
Robin Dunn
Software Craftsman

Tim Roberts wrote:

Now, hold on a minute. The last two characters here did show up as "a
accent" and "e accent", but in my Thunderbird, the two characters in
your original mail were the Hebrew letters "tet" and "alef" (U+05D8 and
U+05D0). What did you actually type?

I saw the Hebrew letters too. Perhaps there was some strangeness in the message encoding settings for the original message or the mail client? Anyway, Werner's first message used the UTF-8 encoding and the 2nd was ISO-8859-1, if that helps.
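
The symptoms reported in this thread are consistent with the same bytes being read under different charsets. The Python 3 sketch below shows the mechanism only. It assumes, purely for illustration, that the bytes were single-byte Latin-1 and that a client read them either as UTF-8 or under a Hebrew code page (ISO-8859-8); nothing above confirms either assumption.

```python
# Two accented characters as single-byte ISO-8859-1 (Latin-1) data.
latin1_bytes = "áé".encode("latin-1")  # bytes 0xE1 0xE9

# Read as UTF-8, these bytes are invalid, so a lenient decoder substitutes
# U+FFFD REPLACEMENT CHARACTER for each of them.
as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

# Read under a Hebrew code page, the same bytes become Hebrew letters.
as_hebrew = latin1_bytes.decode("iso8859_8")

# Going the other way: tet and alef (U+05D8, U+05D0) are the bytes
# 0xE8 0xE0 in ISO-8859-8, which Latin-1 would display as "èà".
tet_alef_bytes = "\u05d8\u05d0".encode("iso8859_8")
as_latin1 = tet_alef_bytes.decode("latin-1")

print(repr(as_utf8), repr(as_hebrew), repr(as_latin1))
```

Which wrong characters appear depends on both the bytes sent and the charset the receiving client picks, so two readers can see different garbage for the same message.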

···

--
Robin Dunn
Software Craftsman

Hi Tim,

The same thing in the same way (both in Thunderbird and when it worked in the Firefox browser), which is to hold down “Alt” and type 0225 for the “a accent” and 0233 for the “e accent”, on Windows 7 with a keyboard configured as “UK English” (you will love it: I actually have a French keyboard but never have it configured as a French one).
Werner

Yes, when your attached pdf file is displayed in the viewer, the accented characters are all displayed incorrectly - see OriginalPdf.png. For example, e-acute is displayed as a dagger symbol.

But, using the Wing debugger, the unicode strings in the PDF show the e-acute as \u2020 - which is indeed the Unicode character 'Dagger':

u'L\'\u2020volution du mill\u2020sime 2000 confirme bien sa r\u2020putation de "Mill\u2020sime du si\u2021cle". Un tr\u2021s grand potentiel qui commence'

Yet your attachment displays correctly in Adobe Reader. Now if I cut and paste the text out of Adobe Reader and inject it as a comment in one of my recipes, then display it using the viewer, that all displays correctly as well - see PastedText.png. Wing now reports the unicode strings in the PDF as being like:

u'\xe9volution du mill\xe9sime 2000 confirme bien sa'

which is what I would expect.

So, I think the viewer is behaving correctly as far as it goes - but I don't know what Adobe Reader is doing to make it work with the original data.

David

···

On 22/05/2013 15:29, David Hughes wrote:

But, using the Wing
debugger, the unicode strings in the PDF show the e-acute as \u2020 - which
is indeed the Unicode character 'Dagger'

If you're using Wing, then this is the string after it was decoded
into a Python unicode object, yes?

In which case, the wrong encoding is being used to decode it.
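
One way to narrow this down is to ask which byte each candidate encoding uses for the characters that actually showed up. The Python 3 sketch below does that for the dagger and double dagger David reported. The list of candidate codecs is a guess, not something taken from pyPdf, and the approach assumes a single-byte text encoding is involved at all, which may not hold for a subset font with its own character codes.

```python
# Characters seen in place of e-acute and e-grave in the decoded strings.
observed = {"dagger": "\u2020", "double dagger": "\u2021"}

# Candidate single-byte codecs; this list is an assumption for illustration.
candidates = ["cp1252", "mac_roman", "latin-1"]


def byte_for(char, codec):
    """Return the byte string `codec` uses for `char`, or None if it has none."""
    try:
        return char.encode(codec)
    except UnicodeEncodeError:
        return None


table = {
    name: {codec: byte_for(char, codec) for codec in candidates}
    for name, char in observed.items()
}

for name, row in table.items():
    print(name, row)
```

A codec that has no byte for the observed character (Latin-1 here) cannot be the one that produced it, which at least rules some candidates out.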

So the question is: how are strings encoded in PDF? From reading this thread:

That's a hard question to answer, but presumably it is either:

All PDF text is encoded with a particular encoding
or
There is a way to specify the encoding in a particular document.

I suspect it's the latter, or you would have this problem all the
time. It could also be that PythonReports is using the wrong encoding
or specifying it incorrectly but as Adobe Reader is the reference
implementation, to some extent, if it works in Reader, it's right.

So you need to figure out how Reader determines the encoding, and
emulate that. Maybe the specs will help:

http://www.adobe.com/devnet/pdf/pdf_reference.html

So, I think the viewer is behaving correctly as far as it goes -

not really -- it's using the wrong encoding to decode the data in the
PDF -- that is not correct ( as long as you define correct as "same as
Adobe Reader" )

I'd make a tiny pdf with just a bit of non-ascii text in it, and take
a look at it. That may be easier than reading the spec!

-Chris

···

On Thu, May 23, 2013 at 8:58 AM, David Hughes <dfh@forestfield.co.uk> wrote:


BTW,

For what it's worth, Chrome's PDF viewer works OK too -- not sure if
that's Adobe under the hood....

-Chris


One more note....

$ grep -a Encoding Cellarbook\ wine\ list\ -\ portrait.pdf
/Encoding /WinAnsiEncoding

What is the viewer using as an encoding when it decodes the pdf?

Though this may refer to encoding used for symbols in the PDF, rather
than text to display.

I also see this in there:

% 'toUnicodeCMap:AAAAAA+Arial-BoldMT': class PDFStream
7 0 obj
<< /Filter [ /FlateDecode ]
/Length 710 >>
stream
<<bunch of binary stuff....>>

Can't make much sense of that!

% Font Arial Bold subset 0
<< /BaseFont /AAAAAA+Arial-BoldMT
/FirstChar 0
/FontDescriptor 9 0 R
/LastChar 127
/Name /F2+0
/Subtype /TrueType
/ToUnicode 7 0 R

Given that I can't see any of the text in there when I look at it as
text (I think my terminal is set to utf-8), then it seems to be using
a multi-byte encoding of some sort -- but which one? (or it's
compressed or something -- I sure don't know anything about PDF...)

Again, I'd make a pdf with just a single paragraph of text and look at
/ experiment with that.
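
On the "compressed or something" guess: a stream with /Filter /FlateDecode is zlib-compressed, and a /ToUnicode stream holds a CMap that maps the font's character codes to Unicode. The Python 3 sketch below builds a toy compressed CMap, inflates it with zlib, and applies its bfchar entries. The CMap is invented for illustration and is not the contents of the attached file, and the helper names are hypothetical. Real CMaps also use bfrange sections and multi-byte codes, which this does not handle.

```python
import re
import zlib

# Toy CMap fragment, invented for this example.
toy_cmap = b"""
2 beginbfchar
<01> <00E9>
<02> <00E8>
endbfchar
"""
# Stands in for the bytes found between "stream" and "endstream".
stream_data = zlib.compress(toy_cmap)


def parse_bfchar(cmap_text):
    """Map character codes to Unicode strings from <code> <unicode> pairs."""
    mapping = {}
    pairs = re.findall(r"<([0-9A-Fa-f]+)>\s+<([0-9A-Fa-f]+)>", cmap_text)
    for code, uni in pairs:
        # Destination values in a ToUnicode CMap are UTF-16BE.
        mapping[int(code, 16)] = bytes.fromhex(uni).decode("utf-16-be")
    return mapping


to_unicode = parse_bfchar(zlib.decompress(stream_data).decode("ascii"))


def decode_codes(codes, mapping):
    """Translate font character codes; fall back to Latin-1 for unmapped codes."""
    return "".join(mapping.get(b, bytes([b]).decode("latin-1")) for b in codes)


decoded = decode_codes(b"\x01volution, si\x02cle", to_unicode)
print(decoded)
```

The point of the toy is that the codes 0x01 and 0x02 mean nothing outside this font: a reader that ignores the font's own map and applies a fixed encoding will show some other character for them.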

-Chris

···

On Thu, May 23, 2013 at 11:14 AM, Chris Barker - NOAA Federal <chris.barker@noaa.gov> wrote:


The viewer itself doesn't do any encoding or decoding, it simply receives text strings from pyPdf and draws them in a wx.DC.

I agree though that encoding is the problem. The PastedText example I posted earlier works, I think, because it was written - via ReportLab - using one of the standard fonts (Helvetica) that Adobe provides. Werner's pdf file contains references to AAAAAA+ArialMT, the definitions of which seem to be embedded in the file and which, I guess, his text is using.

The viewer doesn't currently handle embedded fonts (because I don't know how to do it at the moment) and the problem is most likely that they are encoded differently to the standard fonts.

Werner, does PythonReports give you any choice which fonts you can use, i.e. can you restrict it to use of the Adobe standard fonts? This shouldn't make much difference to you in practice - Arial and Helvetica are pretty much the same thing. Alternatively, it might make a difference if the unicode(?) strings you pass it are encoded as 'latin-1'

Ideally, I would like to say that the viewer will be extended to handle embedded fonts, but I have no idea what work and time would be involved.

···

On 23/05/2013 19:12, Chris Barker - NOAA Federal wrote:

--
Regards

David Hughes
Forestfield Software

Hi David,

...

Werner, does PythonReports give you any choice which fonts you can use, i.e. can you restrict it to use of the Adobe standard fonts? This shouldn't make much difference to you in practice - Arial and Helvetica are pretty much the same thing. Alternatively, it might make a difference if the unicode(?) strings you pass it are encoded as 'latin-1'

I had a look at the font selection in the past but couldn't make it work then - will give it another go.

All my data comes via SQLAlchemy (SA 0.8) out of a Firebird SQL DB which uses the "UTF-8" character set. All the fields/columns use "sa.Column(sa.Unicode(length=nn))" and I don't do any encoding/decoding, so my guess is that it is also a font issue.

Ideally, I would like to say that the viewer will be extended to handle embedded fonts, but I have no idea what work and time would be involved.

I would be happy to test this;-)

Thanks for having looked at it.

Werner

···

On 24/05/2013 16:02, David Hughes wrote:

Chris Barker - NOAA Federal wrote:

For what it's worth, Chrome's PDF viewer work OK too -- not sure if
that's Adobe under the hood....

No. I was very surprised to learn that the built-in PDF viewer in
Firefox is 100% JavaScript, interpreted right there in the browser.
It's an open source component (pdf.js). I'm astonished that it is
able to do as good a job as it does. Chrome's built-in viewer, as far
as I know, is a separate native component rather than Adobe Reader.

···

--
Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.

Hi David,

...

Werner, does PythonReports give you any choice which fonts you can use, i.e. can you restrict it to use of the Adobe standard fonts? This shouldn't make much difference to you in practice - Arial and Helvetica are pretty much the same thing. Alternatively, it might make a difference if the unicode(?) strings you pass it are encoded as 'latin-1'

If I use "Helvetica" in PythonReports then I get exceptions, but they are thrown from within ReportLab.

Could it be that the problem is within pyPdf and/or PyPDF2 (I normally use the latter)?

I was experimenting a bit and see that there are display issues with PDFs generated by e.g. LibreOffice 4.x; a very simple one I did is attached, and here pdfviewer shows just a blank page.

Will try to dig around a bit more over the next few days.

Werner

testfromODF.pdf (13.8 KB)

testforodf.odt (8.75 KB)

···

On 24/05/2013 16:02, David Hughes wrote:

I have now got a version of pdfviewer that works for all types of PDF. Instead of pyPdf it uses python-fitz - the python bindings for the mupdf library, which does all the work of extraction and rendering of the PDF content.

I will be happy to provide you a copy of all the source code, Werner, but I would like to ask - Robin in particular - about the possibility of providing it as an addition to, or a replacement for, the current version of wx.lib.pdfviewer. My concern is that mupdf is released under GPL, specifically the GNU Affero General Public License version 3, and how this would affect the wxPython licence of pdfviewer and any software that uses it.

···

On 24/05/2013 15:39, werner wrote:

--
Regards

David Hughes
Forestfield Software

David Hughes wrote:

I will be happy to provide you a copy of all the source code, Werner, but
I would like to ask - Robin in particular - about the possibility
of providing it as an addition to, or a replacement for, the current
version of wx.lib.pdfviewer. My concern is that mupdf is released under
GPL, specifically the GNU Affero General Public License version 3, and
how this would affect the wxPython licence of pdfviewer and any software
that uses it.

I've done a bit of research about this related to some work I've done at Enthought. IMO it basically boils down to this: since DLLs (and therefore Python extension modules) are, by their very nature, dynamically loaded at runtime then using GPL'd DLLs (or whatever) from a non-GPL'd program is allowed. What is still a very questionable issue (and most likely not allowed) is distributing the GPL'd DLLs or other binaries with the non-GPL'd program. In other words, using (dynamically loaded) GPL with non-GPL is okay at runtime, distributing GPL in binary form together with non-GPL is not okay. For example, if a developer used py2exe to create an application that included python-fitz and mupdf, then to be legally compliant the application would have to be GPL. The alternative is that the developer would have to provide a way for those to be downloaded and installed separately from their application, and make that installer and whatever support code it uses GPL too.

IANAL, this is just my interpretation, etc.

···

On 24/05/2013 15:39, werner wrote:

On 24/05/2013 16:02, David Hughes wrote:

--
Robin Dunn
Software Craftsman

Like Robin, IANAL.

It is a pity that they don't use the LGPL, which I believe does not have the above issues.

So in my view please don't replace the existing wx.lib.pdfviewer with this version, maybe have it as pdfviewer2 or pdfviewerAlt.

Werner

···

On 11/06/2013 23:48, Robin Dunn wrote:

Has it been considered to make pdfviewer try several
backends and use the one that's available?

Karsten
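
A minimal sketch of that idea in Python 3, assuming the backends are distinguished only by import name. The function name and the ordering are hypothetical, and "fitz" is believed to be the import name of python-fitz. Nothing here reflects how wx.lib.pdfviewer is actually organised.

```python
import importlib

# Candidate module names in order of preference. The ordering is only an
# example; "fitz" is assumed to be the import name of python-fitz.
DEFAULT_BACKENDS = ("fitz", "PyPDF2", "pyPdf")


def load_first_available(candidates=DEFAULT_BACKENDS):
    """Return (name, module) for the first importable candidate, else (None, None)."""
    for name in candidates:
        try:
            return name, importlib.import_module(name)
        except ImportError:
            continue
    return None, None


# Demonstration with stand-in names so it runs without any PDF library installed.
name, module = load_first_available(("no_such_backend", "json"))
print("selected:", name)
```

With this shape, an application that does not ship mupdf would simply fall back to whichever of the other libraries is installed.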

···

On Wed, Jun 12, 2013 at 08:45:19AM +0200, werner wrote:

So in my view please don't replace the existing wx.lib.pdfviewer with
this version, maybe have it as pdfviewer2 or pdfviewerAlt.

--
GPG key ID E4071346 @ gpg-keyserver.de
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346