pdfsandwich / Bugs / #10 spaces between almost every character, somewhat scrambled select order

Jason Woofenden - 2015-11-03

oh, then I ran: pdftotext galton_ocr.pdf

Which produced galton_ocr.txt (attached above)

As you can see, there are spaces between almost every character, and much of the text is out of order.

The long stretch at the beginning that has one character on each line is most of the letters from the second line of the original's left column, but in reverse order.
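
For anyone who wants to check their own pdftotext output for this symptom without reading through it by eye, here is a minimal sketch of a heuristic. It only measures how many whitespace-separated tokens are a single character long. The function names and the 0.6 threshold are made up for illustration and are not part of pdfsandwich or pdftotext.

```python
def single_char_token_ratio(text: str) -> float:
    """Return the fraction of whitespace-separated tokens that are one character long."""
    tokens = text.split()
    if not tokens:
        return 0.0
    single = sum(1 for token in tokens if len(token) == 1)
    return single / len(tokens)


def looks_letter_spaced(text: str, threshold: float = 0.6) -> bool:
    """Guess whether extracted text has spaces between almost every character.

    The threshold is an arbitrary illustrative cut-off, not a tuned value.
    """
    return single_char_token_ratio(text) >= threshold
```

It could be applied to a file such as galton_ocr.txt by reading the file into a string and passing it to `looks_letter_spaced`. It says nothing about the scrambled order, only about the extra spaces.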

Tobias Elze - 2015-11-03

Indeed, there's something wrong. You should have one temp file on your disk now: /tmp/pdfsandwichdaf373.pdf
This is the file generated by tesseract before ghostscript becomes active. Could you open that file and check if it has the same problems?

Jason Woofenden - 2015-11-04

ooo! the tmp file appears to have the text in the correct order, without extraneous spaces! attached

pdfsandwichdaf373.pdf

pdfsandwichdaf373.txt

Tobias Elze - 2015-11-04

So it's ghostscript that messes things up. Well, maybe that's not a fair statement, as there's a discussion on the ghostscript bug report page saying that tesseract produces broken pdfs which look more or less alright until you do any further processing with them (such as with ghostscript).

One reason why I use ghostscript is to downscale tesseract's pdf to the original paper size - as you can see, the temporary file has an extraordinarily large paper size, way larger than the original pdf. Clearly a tesseract bug, and the reason why it's not so easy to omit ghostscript at the moment. I'll think of some other solution.
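
To put the page size issue in numbers: a PDF page is measured in points (1/72 inch), so the page size derived from a raster image depends on the resolution assumed for it. The sketch below shows only that arithmetic. The function name is hypothetical, and whether tesseract computes its page size exactly this way is an assumption.

```python
POINTS_PER_INCH = 72.0


def page_size_points(width_px: int, height_px: int, dpi: float) -> tuple:
    """Convert pixel dimensions to a page size in PDF points for a given resolution."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return (width_px * POINTS_PER_INCH / dpi, height_px * POINTS_PER_INCH / dpi)


# A 2550 x 3300 pixel scan at 300 dpi is US Letter: (612.0, 792.0) points.
letter = page_size_points(2550, 3300, 300)

# The same pixels at an assumed 72 dpi give a page more than four times
# as wide and as tall: (2550.0, 3300.0) points.
oversized = page_size_points(2550, 3300, 72)
```

The lower the resolution assumed for the image, the larger the resulting page, which would match the oversized temporary file.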

Jason Woofenden - 2015-11-04

I wonder if the change suggested in bug #9 Use less ghostscript would fix this bug.

I'm happy to do further testing if you tell me exactly what to do :)

Tobias Elze - 2015-11-04

Yes, the "use less ghostscript" bug would fit that indeed - or rather the "use no ghostscript" bug :)

On the other hand, it is true that tesseract produces broken pdfs in two ways: One issue is the huge page size, another issue is their embedding of unicode-fonts (which is another pdfsandwich bug here). Therefore, I'm still hesitating to replace ghostscript, since if I replace it with other (and less well known) software, who knows which other problems might occur.

The best outcome for all of us would be for the tesseract guys to fix their pdf bugs.

hmijail - 2016-07-19

Just found this same bug in Mac OS X, using the SVN trunk version of pdfsandwich. The temp, single-page searchable PDF generated by tesseract 3.04.00 seems to contain the correct text when opened in Adobe Acrobat Reader, but the final PDF adds spaces and the text is kinda out of order.

Also, Preview.app only sees spaces in any of the generated PDFs, but that seems to be a Tesseract / Preview problem.

Last edit: hmijail 2016-07-19

hmijail - 2016-07-19

FWIW, I just retried with tesseract 3.04.01, which includes some partial fix for this kind of issue. Now Preview.app is able to see some text, but it's so messed up that it's useless. Worse, even using Adobe Acrobat on the just-after-Tesseract debug files renders a text interspersed with spaces.

hmijail - 2016-07-19

Looks like before Tesseract 3.03, the creation of searchable PDFs used hocr2pdf. Did the PDFs created that way behave better? If so, it would be good to recommend that route on the webpage and in the manual until the Tesseract problems get solved.

Michal Bocek - 2016-08-03

I've been able to overcome this issue by:
1) Converting ppm/pbm to tiff because ppm/pbm format is not able to hold density (resolution) information in its header. That is why tesseract generates, as Tobias put it, "huge page size" - without density information it simply has no way to determine the page size. See [1] for more info.
2) Setting density of the tiff file using mogrify. That way tesseract is able to set correct page size of the resulting pdf - then there's no need to use gs to downscale the pdf.
3) Using pdfunite instead of gs to combine the single pdf pages.

The resulting pdf is now searchable and the OCR text is without spaces between each character.
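
The three steps above can be sketched roughly as follows. This is not the code from the attached pdfsandwich.ml, just an illustration that builds the command lines without running them. The function name `build_commands`, the file names, and the 300 dpi value are assumptions. It also folds steps 1 and 2 into a single ImageMagick `convert` call instead of a separate `mogrify` run, which is a simplification.

```python
from pathlib import Path
from typing import List


def build_commands(page_images: List[str], dpi: int, output_pdf: str) -> List[List[str]]:
    """Build (but do not run) the command lines for a ghostscript-free pipeline."""
    commands = []
    page_pdfs = []
    for image in page_images:
        stem = str(Path(image).with_suffix(""))
        tiff = stem + ".tif"
        # Steps 1 and 2: convert ppm/pbm to tiff and record the resolution in it.
        commands.append(
            ["convert", image, "-units", "PixelsPerInch", "-density", str(dpi), tiff]
        )
        # OCR: tesseract takes an output base name and appends ".pdf" itself.
        commands.append(["tesseract", tiff, stem, "pdf"])
        page_pdfs.append(stem + ".pdf")
    # Step 3: merge the single-page pdfs with pdfunite instead of gs.
    commands.append(["pdfunite", *page_pdfs, output_pdf])
    return commands
```

Each returned list could then be passed to `subprocess.run(cmd, check=True)` in order. The dpi value has to match the resolution at which the page images were actually rendered, otherwise the page size will still be wrong.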

I'm not familiar with the ml language so I've made some quick&dirty changes to pdfsandwich.ml based on the surrounding code. I'm attaching the file so anybody having this issue can try it out.

[1] https://github.com/tesseract-ocr/tesseract/issues/150

pdfsandwich.ml

- Tobias Elze - 2016-08-04
  
  Great, Michal, the tif conversion really seems to do the trick. We don't need mogrify for this, we can use convert for it, which is used anyway by tesseract. I've replaced ghostscript now - it's only needed now if the pdf pages are to be resized (which is an optional command line flag).
  
  I've attached the tentative new version 0.1.5 as a deb package. Could all of you contributing to this thread give me feedback on whether it works? Then I'll officially publish it.
  
  Thanks,
  Tobias
  
  pdfsandwich_0.1.5_amd64.deb
  
Michal Bocek - 2016-08-04

It works nicely for me (trunk version 62), Tobias. Thanks.
I needed to make one change though: checking unpaper version by calling "unpaper --version" instead of "unpaper -version", otherwise I was getting an error:

error: Unknown parameter '-version'.
Try 'unpaper --help' for options.
ERROR: Command "unpaper -version" failed. Terminating pdfsandwich. All temporary files are kept.

I'm using Fedora 24 so maybe it's distribution related. I needed to do this fix even before, as seen in the attachment from my previous comment.

- Tobias Elze - 2016-08-04
  
  Oh, thanks for pointing this out. It seems unpaper -V (capital V) works with all versions of unpaper, regardless of -version or --version. Can you confirm this?
  
  - Michal Bocek - 2016-08-04
    
    Confirmed. "unpaper -V" works on Fedora 24.
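
For reference, a version check that does not depend on a single flag spelling could look like the sketch below. It tries `-V` first and falls back to the other two spellings mentioned in this thread. The helper names are hypothetical and this is not how pdfsandwich itself does it.

```python
import subprocess
from typing import Callable, Optional, Sequence


def _run(cmd: Sequence[str]) -> Optional[str]:
    """Run a command and return its output, or None if it fails or is missing."""
    try:
        result = subprocess.run(list(cmd), capture_output=True, text=True)
    except OSError:
        return None
    if result.returncode != 0:
        return None
    # Some tools print their version on stderr instead of stdout.
    return result.stdout.strip() or result.stderr.strip()


def probe_version(
    program: str,
    flags: Sequence[str] = ("-V", "--version", "-version"),
    run: Callable[[Sequence[str]], Optional[str]] = _run,
) -> Optional[str]:
    """Return the output of the first version flag the program accepts, else None."""
    for flag in flags:
        output = run([program, flag])
        if output:
            return output
    return None
```

Calling `probe_version("unpaper")` returns whatever the first accepted flag prints, or `None` if none of them works or unpaper is not installed.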
    
Tobias Elze - 2016-08-05

status: open --> closed