spaces between almost every character, somewhat scrambled select order
I initially reported this issue at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=802233#5 in case it was specific to Debian.
Tobias told me how to make a debug log, so I'm attaching that here.
The command I executed: pdfsandwich -debug -verbose -layout double galton.pdf > pdfsandwich.log 2>&1
Oh, then I ran: pdftotext galton_ocr.pdf
which produced galton_ocr.txt (attached above).
As you can see, there are spaces between almost every character, and much of the text is out of order.
The big stretch at the beginning that has one character on each line consists of most of the letters from the second line of the original's left column, but in reverse order.
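For anyone who wants to put a number on the symptom instead of eyeballing the pdftotext output, here is a rough Python sketch. The function name and the idea of counting single-character tokens are assumptions made up for illustration; neither pdfsandwich nor pdftotext provides anything like this.

```python
def single_char_ratio(text: str) -> float:
    """Return the fraction of whitespace-separated tokens that are one character long.

    Text with a space between almost every character scores close to 1.0,
    while ordinary prose scores much lower.
    """
    tokens = text.split()
    if not tokens:
        return 0.0
    single = sum(1 for token in tokens if len(token) == 1)
    return single / len(tokens)


# Hypothetical usage on the attached file:
#   with open("galton_ocr.txt", encoding="utf-8", errors="replace") as fh:
#       print(single_char_ratio(fh.read()))
```

Comparing the value for the final PDF's text against the value for the temporary tesseract PDF's text would show which processing step introduces the spaces.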
Indeed, there's something wrong. You should have one temp file on your disk now: /tmp/pdfsandwichdaf373.pdf
This is the file generated by tesseract before ghostscript becomes active. Could you open that file and check if it has the same problems?
Ooo! The tmp file appears to have the text in the correct order, without extraneous spaces! Attached.
So it's ghostscript that messes things up. Well, maybe that's not a fair statement, as there's a discussion on the ghostscript bug report page saying that tesseract produces broken pdfs which look more or less alright until you do any further processing with them (such as with ghostscript).
One reason why I use ghostscript is to downscale tesseract's pdf to the original paper size - as you can see, the temporary file has an extraordinarily large paper size, way larger than the original pdf. Clearly a tesseract bug, and the reason why it's not so easy to omit ghostscript at the moment. I'll think of some other solution.
I wonder if the change suggested in bug #9, "Use less ghostscript", would fix this bug.
I'm happy to do further testing if you tell me exactly what to do :)
Yes, the "use less ghostscript" bug would indeed fit - or rather the "use no ghostscript" bug :)
On the other hand, it is true that tesseract produces broken pdfs in two ways: one issue is the huge page size, the other is the way it embeds Unicode fonts (which is another pdfsandwich bug here). Therefore, I'm still hesitant to replace ghostscript: if I replace it with other (and less well-known) software, who knows which other problems might occur.
The best outcome for all of us would be for the tesseract guys to fix their pdf bugs.
Just found this same bug in Mac OS X, using the SVN trunk version of pdfsandwich. The temp, single-page searchable PDF generated by tesseract 3.04.00 seems to contain the correct text when opened in Adobe Acrobat Reader, but the final PDF adds spaces and the text is kinda out of order.
Also, Preview.app only sees spaces in any of the generated PDFs, but that seems to be a Tesseract / Preview problem.
Last edit: hmijail 2016-07-19
FWIW, I just retried with tesseract 3.04.01, which includes some partial fix for this kind of issue. Now Preview.app is able to see some text, but it's so messed up that it's useless. Worse, even using Adobe Acrobat on the just-after-Tesseract debug files renders a text interspersed with spaces.
It looks like the creation of searchable PDFs used hocr2pdf before Tesseract 3.03. Did the PDFs created that way behave better? If so, a hint on the webpage and in the manual to use that route would be good until the Tesseract problems get solved.
I've been able to overcome this issue by:
1) Converting ppm/pbm to tiff because ppm/pbm format is not able to hold density (resolution) information in its header. That is why tesseract generates, as Tobias put it, "huge page size" - without density information it simply has no way to determine the page size. See [1] for more info.
2) Setting density of the tiff file using mogrify. That way tesseract is able to set correct page size of the resulting pdf - then there's no need to use gs to downscale the pdf.
3) Using pdfunite instead of gs to combine the single pdf pages.
The resulting pdf is now searchable and the OCR text is without spaces between each character.
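The three steps above can be sketched in Python as follows. This is only an illustration under assumptions, not the code from the attached pdfsandwich.ml: the function names, the file names, and the 300 dpi value are made up here, and convert stands in for mogrify. The first function shows the arithmetic behind the page size problem: a PDF page is measured in points (72 per inch), so the same pixel dimensions give a much larger page when the resolution is missing or too low.

```python
from typing import List, Sequence, Tuple

POINTS_PER_INCH = 72.0  # PDF page sizes are expressed in points


def page_size_points(width_px: int, height_px: int, dpi: float) -> Tuple[float, float]:
    """Page size in points for an image of the given pixel size and resolution."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return (width_px * POINTS_PER_INCH / dpi, height_px * POINTS_PER_INCH / dpi)


def to_tiff_cmd(src: str, dst: str, dpi: int) -> List[str]:
    """Steps 1 and 2: convert ppm/pbm to tiff and record the density in the tiff."""
    return ["convert", src, "-units", "PixelsPerInch", "-density", str(dpi), dst]


def ocr_cmd(tiff: str, out_base: str) -> List[str]:
    """Let tesseract write <out_base>.pdf, a searchable single-page pdf."""
    return ["tesseract", tiff, out_base, "pdf"]


def unite_cmd(pages: Sequence[str], output: str) -> List[str]:
    """Step 3: combine the single-page pdfs with pdfunite instead of gs."""
    return ["pdfunite", *pages, output]
```

Each returned list can be passed to subprocess.run(cmd, check=True). The dpi given to to_tiff_cmd has to match the resolution at which the page was rasterized, otherwise the resulting page size will still be wrong.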
I'm not familiar with OCaml (the language of the .ml file), so I've made some quick and dirty changes to pdfsandwich.ml based on the surrounding code. I'm attaching the file so anybody having this issue can try it out.
[1] https://github.com/tesseract-ocr/tesseract/issues/150
Great, Michal, the tif conversion really seems to do the trick. We don't need mogrify for this, we can use convert for it, which is used anyway by tesseract. I've replaced ghostscript now - it's only needed now if the pdf pages are to be resized (which is an optional command line flag).
I've attached the tentative new version 1.5 as a deb package. All of you contributing to this thread, could you give me feedback if that works? Then I'll officially publish it.
Thanks,
Tobias
It works nicely for me (trunk version 62), Tobias. Thanks.
I needed to make one change though: checking unpaper version by calling "unpaper --version" instead of "unpaper -version", otherwise I was getting an error:
error: Unknown parameter '-version'.
Try 'unpaper --help' for options.
ERROR: Command "unpaper -version" failed. Terminating pdfsandwich. All temporary files are kept.
I'm using Fedora 24 so maybe it's distribution related. I needed to do this fix even before, as seen in the attachment from my previous comment.
Oh, thanks for pointing this out. It seems unpaper -V (capital V) works with all versions of unpaper, regardless of -version or --version. Can you confirm this?
Confirmed. "unpaper -V" works on Fedora 24.
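For reference, a more defensive way to handle this kind of difference between versions is to try several spellings of the flag and use the first one that succeeds. The Python sketch below is only an illustration: the helper names and the order of the candidates are assumptions, pdfsandwich itself is written in OCaml, and whether a given unpaper build exits with status 0 for each flag is not something this thread establishes.

```python
import subprocess
from typing import Callable, Optional, Sequence


def exits_cleanly(argv: Sequence[str]) -> bool:
    """Run a command quietly and report whether it exited with status 0."""
    try:
        result = subprocess.run(
            list(argv), stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    except OSError:
        # The program is missing or cannot be executed.
        return False
    return result.returncode == 0


def find_version_flag(
    program: str,
    candidates: Sequence[str] = ("-V", "--version", "-version"),
    runner: Callable[[Sequence[str]], bool] = exits_cleanly,
) -> Optional[str]:
    """Return the first candidate flag that the program accepts, or None."""
    for flag in candidates:
        if runner([program, flag]):
            return flag
    return None
```

Calling find_version_flag("unpaper") would then return whichever flag the installed unpaper accepts first, or None if unpaper is not installed or rejects all of them.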