Re: Converting pdf to text

Posted: Mon Feb 27, 2017 10:21 pm
by jeff_lindqvist
MorkTheFiddle wrote:Unfortunately, when I try this, Google tells me, "Unable to Convert Document", without further explanation.


What kind of PDF is it? A text/word document saved as PDF? An OCR scanned document? Some other kind of file, say, a photographed page, facsimile etc.?

Re: Converting pdf to text

Posted: Mon Feb 27, 2017 10:41 pm
by tommus
MorkTheFiddle wrote:I appreciate the suggestion, Tommus. Unfortunately, when I try this, Google tells me, "Unable to Convert Document", without further explanation.

If the PDF you are trying to convert is the one in the post above, that is a very complex document, with poor type, different size fonts, horizontal lines, footnotes, and not well aligned horizontally, etc. That makes it very difficult for any converter.

I suggest you use an image editor to edit each page by itself. Crop the image to show just the body text (not the headers, the page numbers, the horizontal lines, or the footnotes), just plain text. Then save the image as a new PDF, one page per file. That will probably work much better.

I know it is a lot of work to do it like that, page by page. But I doubt if any converter is going to do it any other way.

Re: Converting pdf to text

Posted: Mon Feb 27, 2017 11:48 pm
by dampingwire
tommus wrote:
I suggest you use an image editor to edit each page by itself. Crop the image to show just the body text (not the headers, the page numbers, the horizontal lines, or the footnotes), just plain text. Then save the image as a new PDF, one page per file. That will probably work much better.

I know it is a lot of work to do it like that, page by page. But I doubt if any converter is going to do it any other way.


Assuming the pages are all roughly the same (or at least all the odd ones are similar and all the even ones are similar), you can probably save yourself a lot of work by using something like ImageMagick to crop every page in one go. You can probably save yourself even more work by doing one page manually and then seeing how that one page converts before putting a heap of effort in.
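
To make the batch-crop idea concrete, here is a rough Python sketch that only builds the ImageMagick command lines, one per page, with a different crop box for odd and even pages. The file names and the pixel boxes are made up for illustration, and it assumes the pages already exist as image files and that ImageMagick's `convert` program is installed (ImageMagick 7 calls it `magick`; on Windows, `convert` may also be an unrelated system tool).

```python
from pathlib import Path


def crop_command(src, dst, width, height, x_off, y_off):
    """Build an ImageMagick command that crops one page image.

    The geometry WxH+X+Y keeps a width-by-height box whose top-left corner
    is x_off pixels from the left and y_off pixels from the top.
    +repage resets the canvas so only the cropped area remains.
    """
    geometry = f"{width}x{height}+{x_off}+{y_off}"
    return ["convert", str(src), "-crop", geometry, "+repage", str(dst)]


def crop_commands(pages, odd_box, even_box, out_dir="cropped"):
    """Build one crop command per page, alternating between two boxes."""
    commands = []
    for number, src in enumerate(pages, start=1):
        box = odd_box if number % 2 else even_box
        dst = Path(out_dir) / Path(src).name
        commands.append(crop_command(src, dst, *box))
    return commands


# Hypothetical boxes as (width, height, x offset, y offset) in pixels.
odd_box = (1200, 1700, 150, 200)
even_box = (1200, 1700, 100, 200)
for cmd in crop_commands(["page_001.png", "page_002.png"], odd_box, even_box):
    print(" ".join(cmd))
    # To run it for real, create the output folder first, then:
    # import subprocess; subprocess.run(cmd, check=True)
```

Trying the printed command on a single page first, as suggested above, is a cheap way to check the box before running it over the whole book.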

It is possible to embed fonts into PDFs. That means that some characters map into a specific glyph so that it looks right on the screen but the encoding is basically "hex-value to picture", which means you have no chance of producing text (at least not without some fancy OCR ...).
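
A toy illustration of that "hex-value to picture" point, in Python. This is not how a real PDF font is stored; the code numbers and the tiny font table are invented purely to show why the page can look right while the extracted text is garbage.

```python
# Toy model only: real PDF fonts are far more involved.
# Assumption: the embedded font assigns its own arbitrary codes to glyphs.
font_glyphs = {0x01: "H", 0x02: "e", 0x03: "l", 0x04: "o"}  # code -> shape drawn

page_codes = [0x01, 0x02, 0x03, 0x03, 0x04]  # what the file stores for one word


def render(codes, glyphs):
    """What a viewer shows: each stored code is looked up in the font."""
    return "".join(glyphs[code] for code in codes)


def naive_extract(codes):
    """What a converter gets if it treats the codes as ordinary character codes."""
    return "".join(chr(code) for code in codes)


print(render(page_codes, font_glyphs))   # readable on screen
print(repr(naive_extract(page_codes)))   # control characters, not text
```

Without a table that maps the codes back to real characters, the only way to recover the word is to look at the drawn shapes, which is what OCR does.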

Re: Converting pdf to text

Posted: Tue Feb 28, 2017 7:32 pm
by MorkTheFiddle
jeff_lindqvist wrote:
MorkTheFiddle wrote:Unfortunately, when I try this, Google tells me, "Unable to Convert Document", without further explanation.


What kind of PDF is it? A text/word document saved as PDF? An OCR scanned document? Some other kind of file, say, a photographed page, facsimile etc.?

Jeff, it is a Google OCR scan, and Tommus has described it accurately: “If the PDF you are trying to convert is the one in the post above, that is a very complex document, with poor type, different size fonts, horizontal lines, footnotes, and not well aligned horizontally, etc. That makes it very difficult for any converter.” And yes, the PDF I am trying to convert is in fact the one in the post above.

Dampingwire may well be correct that the fonts have been embedded and cannot be extracted into legible text.

My goal was to set up the text and the “glossary” side by side on one page. If I do the re-imaging as suggested, I might get a workable sheet. So I’m going to give it a shot and see what happens.

And there is still a chance that Adrianslont can come up with something.

Thanks to all for your input.

Re: Converting pdf to text

Posted: Tue Feb 28, 2017 7:44 pm
by rdearman
If you are going to chop it into images, then I recommend PDFtk (short for PDF Toolkit), https://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/, which will split the PDF into pages and also combine them again later. There is a GUI version and a command-line one.
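
For the command-line version, a minimal Python sketch of the two steps (split, then recombine) might look like this. It only builds the command lines; the file names are hypothetical, and it assumes `pdftk` is installed and on the PATH.

```python
def burst_command(pdf, pattern="page_%03d.pdf"):
    """Build the pdftk command that splits a PDF into one file per page.

    The pattern is a printf-style name, so pages come out as
    page_001.pdf, page_002.pdf, and so on.
    """
    return ["pdftk", pdf, "burst", "output", pattern]


def cat_command(pages, combined):
    """Build the pdftk command that joins page files back into one PDF."""
    return ["pdftk", *pages, "cat", "output", combined]


# Hypothetical file names, for illustration only.
print(" ".join(burst_command("book.pdf")))
print(" ".join(cat_command(["page_001.pdf", "page_002.pdf"], "book_cropped.pdf")))
# To run one for real: import subprocess; subprocess.run(command, check=True)
```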

Re: Converting pdf to text

Posted: Wed Mar 01, 2017 8:53 pm
by MorkTheFiddle
rdearman wrote:If you are going to chop it into images, then I recommend PDFtk (short for PDF Toolkit), https://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/, which will split the PDF into pages and also combine them again later. There is a GUI version and a command-line one.

Thanks for suggesting this handy tool.
However, if I decide to continue with this project, I'm going to have the pdf typed, either by me (unlikely) or a service. A preliminary surf found prices from 50 cents a page to $2 per page. I don't need all the pages, only 75 of them. The Greek could throw a clinker in the works, but the Greek does not have to be typed, merely indicated in some way.
I'll report back on how it goes.
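
For reference, the quoted price range works out as follows for 75 pages (a quick Python calculation; the rates are the two ends of the range mentioned above, nothing more precise).

```python
pages = 75
low_rate, high_rate = 0.50, 2.00  # dollars per page, the two ends of the quoted range

low_total = pages * low_rate
high_total = pages * high_rate
print(f"${low_total:.2f} to ${high_total:.2f}")  # $37.50 to $150.00
```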

Re: Converting pdf to text

Posted: Thu Mar 02, 2017 12:13 pm
by Doitsujin
tommus wrote:I suggest you use an image editor to edit each page by itself. Crop the image to show just the body text (not the headers, the page numbers, the horizontal lines, or the footnotes), just plain text. Then save the image as a new PDF, one page per file. That will probably work much better.
I know it is a lot of work to do it like that, page by page. But I doubt if any converter is going to do it any other way.

Have you ever tried ScanTailor (freeware)? It'll do many of these steps automatically with limited user input.
The only downside is that the PDF needs to be converted to image files first.
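
One possible way to do that PDF-to-image step is ImageMagick, which is mentioned elsewhere in this thread. The Python sketch below only builds the command line; the file names and the 300 dpi value are assumptions, and ImageMagick needs Ghostscript installed to read PDFs.

```python
def pdf_to_images_command(pdf, pattern="page_%03d.png", dpi=300):
    """Build an ImageMagick command that renders each PDF page to an image.

    -density comes before the input file so the PDF is rasterized at that
    resolution. The printf-style pattern numbers the output files.
    """
    return ["convert", "-density", str(dpi), pdf, pattern]


# Hypothetical input file, for illustration only.
print(" ".join(pdf_to_images_command("book.pdf")))
# To run it for real: import subprocess; subprocess.run(command, check=True)
```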

Re: Converting pdf to text

Posted: Thu Mar 02, 2017 12:57 pm
by tommus
Doitsujin wrote:Have you ever tried ScanTailor (freeware)? It'll do many of these steps automatically with limited user input. The only downside is that the pdf needs to be converted to image files first.

No. I have never heard of ScanTailor, but it looks very interesting. If it works well, then it will surely save a lot of time. My scanner will save in either PDF or JPG, so I can scan books, for example, directly into images and skip that PDF-to-image conversion, although that is not too time-consuming even for material already in PDF.

I will give it a try right away. There goes quite a bit of my L2 study time for this morning. That is the problem. There are so many interesting ways to accumulate good second language study material that there is very little time left to actually study it.

Re: Converting pdf to text

Posted: Thu Mar 02, 2017 3:30 pm
by tommus
OK. I just tried ScanTailor and I am impressed. It worked very well.

First, I scanned the first chapter of a Dutch book, two pages at a time, saving the pages as JPG. I then ran ScanTailor. The interface is a bit unusual but logical once you work through it for the first time. It has a lot of automation, with optional manual intervention if required (such as minor adjustments). My book has large pages with good layout and fonts, so the "auto" worked. ScanTailor will rotate the two-page scans, select the divide between the two pages and determine the text area automatically. One disappointment was that the output was TIF-format, not PDF. So you have to convert the TIFs to PDFs. There are lots of online apps for that. I used my offline IrfanView (which is free and excellent).

Then I completed the conversion to plain text using the process I described above, by dragging the 9 PDFs into Google Drive, and then opening them in Google Docs.

Excellent results.

Thanks to Doitsujin for recommending ScanTailor.

Re: Converting pdf to text

Posted: Thu Mar 02, 2017 5:01 pm
by rdearman
tommus wrote:OK. I just tried ScanTailor and I am impressed. It worked very well.

First, I scanned the first chapter of a Dutch book, two pages at a time, saving the pages as JPG. I then ran ScanTailor. The interface is a bit unusual but logical once you work through it for the first time. It has a lot of automation, with optional manual intervention if required (such as minor adjustments). My book has large pages with good layout and fonts, so the "auto" worked. ScanTailor will rotate the two-page scans, select the divide between the two pages and determine the text area automatically. One disappointment was that the output was TIF-format, not PDF. So you have to convert the TIFs to PDFs. There are lots of online apps for that. I used my offline IrfanView (which is free and excellent).

Then I completed the conversion to plain text using the process I described above, by dragging the 9 PDFs into Google Drive, and then opening them in Google Docs.

Excellent results.

Thanks to Doitsujin for recommending ScanTailor.


You can use ImageMagick for the conversion and do them en masse. http://www.imagemagick.org/script/index.php
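
A rough Python sketch of that bulk TIF-to-PDF conversion, building one ImageMagick command per file. The file names are made up, and it assumes ImageMagick's `convert` is installed (`magick` in ImageMagick 7).

```python
from pathlib import Path


def tif_to_pdf_commands(tif_names):
    """Build one ImageMagick command per TIF, writing a PDF with the same name."""
    commands = []
    for name in tif_names:
        pdf_name = str(Path(name).with_suffix(".pdf"))
        commands.append(["convert", name, pdf_name])
    return commands


# Hypothetical ScanTailor output files, for illustration only.
for cmd in tif_to_pdf_commands(["page_001.tif", "page_002.tif"]):
    print(" ".join(cmd))
    # To run it for real: import subprocess; subprocess.run(cmd, check=True)
```

For a whole folder, the list of names could come from something like `sorted(str(p) for p in Path("out").glob("*.tif"))`, where `out` is whatever folder ScanTailor wrote to.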