Converting pdf to text

jeff_lindqvist
Blue Belt
Posts: 722
Joined: Sun Aug 16, 2015 9:52 pm
Languages: sv, en
de, es
ga, eo
---
fi, yue, ro, tp, cy, kw, pt, sk
Language Log: viewtopic.php?f=15&t=2773
x 1068

Re: Converting pdf to text

Postby jeff_lindqvist » Mon Feb 27, 2017 10:21 pm

MorkTheFiddle wrote:Unfortunately, when I try this, Google tells me, "Unable to Convert Document", without further explanation.


What kind of PDF is it? A text/word document saved as PDF? An OCR scanned document? Some other kind of file, say, a photographed page, facsimile etc.?

tommus
Green Belt
Posts: 280
Joined: Sat Jul 04, 2015 3:59 pm
Location: Kingston, ON, Canada
Languages: English (N), French (B2), Dutch (B2), German (A1), Spanish (A1), Esperanto (A1)
x 454

Re: Converting pdf to text

Postby tommus » Mon Feb 27, 2017 10:41 pm

MorkTheFiddle wrote:I appreciate the suggestion, Tommus. Unfortunately, when I try this, Google tells me, "Unable to Convert Document", without further explanation.

If the PDF you are trying to convert is the one in the post above, that is a very complex document, with poor type, different size fonts, horizontal lines, footnotes, and not well aligned horizontally, etc. That makes it very difficult for any converter.

I suggest you use an image editor to edit each page by itself. Crop the image to show just the body text (not the headers, not the page numbers, not the horizontal lines, not the footnotes), just plain text. Then store the image as a new PDF, one page per file. That will probably work much better.

I know it is a lot of work to do it like that, page by page. But I doubt any converter is going to do it any other way.

dampingwire
Green Belt
Posts: 250
Joined: Tue Aug 04, 2015 8:11 pm
Location: Abingdon, UK
Languages: Italian (N), English (N), French (poor, not studying), Japanese (studying, JLPT N3)
x 162

Re: Converting pdf to text

Postby dampingwire » Mon Feb 27, 2017 11:48 pm

tommus wrote:I suggest you use an image editor to edit each page by itself. Crop the image to show just the body text (not the headers, not the page numbers, not the horizontal lines, not the footnotes), just plain text. Then store the image as a new PDF, one page per file. That will probably work much better.

I know it is a lot of work to do it like that, page by page. But I doubt any converter is going to do it any other way.


Assuming the pages are all roughly the same (or at least all the odd ones are similar and all the even ones are similar), you can probably save yourself a lot of work by using something like ImageMagick to crop every page in one go. You can probably save yourself even more work by cropping one page manually and seeing how that one page converts before putting a heap of effort in.
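
A minimal sketch of that batch crop, assuming the pages have already been exported as PNG files and that ImageMagick's `convert` command is installed (on ImageMagick 7 the command is `magick`). The crop boxes and the function names are hypothetical; you would measure the real boxes on one odd and one even page first.

```python
import subprocess
from pathlib import Path
from typing import List

# Hypothetical crop boxes: WIDTHxHEIGHT+X_OFFSET+Y_OFFSET, in pixels.
# Measure these on one odd page and one even page of your own scan.
ODD_GEOMETRY = "1400x2000+250+300"
EVEN_GEOMETRY = "1400x2000+150+300"


def crop_command(src: Path, dst: Path, page_number: int) -> List[str]:
    """Build the ImageMagick command that crops one page image."""
    geometry = ODD_GEOMETRY if page_number % 2 else EVEN_GEOMETRY
    # -crop cuts out the box; +repage resets the canvas to the cropped size.
    return ["convert", str(src), "-crop", geometry, "+repage", str(dst)]


def crop_all(src_dir: Path, dst_dir: Path, run: bool = False) -> List[List[str]]:
    """Build (and optionally run) one crop command per PNG, in filename order."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    commands = []
    for number, src in enumerate(sorted(src_dir.glob("*.png")), start=1):
        cmd = crop_command(src, dst_dir / src.name, number)
        commands.append(cmd)
        if run:
            subprocess.run(cmd, check=True)
    return commands
```

Calling `crop_all(Path("pages"), Path("cropped"))` only lists the commands, so you can check one page by hand before passing `run=True`.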

It is possible to embed fonts into PDFs, and an embedded font can use its own custom encoding. In that case each character code maps to a specific glyph, so the page looks right on the screen, but the encoding is basically "hex-value to picture". Unless the file also records which character each code stands for, you have no chance of producing text (at least not without some fancy OCR ...).
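
A toy illustration of that point, not real PDF parsing: the codes, glyph names and tables below are all made up. In an actual PDF the code-to-character table is the optional ToUnicode map; when it is missing, an extractor has nothing to go on.

```python
# Toy model only: real PDF fonts are far more involved than two dictionaries.

# What the renderer needs: character code -> glyph (made-up glyph names).
CODE_TO_GLYPH = {0x01: "glyph_17", 0x02: "glyph_42", 0x03: "glyph_08"}

# What a text extractor needs: character code -> character.
# A PDF is not required to carry this table.
CODE_TO_CHAR = {0x01: "c", 0x02: "a", 0x03: "t"}


def render(codes):
    """Drawing the page only needs the glyph table, so it always works."""
    return [CODE_TO_GLYPH[code] for code in codes]


def extract_text(codes, code_to_char=None):
    """Return the text for a run of codes, with '?' where the code is unknown."""
    table = code_to_char or {}
    return "".join(table.get(code, "?") for code in codes)


page_codes = [0x01, 0x02, 0x03]
print(render(page_codes))                      # the page still draws correctly
print(extract_text(page_codes, CODE_TO_CHAR))  # cat
print(extract_text(page_codes))                # ???
```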

MorkTheFiddle
Green Belt
Posts: 270
Joined: Sat Jul 18, 2015 8:59 pm
Location: Texas, USA
Languages: English (N), French (read fluently), Spanish (read fluently), Ancient Greek (abandoned), Latin (abandoned). Once studied Old Norse.
Language Log: http://tinyurl.com/zcx4ogt
x 276

Re: Converting pdf to text

Postby MorkTheFiddle » Tue Feb 28, 2017 7:32 pm

jeff_lindqvist wrote:
MorkTheFiddle wrote:Unfortunately, when I try this, Google tells me, "Unable to Convert Document", without further explanation.


What kind of PDF is it? A text/word document saved as PDF? An OCR scanned document? Some other kind of file, say, a photographed page, facsimile etc.?

Jeff, it is a Google OCR scan, and Tommus has described it accurately: “If the PDF you are trying to convert is the one in the post above, that is a very complex document, with poor type, different size fonts, horizontal lines, footnotes, and not well aligned horizontally, etc. That makes it very difficult for any converter.” And yes, the PDF I am trying to convert is in fact the one in the post above.

Dampingwire may well be correct that the fonts have been embedded and cannot be extracted into legible text.

My goal was to set up the text and the “glossary” side by side on one page. If I do the re-imaging as suggested, I might get a workable sheet. So I’m going to give it a shot and see what happens.

And there is still a chance that Adrianslont can come up with something.

Thanks to all for your input.

rdearman
Site Admin
Posts: 2075
Joined: Thu May 14, 2015 4:18 pm
Location: United Kingdom
Languages: English (N)
French (studies), Italian (studies), Mandarin (studies),
Esperanto TAC (Only god knows why), Finnish (only in it for the cookies)
Language Log: viewtopic.php?f=15&t=1836
x 3931

Re: Converting pdf to text

Postby rdearman » Tue Feb 28, 2017 7:44 pm

If you are going to chop it into images then I recommend PDFTK (stands for PDF toolkit) https://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/ which will chop the PDF into pages, and also combine them back later. There is a GUI version, and a command-line one.
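
A minimal sketch of driving the command-line version from Python, assuming `pdftk` is installed and on your PATH. The file names and function names are hypothetical; `burst` and `cat` are pdftk's own operations for splitting and joining.

```python
import subprocess
from typing import List


def burst_command(pdf: str, pattern: str = "page_%03d.pdf") -> List[str]:
    """Split a PDF into one file per page; the pattern is printf-style."""
    return ["pdftk", pdf, "burst", "output", pattern]


def combine_command(pages: List[str], output: str) -> List[str]:
    """Join the given PDFs, in the order listed, into a single file."""
    return ["pdftk", *pages, "cat", "output", output]


def run(command: List[str]) -> None:
    """Execute a command and raise if pdftk reports an error."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical file names, shown for illustration only.
    print(burst_command("book.pdf"))
    print(combine_command(["page_001.pdf", "page_002.pdf"], "rejoined.pdf"))
```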

Re: Converting pdf to text

Postby MorkTheFiddle » Wed Mar 01, 2017 8:53 pm

rdearman wrote:If you are going to chop it into images then I recommend PDFTK (stands for PDF toolkit) https://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/ which will chop the PDF into pages, and also combine them back later. There is a GUI version, and a command-line one.

Thanks for suggesting this handy tool.
However, if I decide to continue with this project, I'm going to have the pdf typed, either by me (unlikely) or a service. A preliminary surf found prices from 50 cents a page to $2 per page. I don't need all the pages, only 75 of them. The Greek could throw a clinker in the works, but the Greek does not have to be typed, merely indicated in some way.
I'll report back on how it goes.
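
For a rough budget, the figures above (75 pages at 50 cents to $2 per page) work out as follows. This is only the arithmetic; the function name is made up and no particular typing service is assumed.

```python
PAGES = 75            # pages that actually need typing
LOW_PER_PAGE = 0.50   # cheapest quoted rate, in dollars
HIGH_PER_PAGE = 2.00  # most expensive quoted rate, in dollars


def cost_range(pages, low_rate, high_rate):
    """Return the (lowest, highest) total cost for the given page count."""
    return pages * low_rate, pages * high_rate


low, high = cost_range(PAGES, LOW_PER_PAGE, HIGH_PER_PAGE)
print(f"${low:.2f} to ${high:.2f}")  # $37.50 to $150.00
```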

Doitsujin
Yellow Belt
Posts: 76
Joined: Sat Jul 18, 2015 6:21 pm
Languages: German (N)
x 117

Re: Converting pdf to text

Postby Doitsujin » Thu Mar 02, 2017 12:13 pm

tommus wrote:I suggest you use an image editor to edit each page by itself. Crop the image to show just the body text (not the headers, not the page numbers, not the horizontal lines, not the footnotes), just plain text. Then store the image as a new PDF, one page per file. That will probably work much better.
I know it is a lot of work to do it like that, page by page. But I doubt any converter is going to do it any other way.

Have you ever tried ScanTailor (freeware)? It'll do many of these steps automatically with limited user input.
The only downside is that the PDF needs to be converted to image files first.
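
One way to do that PDF-to-image step is ImageMagick, sketched below. This is an assumption rather than anything ScanTailor requires: it presumes `convert` is installed (`magick` on ImageMagick 7) together with Ghostscript, which ImageMagick uses to read PDFs. The file names, the 300 dpi default and the function name are hypothetical.

```python
import subprocess
from typing import List


def pdf_to_images_command(pdf: str,
                          out_pattern: str = "page-%03d.png",
                          dpi: int = 300) -> List[str]:
    """Build the command that writes one PNG per PDF page."""
    # -density comes before the input file so it is applied while the PDF
    # is being rasterized; a higher value gives larger, sharper images.
    return ["convert", "-density", str(dpi), pdf, out_pattern]


def pdf_to_images(pdf: str) -> None:
    """Run the conversion and raise if ImageMagick reports an error."""
    subprocess.run(pdf_to_images_command(pdf), check=True)
```

The resulting `page-000.png`, `page-001.png`, ... files can then be loaded into ScanTailor as a batch.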

Re: Converting pdf to text

Postby tommus » Thu Mar 02, 2017 12:57 pm

Doitsujin wrote:Have you ever tried ScanTailor (freeware)? It'll do many of these steps automatically with limited user input. The only downside is that the pdf needs to be converted to image files first.

No, I have never heard of ScanTailor, but it looks very interesting. If it works well, it will surely save a lot of time. My scanner will save in either PDF or JPG, so I can scan books, for example, directly into images and skip that PDF-to-image conversion, although that is not too time-consuming even for material already in PDF.

I will give it a try right away. There goes quite a bit of my L2 study time for this morning. That is the problem: there are so many interesting ways to accumulate good second-language study material that there is very little time left to actually study it.

Re: Converting pdf to text

Postby tommus » Thu Mar 02, 2017 3:30 pm

OK. I just tried ScanTailor and I am impressed. It worked very well.

First, I scanned the first chapter of a Dutch book, two pages at a time, saving the scans as JPG. I then ran ScanTailor. The interface is a bit unusual but logical once you have worked through it for the first time. It has a lot of automation, with optional manual intervention if required (such as minor adjustments). My book has large pages with good layout and fonts, so the "auto" settings worked. ScanTailor will rotate the two-page scans, find the divide between the two pages and determine the text area automatically. One disappointment was that the output was in TIF format, not PDF, so you have to convert the TIFs to PDFs. There are lots of online apps for that. I used my offline IrfanView (which is free and excellent).

Then I completed the conversion to plain text using the process I described above, by dragging the 9 PDFs into Google Drive, and then opening them in Google Docs.

Excellent results.

Thanks to Doitsujin for recommending ScanTailor.

Re: Converting pdf to text

Postby rdearman » Thu Mar 02, 2017 5:01 pm

tommus wrote:OK. I just tried ScanTailor and I am impressed. It worked very well.

You can use ImageMagick for the conversion and do them all in bulk. http://www.imagemagick.org/script/index.php
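
A minimal sketch of that bulk TIF-to-PDF conversion, assuming ImageMagick's `convert` command is installed (`magick` on ImageMagick 7) and that ScanTailor's output sits in one folder. The folder name `out` and the function names are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import Iterable, List


def tif_to_pdf_commands(tif_names: Iterable[str]) -> List[List[str]]:
    """Build one command per TIF, writing a PDF with the same base name."""
    commands = []
    for name in sorted(tif_names):
        pdf_name = str(Path(name).with_suffix(".pdf"))
        commands.append(["convert", name, pdf_name])
    return commands


def convert_folder(folder: str = "out") -> None:
    """Convert every .tif in the folder, one PDF per page image."""
    names = [str(path) for path in Path(folder).glob("*.tif")]
    for command in tif_to_pdf_commands(names):
        subprocess.run(command, check=True)
```

This keeps one PDF per page, which matches the one-page-per-file approach used with Google Drive earlier in the thread.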