Converting pdf to text

jeff_lindqvist
Black Belt - 3rd Dan
Posts: 3135
Joined: Sun Aug 16, 2015 9:52 pm
Languages: sv, en
de, es
ga, eo
---
fi, yue, ro, tp, cy, kw, pt, sk
Language Log: viewtopic.php?f=15&t=2773
x 10462

Re: Converting pdf to text

Postby jeff_lindqvist » Mon Feb 27, 2017 10:21 pm

MorkTheFiddle wrote:Unfortunately, when I try this, Google tells me, "Unable to Convert Document", without further explanation.


What kind of PDF is it? A text/word document saved as PDF? An OCR scanned document? Some other kind of file, say, a photographed page, facsimile etc.?

tommus
Blue Belt
Posts: 957
Joined: Sat Jul 04, 2015 3:59 pm
Location: Kingston, ON, Canada
Languages: English (N), French (B2), Dutch (B2)
x 1937

Re: Converting pdf to text

Postby tommus » Mon Feb 27, 2017 10:41 pm

MorkTheFiddle wrote:I appreciate the suggestion, Tommus. Unfortunately, when I try this, Google tells me, "Unable to Convert Document", without further explanation.

If the PDF you are trying to convert is the one in the post above, that is a very complex document, with poor type, different size fonts, horizontal lines, footnotes, and not well aligned horizontally, etc. That makes it very difficult for any converter.

I suggest you use an image editor to edit each page by itself. Crop the image to show just the body text (not the headers, not the page numbers, not the horizontal lines, not the footnotes). Just plain text. Then store the image as a new PDF, one page per file. That will probably work much better.

I know it is a lot of work to do it like that, page by page. But I doubt if any converter is going to do it any other way.

dampingwire
Blue Belt
Posts: 559
Joined: Tue Aug 04, 2015 8:11 pm
Location: Abingdon, UK
Languages: Italian (N), English (N), French (poor, not studying), Japanese (studying, JLPT N3)
x 609

Re: Converting pdf to text

Postby dampingwire » Mon Feb 27, 2017 11:48 pm

tommus wrote:I suggest you use an image editor to edit each page by itself. Crop the image to show just the body text (not the headers, not the page numbers, not the horizontal lines, not the footnotes). Just plain text. Then store the image as a new PDF, one page per file. That will probably work much better.

I know it is a lot of work to do it like that, page by page. But I doubt if any converter is going to do it any other way.


Assuming the pages are all roughly the same (or at least all the odd ones are similar and all the even ones are similar), you can probably save yourself a lot of work by using something like ImageMagick to crop all the pages in one go. You can probably save yourself even more work by doing one page manually and seeing how that one page converts before putting a heap of effort in.
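A minimal sketch of that batch crop, assuming the pages have already been exported as PNG files and that ImageMagick is installed. The crop boxes and the function names are made up for illustration; measure the real boxes on one odd and one even page first.

```python
from __future__ import annotations

import subprocess
from pathlib import Path

# Hypothetical crop boxes (width, height, x offset, y offset) in pixels.
# Odd and even pages get separate boxes because their margins usually differ.
ODD_BOX = (1400, 1900, 180, 220)
EVEN_BOX = (1400, 1900, 120, 220)


def crop_command(src: Path, dst: Path, page_number: int) -> list[str]:
    """Build the ImageMagick command that crops one page image."""
    width, height, x, y = ODD_BOX if page_number % 2 else EVEN_BOX
    geometry = f"{width}x{height}+{x}+{y}"
    # "convert" is the ImageMagick 6 entry point; ImageMagick 7 uses "magick".
    # "+repage" discards the old canvas offset after cropping.
    return ["convert", str(src), "-crop", geometry, "+repage", str(dst)]


def crop_all(src_dir: Path, dst_dir: Path) -> None:
    """Crop every PNG in src_dir, treating sorted file order as page order."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    pages = sorted(src_dir.glob("*.png"))
    for page_number, src in enumerate(pages, start=1):
        subprocess.run(crop_command(src, dst_dir / src.name, page_number), check=True)
```

Calling `crop_all(Path("pages"), Path("cropped"))` would process a whole folder; trying `crop_command` on a single page first matches the advice to test one page before doing the rest.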

It is also possible to embed fonts in PDFs. In that case characters can be mapped to specific glyphs, so the page looks right on the screen, but the encoding is basically "hex value to picture". If that is what has happened here, you have no chance of extracting text directly (at least not without some fancy OCR ...).
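The "hex value to picture" problem can be shown with a toy model. This is not how a real PDF library works; the codes, glyph names, and functions below are all invented to illustrate why a page can display correctly and still yield no usable text.

```python
from __future__ import annotations

# The page content stores character codes, not letters.
PAGE_CODES = [0x01, 0x02, 0x03]

# The embedded font maps each code to a glyph (a picture). Invented names.
GLYPHS = {0x01: "shape-of-alpha", 0x02: "shape-of-beta", 0x03: "shape-of-gamma"}

# An optional code-to-Unicode table. When a file lacks one, a viewer can
# still draw the glyphs, but an extractor has nothing to translate with.
TO_UNICODE = {0x01: "α", 0x02: "β", 0x03: "γ"}


def render(codes: list[int]) -> list[str]:
    """Drawing only needs the code-to-glyph table."""
    return [GLYPHS[code] for code in codes]


def extract_text(codes: list[int], to_unicode: dict[int, str] | None = None) -> str:
    """Extraction needs a code-to-Unicode table; without it we get placeholders."""
    if to_unicode is None:
        return "\ufffd" * len(codes)
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)
```

Here `render(PAGE_CODES)` works either way, while `extract_text(PAGE_CODES)` without the table returns only replacement characters, which is the situation where OCR on the rendered image becomes the fallback.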

MorkTheFiddle
Black Belt - 2nd Dan
Posts: 2114
Joined: Sat Jul 18, 2015 8:59 pm
Location: North Texas USA
Languages: English (N). Read (only) French and Spanish. Studying Ancient Greek. Studying a bit of Latin. Once studied Old Norse. Dabbled in Catalan, Provençal and Italian.
Language Log: https://forum.language-learners.org/vie ... 11#p133911
x 4824

Re: Converting pdf to text

Postby MorkTheFiddle » Tue Feb 28, 2017 7:32 pm

jeff_lindqvist wrote:
MorkTheFiddle wrote:Unfortunately, when I try this, Google tells me, "Unable to Convert Document", without further explanation.


What kind of PDF is it? A text/word document saved as PDF? An OCR scanned document? Some other kind of file, say, a photographed page, facsimile etc.?

Jeff, it is a Google OCR scan, and Tommus has described it accurately: “If the PDF you are trying to convert is the one in the post above, that is a very complex document, with poor type, different size fonts, horizontal lines, footnotes, and not well aligned horizontally, etc. That makes it very difficult for any converter.” And yes, the PDF I am trying to convert is in fact the one in the post above.

Dampingwire may well be correct that the fonts have been embedded and cannot be extracted into legible text.

My goal was to set up the text and the “glossary” side by side on one page. If I do the re-imaging as suggested, I might get a workable sheet. So I’m going to give it a shot and see what happens.

And there is still a chance that Adrianslont can come up with something.

Thanks to all for your input.

rdearman
Site Admin
Posts: 7231
Joined: Thu May 14, 2015 4:18 pm
Location: United Kingdom
Languages: English (N)
Language Log: viewtopic.php?f=15&t=1836
x 23127

Re: Converting pdf to text

Postby rdearman » Tue Feb 28, 2017 7:44 pm

If you are going to chop it into images then I recommend PDFTK (stands for PDF toolkit) https://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/ which will chop the PDF into pages, and also combine them back later. There is a GUI version and a command-line one.
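For the command-line version, a minimal sketch of the split and rejoin steps, assuming `pdftk` is on the PATH. The helper names and the output file pattern are illustrative choices, not anything prescribed by the tool.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def burst_command(pdf: Path, out_dir: Path) -> list[str]:
    """Command that splits a PDF into one file per page."""
    # pdftk fills the printf-style pattern with the page number.
    pattern = out_dir / "page_%04d.pdf"
    return ["pdftk", str(pdf), "burst", "output", str(pattern)]


def join_command(pages: list[Path], combined: Path) -> list[str]:
    """Command that concatenates single-page PDFs back into one file."""
    # Sorting keeps page order, given the zero-padded names from burst_command.
    return ["pdftk", *[str(page) for page in sorted(pages)], "cat", "output", str(combined)]


def split_pdf(pdf: Path, out_dir: Path) -> None:
    """Run the split, creating the output folder first."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(burst_command(pdf, out_dir), check=True)
```

After editing the individual pages, `subprocess.run(join_command(pages, Path("combined.pdf")), check=True)` would stitch them together again.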


Re: Converting pdf to text

Postby MorkTheFiddle » Wed Mar 01, 2017 8:53 pm

rdearman wrote:If you are going to chop it into images then I recommend PDFTK (stands for PDF toolkit) https://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/ which will chop the PDF into pages, and also combine them back later. There is a GUI version and a command-line one.

Thanks for suggesting this handy tool.
However, if I decide to continue with this project, I'm going to have the pdf typed, either by me (unlikely) or a service. A preliminary surf found prices from 50 cents a page to $2 per page. I don't need all the pages, only 75 of them. The Greek could throw a clinker in the works, but the Greek does not have to be typed, merely indicated in some way.
I'll report back on how it goes.
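As a quick check of what those per-page prices mean for 75 pages, here is the arithmetic as a small sketch. It only multiplies the figures quoted above; real services may charge extras that are not modelled here.

```python
from decimal import Decimal

PAGES = 75
LOW_PRICE = Decimal("0.50")   # dollars per page, low end quoted above
HIGH_PRICE = Decimal("2.00")  # dollars per page, high end quoted above


def typing_cost(pages: int, price_per_page: Decimal) -> Decimal:
    """Total cost of having the given number of pages typed."""
    return pages * price_per_page


low = typing_cost(PAGES, LOW_PRICE)
high = typing_cost(PAGES, HIGH_PRICE)
print(f"{PAGES} pages: ${low:.2f} to ${high:.2f}")  # 75 pages: $37.50 to $150.00
```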

Doitsujin
Green Belt
Posts: 402
Joined: Sat Jul 18, 2015 6:21 pm
Languages: German (N)
x 801

Re: Converting pdf to text

Postby Doitsujin » Thu Mar 02, 2017 12:13 pm

tommus wrote:I suggest you use an image editor to edit each page by itself. Crop the image to show just the body text (not the headers, not the page numbers, not the horizontal lines, not the footnotes). Just plain text. Then store the image as a new PDF, one page per file. That will probably work much better.
I know it is a lot of work to do it like that, page by page. But I doubt if any converter is going to do it any other way.

Have you ever tried ScanTailor (freeware)? It'll do many of these steps automatically with limited user input.
The only downside is that the PDF needs to be converted to image files first.
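One way to do that PDF-to-image step is sketched below. It assumes the `pdftoppm` tool from Poppler is installed, which is a choice made for this example and not something mentioned in the thread; the 300 dpi default and the function name are likewise illustrative.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def pdf_to_images_command(pdf: Path, out_prefix: Path, dpi: int = 300) -> list[str]:
    """Command that renders every page of a PDF to a numbered PNG file."""
    # -r sets the resolution in dpi; -png selects PNG output.
    # Output files are named from out_prefix plus a page number.
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(out_prefix)]


def pdf_to_images(pdf: Path, out_prefix: Path, dpi: int = 300) -> None:
    """Run the conversion and fail loudly if pdftoppm reports an error."""
    subprocess.run(pdf_to_images_command(pdf, out_prefix, dpi), check=True)
```

The resulting PNG files can then be loaded into ScanTailor as a new project.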


Re: Converting pdf to text

Postby tommus » Thu Mar 02, 2017 12:57 pm

Doitsujin wrote:Have you ever tried ScanTailor (freeware)? It'll do many of these steps automatically with limited user input. The only downside is that the pdf needs to be converted to image files first.

No. I have never heard of ScanTailor, but it looks very interesting. If it works well, then it will surely save a lot of time. My scanner will save in either PDF or JPG, so I can scan books, for example, directly into images and skip the PDF-to-image conversion, although that is not too time-consuming even for material already in PDF.

I will give it a try right away. There goes quite a bit of my L2 study time for this morning. That is the problem. There are so many interesting ways to accumulate good second-language study material that there is very little time left to actually study it.


Re: Converting pdf to text

Postby tommus » Thu Mar 02, 2017 3:30 pm

OK. I just tried ScanTailor and I am impressed. It worked very well.

First, I scanned the first chapter of a Dutch book, two pages at a time, saving the pages as JPG. I then ran ScanTailor. The interface is a bit unusual but logical once you work through it for the first time. It has a lot of automation, with optional manual intervention if required (such as minor adjustments). My book has large pages with good layout and fonts, so the "auto" worked. ScanTailor will rotate the two-page scans, select the divide between the two pages and determine the text area automatically. One disappointment was that the output was TIF-format, not PDF. So you have to convert the TIFs to PDFs. There are lots of online apps for that. I used my offline IrfanView (which is free and excellent).

Then I completed the conversion to plain text using the process I described above, by dragging the 9 PDFs into Google Drive, and then opening them in Google Docs.

Excellent results.

Thanks to Doitsujin for recommending ScanTailor.


Re: Converting pdf to text

Postby rdearman » Thu Mar 02, 2017 5:01 pm

tommus wrote:OK. I just tried ScanTailor and I am impressed. It worked very well.

First, I scanned the first chapter of a Dutch book, two pages at a time, saving the pages as JPG. I then ran ScanTailor. The interface is a bit unusual but logical once you work through it for the first time. It has a lot of automation, with optional manual intervention if required (such as minor adjustments). My book has large pages with good layout and fonts, so the "auto" worked. ScanTailor will rotate the two-page scans, select the divide between the two pages and determine the text area automatically. One disappointment was that the output was TIF-format, not PDF. So you have to convert the TIFs to PDFs. There are lots of online apps for that. I used my offline IrfanView (which is free and excellent).

Then I completed the conversion to plain text using the process I described above, by dragging the 9 PDFs into Google Drive, and then opening them in Google Docs.

Excellent results.

Thanks to Doitsujin for recommending ScanTailor.


You can use ImageMagick for the TIF-to-PDF conversion and do them all in bulk. http://www.imagemagick.org/script/index.php
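A minimal sketch of that bulk conversion, assuming ImageMagick is installed and the ScanTailor output sits in one folder as `.tif` files. The function names and the folder are illustrative; each TIF becomes its own one-page PDF with the same base name.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def tif_to_pdf_commands(tifs: list[Path]) -> list[list[str]]:
    """Build one ImageMagick command per TIF, writing a PDF next to it."""
    commands = []
    for tif in sorted(tifs):
        # "convert" is the ImageMagick 6 entry point; ImageMagick 7 uses "magick".
        commands.append(["convert", str(tif), str(tif.with_suffix(".pdf"))])
    return commands


def convert_folder(tif_dir: Path) -> None:
    """Convert every .tif file in tif_dir, stopping at the first failure."""
    for command in tif_to_pdf_commands(list(tif_dir.glob("*.tif"))):
        subprocess.run(command, check=True)
```

Running `convert_folder(Path("out"))` on the ScanTailor output folder would leave one PDF per page, ready to drag into Google Drive as described above.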