Converting pdf to text

MorkTheFiddle
Green Belt
Posts: 267
Joined: Sat Jul 18, 2015 8:59 pm
Location: Texas, USA
Languages: English (N), French (read fluently), Spanish (read fluently), Ancient Greek (abandoned), Latin (abandoned). Once studied Old Norse.
Language Log: http://tinyurl.com/zcx4ogt

Post by MorkTheFiddle » Sat Feb 25, 2017 10:34 pm

Perhaps not a proper topic for this forum, but I am having a hard time converting to text a pdf copy of a 19th-century work about the Iliad. The work lists each word in the first book of the Iliad in order and gives its definition, part of speech and sometimes other information. Reading Eleanor Dickey's Learning Latin the Ancient Way (first mentioned by Tommus in this forum: see below) gave me the idea that an in-order vocabulary list like this could be produced for other ancient works, especially if all the hard work had already been done by a dead white man of a previous century :) . The work in question is Parsing Lessons to Homer's Iliad Book I, 4th edition, whose Google Books url I give below (couldn't find it at archive.org).

Extracting the Greek from the text would be a bonus, but the English text interests me far more. After trying a number of different paths, I finally had to resort to the doyen of pdfs, Adobe. Rather than mortgage my house to buy a copy, I downloaded a trial copy of Adobe and put it to work. The results are not spectacular. Only the html version is of any use. The output is a fair rendition of the English text, but the Greek text turned into nonsense, which makes reading the rest of the text difficult.
I suppose beggars can't be choosers, and what came out is better than nothing (and better than retyping the text by hand (maybe)), but I would like to know of a better method.
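
One quick way to judge a converter's output is to count how many characters actually came through as Greek script. Below is a minimal Python sketch, not a definitive tool: the function names `is_greek`, `count_greek` and `greek_ratio` are made up for illustration, and it uses only the standard `unicodedata` module. A count near zero on a page that should contain Greek suggests the Greek was replaced by Latin-letter nonsense.

```python
import unicodedata


def is_greek(ch: str) -> bool:
    """True if the character's Unicode name starts with 'GREEK'."""
    return unicodedata.name(ch, "").startswith("GREEK")


def count_greek(text: str) -> int:
    """Count the Greek-script characters in a piece of converted text."""
    # Normalise first so accented letters become single precomposed
    # code points where such a form exists.
    normalized = unicodedata.normalize("NFC", text)
    return sum(1 for ch in normalized if is_greek(ch))


def greek_ratio(text: str) -> float:
    """Share of the letters in the text that are Greek (0.0 if no letters)."""
    normalized = unicodedata.normalize("NFC", text)
    letters = [ch for ch in normalized if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for ch in letters if is_greek(ch)) / len(letters)
```

For example, `count_greek("μῆνιν")` counts five Greek letters, while `count_greek("mhnin")` gives zero.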

BTW, if I finish this project [a big if], it will be free for the asking, not a printed-on-demand thing costing $15 on the Internet.

Finally, I don't see why this method of presenting elementary material can't work for any language.

The first post about Dickey's book was by Tommus: http://forum.language-learners.org/viewtopic.php?f=14&t=3369&p=64459&hilit=dickey#p64459
The pdf I am using can be located here: https://books.google.com/books?id=wj9WAAAAcAAJ&pg=PA27&lpg=PA27&dq=parsing+lessons+to+homer%27s+Iliad&source=bl&ots=wz3FJ1mRWZ&sig=jkO6v2PxF0jz_d-9r5NMahvUPd0&hl=en&sa=X&ved=0ahUKEwiA-eP5oazSAhVmw1QKHdH5A60Q6AEIJjAD#v=onepage&q=parsing%20lessons%20to%20homer%27s%20Iliad&f=false

rdearman
Site Admin
Posts: 2070
Joined: Thu May 14, 2015 4:18 pm
Location: United Kingdom
Languages: English (N)
French (studies), Italian (studies), Mandarin (studies),
Esperanto TAC (Only god knows why), Finnish (only in it for the cookies)
Language Log: viewtopic.php?f=15&t=1836

Post by rdearman » Sat Feb 25, 2017 11:23 pm

You can use calibre to convert pdf to text: http://calibre-ebook.com/

There is an online version you can try here: http://ebook.online-convert.com/
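
For the desktop route, calibre installs a command-line tool called `ebook-convert`, which picks the output format from the output file's extension. A minimal Python sketch of calling it follows. The helper names and file names are placeholders of mine, and it assumes `ebook-convert` is on your PATH. No promises about quality on a scanned pdf.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def calibre_command(pdf_path: str, txt_path: str) -> List[str]:
    """Build the ebook-convert call; the .txt extension selects text output."""
    return ["ebook-convert", pdf_path, txt_path]


def convert_with_calibre(pdf_path: str, txt_path: str) -> bool:
    """Run the conversion; return False if calibre is missing or it fails."""
    if shutil.which("ebook-convert") is None:
        return False  # calibre's command-line tools are not installed
    result = subprocess.run(calibre_command(pdf_path, txt_path))
    # Also check that an output file really exists, since a clean log
    # does not guarantee any text was written.
    return result.returncode == 0 and Path(txt_path).exists()
```

Something like `convert_with_calibre("parsing_lessons.pdf", "parsing_lessons.txt")` would then run it, with both file names being hypothetical.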

Ingaræð
Yellow Belt
Posts: 52
Joined: Sat Nov 26, 2016 9:34 pm
Location: United Kingdom
Languages: English (N)
Studying: German (?), French (?), Russian (beg.).
Previously studied (beg.): Italian, Welsh.
Wishlist: Hungarian, most other European languages, Mandarin, Hebrew.
Language Log: https://forum.language-learners.org/vie ... =15&t=4993

Post by Ingaræð » Sat Feb 25, 2017 11:42 pm

I find that Calibre doesn't always process punctuation accurately, but maybe there are some settings I should be tweaking.

I've used poppler quite a bit for basic conversions of English text. I tried a basic conversion of p. 11 of that pdf, and it rendered the Greek as Latin characters, but maybe some of the options would produce better results? (Oh, and the version I have installed is pretty old.)
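
For anyone wanting to repeat the single-page test, poppler's command-line tool is `pdftotext`. Here is a rough Python sketch of calling it. The helper names and file names are my own placeholders, and it assumes `pdftotext` is installed and on the PATH. The `-f` and `-l` options pick the first and last page, `-layout` tries to keep the page layout, and `-enc UTF-8` asks for UTF-8 output. If the pdf's embedded text layer already stores the Greek as Latin characters, none of these options will bring it back.

```python
import shutil
import subprocess
from typing import List


def pdftotext_command(pdf_path: str, txt_path: str, page: int) -> List[str]:
    """Build a pdftotext call that extracts a single page as UTF-8 text."""
    return [
        "pdftotext",
        "-f", str(page),   # first page to convert
        "-l", str(page),   # last page to convert
        "-layout",         # keep the original physical layout where possible
        "-enc", "UTF-8",   # output text encoding
        pdf_path,
        txt_path,
    ]


def extract_page(pdf_path: str, txt_path: str, page: int) -> bool:
    """Run pdftotext; return False if it is missing or exits with an error."""
    if shutil.which("pdftotext") is None:
        return False
    result = subprocess.run(pdftotext_command(pdf_path, txt_path, page))
    return result.returncode == 0
```

Calling `extract_page("book.pdf", "p11.txt", 11)` would write page 11 to `p11.txt`, with both file names being hypothetical.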

EDIT: oops, wrong link!

Adrianslont
Orange Belt
Posts: 210
Joined: Sun Aug 16, 2015 10:39 am
Location: Australia
Languages: English (N), Indonesian (lower intermediate?) French (A2?)

Post by Adrianslont » Sun Feb 26, 2017 1:47 am

A very good question for this forum, I think!

As already mentioned - Calibre. I don't know about a web version but I downloaded the Windows app a while back and have converted just a couple of documents with success.

Cainntear
Blue Belt
Posts: 654
Joined: Thu Jul 30, 2015 11:04 am
Location: Scotland
Languages: English(N)
Advanced: French,Spanish, Scottish Gaelic
Intermediate: Italian, Catalan, Corsican
Basic: Welsh
Dabbling: Polish, Russian etc

Post by Cainntear » Sun Feb 26, 2017 8:04 pm

PDF is a nightmare. There really isn't a single tool on the planet that does a good job of converting PDFs to any other format, because that's not what the format was designed for -- it's just about making something printable, not editable.

Post by MorkTheFiddle » Mon Feb 27, 2017 12:40 am

Thanks to everyone who replied. I've decided to let this marinate for a while. I can report some findings.

1. Calibre simply failed. It took a while to work on the pdf and produced a log showing that all pages were converted, but apart from the log it output nothing.
2. I tried Adobe's export options one more time, but the results were all below par.
3. The linux package (poppler) is intriguing, but my last linux box perished and I don't trust bash for Windows 10 to do the trick.
4. I could pay someone to key the pdf's text into a text file.
5. Or, I could stop using the Iliad for this project and use instead the Latin of Tacitus, whom I prefer to Homer anyway (gasp!!! :twisted:)

tommus
Green Belt
Posts: 279
Joined: Sat Jul 04, 2015 3:59 pm
Location: Kingston, ON, Canada
Languages: English (N), French (B2), Dutch (B2), German (A1), Spanish (A1), Esperanto (A1)

Post by tommus » Mon Feb 27, 2017 1:44 am

I use Google Drive (GDr) + Google Docs (GDoc), both free.

1. Open GDr in a browser (I use Chrome. I don't know if it works in other browsers.)

2. Drag a pdf file into GDr.

3. Right click on that file in GDr and select "Open in Google Docs".

4. The text is now in GDoc.

For good quality PDFs, it is very, very good.

Post by Adrianslont » Mon Feb 27, 2017 3:30 am

Unless I'm missing something, you have only provided a link to Google Books and not a PDF. If so, can you provide the PDF? I feel like a challenge. No guarantees on results, of course!

Post by MorkTheFiddle » Mon Feb 27, 2017 7:07 pm

Thanks for the offer of help. Here is a link to the pdf. https://drive.google.com/open?id=0ByymqjYSyIAJVjk4bk5CZElCelk

Post by MorkTheFiddle » Mon Feb 27, 2017 7:11 pm

I appreciate the suggestion, Tommus. Unfortunately, when I try this, Google tells me, "Unable to Convert Document", without further explanation.