Converting pdf to text

MorkTheFiddle
Blue Belt
Posts: 506
Joined: Sat Jul 18, 2015 8:59 pm
Location: Texas, USA
Languages: English (N), French (read fluently), Spanish (read fluently). Studying Ancient Greek. Relearning German.
Language Log: viewtopic.php?f=15&t=5680&p=70021#p70021
x 709

Postby MorkTheFiddle » Sat Feb 25, 2017 10:34 pm

Perhaps not a proper topic for this forum, but I am having a hard time converting a PDF copy of a 19th-century work about the Iliad to text. The work lists in order each word in the first book of the Iliad and gives the definition, part of speech and sometimes other information about the word. Reading Eleanor Dickey's Learning Latin the Ancient Way (first mentioned by Tommus in this forum: see below) gave me the idea that a vocabulary list in order of appearance could be made for other ancient works, especially if all the hard work had been done by a dead white man of a previous century :) . The work in question is Parsing Lessons to Homer's Iliad Book I, 4th edition, whose Google Books URL I give below (couldn't find it at archive.org).

Extracting the Greek from the text would be a bonus, but the English text interests me far more. After trying a number of different paths, I finally had to resort to the doyen of pdfs, Adobe. Rather than mortgage my house to buy a copy, I downloaded a trial copy of Adobe and put it to work. The results are not spectacular. Only the html version is of any use. The output is a fair rendition of the English text, but the Greek text turned into nonsense, which makes reading the rest of the text difficult.
I suppose beggars can't be choosers, and what came out is better than nothing (and better than retyping the text by hand (maybe)), but I would like to know of a better method.
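
To compare converters on this kind of mixed English/Greek page, it may help to measure how much of the output is actually Greek script. The following is only a rough sketch: the function name `greek_share` and the sample string are made up for illustration, and the check only counts letters in the Unicode Greek and Greek Extended blocks. A page that visibly contains Greek but scores near zero has probably had its Greek turned into Latin-letter nonsense.

```python
def greek_share(text: str) -> float:
    """Return the fraction of letters in `text` that are Greek script.

    Counts letters in the Greek (U+0370-U+03FF) and Greek Extended
    (U+1F00-U+1FFF) blocks; the latter holds most polytonic forms.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    greek = [
        ch for ch in letters
        if "\u0370" <= ch <= "\u03ff" or "\u1f00" <= ch <= "\u1fff"
    ]
    return len(greek) / len(letters)


# Illustrative sample only, not taken from the book's actual layout.
sample = "μῆνιν, wrath"
print(greek_share(sample))
```

Running this over each converter's output for the same page gives a quick, if crude, way to rank them before reading anything by eye.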

BTW, if I finish this project [a big if], it will be free for the asking, not a printed-on-demand thing costing $15 on the Internet.

Finally, I don't see why this method of presenting elementary material can't work for any language.

The first post about Dickey's book was by Tommus: http://forum.language-learners.org/viewtopic.php?f=14&t=3369&p=64459&hilit=dickey#p64459
The pdf I am using can be located here: https://books.google.com/books?id=wj9WAAAAcAAJ&pg=PA27&lpg=PA27&dq=parsing+lessons+to+homer%27s+Iliad&source=bl&ots=wz3FJ1mRWZ&sig=jkO6v2PxF0jz_d-9r5NMahvUPd0&hl=en&sa=X&ved=0ahUKEwiA-eP5oazSAhVmw1QKHdH5A60Q6AEIJjAD#v=onepage&q=parsing%20lessons%20to%20homer%27s%20Iliad&f=false
Ah ! Le bon billet qu'a La Châtre !

rdearman
Site Admin
Posts: 2641
Joined: Thu May 14, 2015 4:18 pm
Location: United Kingdom
Languages: English (N)
French (studies), Italian (studies), Mandarin (studies),
Esperanto TAC (Only god knows why), Finnish (only in it for the cookies)
Language Log: viewtopic.php?f=15&t=1836
x 5494

Re: Converting pdf to text

Postby rdearman » Sat Feb 25, 2017 11:23 pm

You can use calibre to convert PDF to text: http://calibre-ebook.com/

There is also an online converter you can try: http://ebook.online-convert.com/
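
For anyone who prefers the command line, calibre also ships a converter called `ebook-convert`. Here is a minimal sketch of driving it from Python. The helper names `calibre_command` and `convert` are hypothetical, and the sketch assumes calibre is installed and on the PATH. It also checks that the output file is non-empty, since a conversion can finish without producing usable text.

```python
import shutil
import subprocess
from pathlib import Path


def calibre_command(pdf_path: str, txt_path: str) -> list:
    """Build the ebook-convert call; the output format follows the .txt suffix."""
    return ["ebook-convert", pdf_path, txt_path]


def convert(pdf_path: str, txt_path: str) -> bool:
    """Run the conversion and report whether a non-empty text file resulted."""
    if shutil.which("ebook-convert") is None:
        return False  # calibre's command-line tools were not found
    result = subprocess.run(
        calibre_command(pdf_path, txt_path),
        capture_output=True,
        text=True,
    )
    out = Path(txt_path)
    return result.returncode == 0 and out.exists() and out.stat().st_size > 0
```

Calling `convert("book.pdf", "book.txt")` (file names are placeholders) returns `False` if the tool is missing, the run fails, or the text file comes out empty.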
"Never blame on malice that which can be explained by stupidity."

Ingaræð
Orange Belt
Posts: 104
Joined: Sat Nov 26, 2016 9:34 pm
Location: United Kingdom
Languages: English (N)
Studying: German (?), French (?), Russian (beg.).
Previously studied (beg.): Italian, Welsh.
Wishlist: Hungarian, most other European languages, Mandarin, Hebrew.
Language Log: viewtopic.php?f=15&t=4993
x 222

Re: Converting pdf to text

Postby Ingaræð » Sat Feb 25, 2017 11:42 pm

I find that Calibre doesn't always process punctuation accurately, but maybe there are some settings I should be tweaking.

I've used poppler quite a bit for basic conversions of English text. I tried a basic conversion of p. 11 of that PDF, and it rendered the Greek as Latin characters, but some of the options might produce better results. (Oh, and the version I have installed is pretty old.)

EDIT: oops, wrong link!
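
A sketch of the kind of options meant here, using poppler's `pdftotext` tool from Python. The helper name `pdftotext_command` and the file names are hypothetical. Note the assumption: these options only change how the PDF's existing text layer is read out, so if that layer already stores the Greek as Latin characters, they may make no difference.

```python
import subprocess


def pdftotext_command(pdf_path, txt_path, first_page=None, last_page=None,
                      keep_layout=True):
    """Build a pdftotext call that writes UTF-8 text for a page range."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        cmd.append("-layout")  # try to preserve the physical page layout
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    cmd += [pdf_path, txt_path]
    return cmd


def extract_page(pdf_path, txt_path, page):
    """Extract a single page; raises CalledProcessError if pdftotext fails."""
    subprocess.run(
        pdftotext_command(pdf_path, txt_path, page, page),
        check=True,
    )
```

For example, `extract_page("book.pdf", "p11.txt", 11)` would write only page 11, which keeps test runs short while trying different settings.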

Adrianslont
Green Belt
Posts: 295
Joined: Sun Aug 16, 2015 10:39 am
Location: Australia
Languages: English (N), Indonesian (lower intermediate?) French (A2?)
x 403

Re: Converting pdf to text

Postby Adrianslont » Sun Feb 26, 2017 1:47 am

A very good question for this forum, I think!

As already mentioned - Calibre. I don't know about a web version but I downloaded the Windows app a while back and have converted just a couple of documents with success.

Cainntear
Blue Belt
Posts: 849
Joined: Thu Jul 30, 2015 11:04 am
Location: Scotland
Languages: English(N)
Advanced: French,Spanish, Scottish Gaelic
Intermediate: Italian, Catalan, Corsican
Basic: Welsh
Dabbling: Polish, Russian etc
x 1727

Re: Converting pdf to text

Postby Cainntear » Sun Feb 26, 2017 8:04 pm

PDF is a nightmare. There really isn't a single tool on the planet that does a good job of converting PDFs to any other format, because that's not what the format was designed for -- it's just about making something printable, not editable.

Re: Converting pdf to text

Postby MorkTheFiddle » Mon Feb 27, 2017 12:40 am

Thanks to everyone who replied. I've decided to let this marinate for a while. I can report some findings.

1. Calibre simply failed. It took a while to work on the PDF and produced a log showing that all pages were converted, but apart from the log it produced no output.
2. I tried Adobe's export options one more time, but the results were all below par.
3. The Linux package is intriguing, but my last Linux box perished and I don't trust Bash on Windows 10 to do the trick.
4. I could pay someone to key the PDF text into a text file.
5. Or, I could stop using the Iliad for this project and use instead the Latin of Tacitus, whom I prefer to Homer anyway (gasp!!! :twisted:)
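
One further route, not tried in this thread, would be to re-run OCR on the page images with a Greek-aware engine. The sketch below only builds the commands. It assumes poppler's `pdftoppm` and Tesseract are installed along with Tesseract's `eng` and `grc` (Ancient Greek) language data. The helper name `ocr_commands` and the file names are hypothetical, and how well this handles a 19th-century typeface is unknown.

```python
import subprocess


def ocr_commands(pdf_path, page, work_prefix="page",
                 languages=("eng", "grc"), dpi=300):
    """Return two commands: render one PDF page to PNG, then OCR it."""
    image_base = f"{work_prefix}-{page}"
    render = [
        "pdftoppm", "-r", str(dpi),
        "-f", str(page), "-l", str(page),
        "-png", "-singlefile",      # writes exactly image_base.png
        pdf_path, image_base,
    ]
    ocr = [
        "tesseract", f"{image_base}.png", image_base,  # writes image_base.txt
        "-l", "+".join(languages),
    ]
    return [render, ocr]


def ocr_page(pdf_path, page):
    """Run both steps in order; raises CalledProcessError on failure."""
    for cmd in ocr_commands(pdf_path, page):
        subprocess.run(cmd, check=True)
```

Trying `ocr_page("book.pdf", 11)` on a single page first would show whether the Greek comes through before committing to the whole book.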

tommus
Blue Belt
Posts: 511
Joined: Sat Jul 04, 2015 3:59 pm
Location: Kingston, ON, Canada
Languages: English (N), French (B2), Dutch (B2), German (A1), Spanish (A1), Esperanto (A1)
x 785

Re: Converting pdf to text

Postby tommus » Mon Feb 27, 2017 1:44 am

I use Google Drive (GDr) + Google Docs (GDoc), both free.

1. Open GDr in a browser (I use Chrome. I don't know if it works in other browsers.)

2. Drag a pdf file into GDr.

3. Right click on that file in GDr and select "Open in Google Docs".

4. The converted text opens in GDoc.

For good quality PDFs, it is very, very good.

Re: Converting pdf to text

Postby Adrianslont » Mon Feb 27, 2017 3:30 am

Unless I'm missing something, you have only provided a link to Google Books and not a PDF. If so, can you provide the PDF? I'm feeling like a challenge. No guarantees on results, of course!

Re: Converting pdf to text

Postby MorkTheFiddle » Mon Feb 27, 2017 7:07 pm

Adrianslont wrote:Unless I'm missing something you have only provided a link to google books and not a PDF. If so, can you provide the PDF? I'm feeling like a challenge. No guarantees on results of course!


Thanks for the offer of help. Here is a link to the pdf. https://drive.google.com/open?id=0ByymqjYSyIAJVjk4bk5CZElCelk

Re: Converting pdf to text

Postby MorkTheFiddle » Mon Feb 27, 2017 7:11 pm

tommus wrote:I use Google Drive (GDr) + Google Docs (GDoc), both free.

1. Open GDr in a browser (I use Chrome. I don't know if it works in other browsers.)

2. Drag a pdf file into GDr.

3. Right click on that file in GDr and select "Open in Google Docs".

4. Text is in GDoc.

For good quality PDFs, it is very, very good.


I appreciate the suggestion, Tommus. Unfortunately, when I try this, Google tells me, "Unable to Convert Document", without further explanation.