Converting pdf to text

Postby MorkTheFiddle » Sat Feb 25, 2017 10:34 pm

Perhaps not a proper topic for this forum, but I am having a hard time converting to text a PDF of a 19th-century work about the Iliad. The work lists each word of the first book of the Iliad in order and gives its definition, part of speech and sometimes other information. Reading Eleanor Dickey's Learning Latin the Ancient Way (first mentioned in this forum by Tommus: see below) gave me the idea that a vocabulary list in order of appearance could be prepared for other ancient works, especially if all the hard work had already been done by a dead white man of a previous century :) . The work in question is Parsing Lessons to Homer's Iliad Book I, 4th edition, whose Google Books URL I give below (I couldn't find it at archive.org).

Extracting the Greek from the text would be a bonus, but the English text interests me far more. After trying a number of different paths, I finally had to resort to the doyen of PDFs, Adobe. Rather than mortgage my house to buy a copy, I downloaded a trial copy of Adobe and put it to work. The results are not spectacular. Only the HTML export is of any use: it gives a fair rendition of the English text, but the Greek turned into nonsense, which makes the rest of the text difficult to read.
I suppose beggars can't be choosers, and what came out is better than nothing (and maybe better than retyping the text by hand), but I would like to know of a better method.
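
One rough way to check whether a given converter kept the Greek as Greek, or turned it into Latin-letter nonsense as described above, is to count the letters in its output by Unicode script. The following is only a minimal sketch using Python's standard unicodedata module; the function names are hypothetical and not part of any tool mentioned in this thread.

```python
import unicodedata


def script_of(ch):
    """Classify one character as 'GREEK', 'LATIN' or 'OTHER' by its Unicode name."""
    if not ch.isalpha():
        return "OTHER"
    name = unicodedata.name(ch, "")
    if name.startswith("GREEK"):
        return "GREEK"
    if name.startswith("LATIN"):
        return "LATIN"
    return "OTHER"


def count_scripts(text):
    """Count Greek letters, Latin letters and everything else in converted text."""
    # NFC composes base letters with their accents and breathings, so accented
    # polytonic Greek is counted as Greek letters rather than stray marks.
    text = unicodedata.normalize("NFC", text)
    counts = {"GREEK": 0, "LATIN": 0, "OTHER": 0}
    for ch in text:
        counts[script_of(ch)] += 1
    return counts
```

If a page that visibly contains Greek comes back with a GREEK count of zero, the converter has most likely mapped the Greek onto Latin characters, and the output cannot be repaired by simple search and replace.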

BTW, if I finish this project [a big if], it will be free for the asking, not a printed-on-demand thing costing $15 on the Internet.

Finally, I don't see why this method of presenting elementary material can't work for any language.

The first post about Dickey's book was by Tommus http://forum.language-learners.org/viewtopic.php?f=14&t=3369&p=64459&hilit=dickey#p64459.
The pdf I am using can be located here: https://books.google.com/books?id=wj9WAAAAcAAJ&pg=PA27&lpg=PA27&dq=parsing+lessons+to+homer%27s+Iliad&source=bl&ots=wz3FJ1mRWZ&sig=jkO6v2PxF0jz_d-9r5NMahvUPd0&hl=en&sa=X&ved=0ahUKEwiA-eP5oazSAhVmw1QKHdH5A60Q6AEIJjAD#v=onepage&q=parsing%20lessons%20to%20homer%27s%20Iliad&f=false

Re: Converting pdf to text

Postby rdearman » Sat Feb 25, 2017 11:23 pm

You can use calibre to convert PDF to text: http://calibre-ebook.com/

There is also an online converter you can try: http://ebook.online-convert.com/
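
For anyone who prefers the command line, calibre also installs an ebook-convert program that picks the output format from the file extension. Below is a minimal sketch of driving it from Python. It assumes ebook-convert is on the PATH, and the helper names and file names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_ebook_convert_command(pdf_path, txt_path):
    """Build the argument list for calibre's ebook-convert command-line tool."""
    # ebook-convert infers the output format from the extension (.txt here).
    return ["ebook-convert", str(pdf_path), str(txt_path)]


def convert_with_calibre(pdf_path, txt_path):
    """Run the conversion and return the size of the output file in bytes."""
    if shutil.which("ebook-convert") is None:
        raise FileNotFoundError("ebook-convert not found; is calibre installed?")
    subprocess.run(build_ebook_convert_command(pdf_path, txt_path), check=True)
    # A size of 0 means the conversion ran but extracted no text.
    return Path(txt_path).stat().st_size
```

Calling convert_with_calibre("iliad.pdf", "iliad.txt") and checking the returned size is a quick way to tell whether any text came out at all.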

Re: Converting pdf to text

Postby Ingaræð » Sat Feb 25, 2017 11:42 pm

I find that Calibre doesn't always process punctuation accurately, but maybe there are some settings I should be tweaking.

I've used poppler quite a bit for basic conversions of English text. I tried a basic conversion of p. 11 of that PDF, and it rendered the Greek as Latin characters, but maybe some of the options would produce better results? (Oh, and the version I have installed is pretty old.)
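
A single-page conversion like the one described can be scripted around poppler's pdftotext tool. This is only a sketch: it assumes pdftotext is installed and on the PATH, the helper name and file names are hypothetical, and nothing here guarantees better Greek output.

```python
import subprocess


def build_pdftotext_command(pdf_path, txt_path, first_page=None,
                            last_page=None, keep_layout=False):
    """Build the argument list for poppler's pdftotext."""
    cmd = ["pdftotext", "-enc", "UTF-8"]   # ask for UTF-8 output
    if first_page is not None:
        cmd += ["-f", str(first_page)]     # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]      # last page to convert
    if keep_layout:
        cmd.append("-layout")              # keep the physical page layout
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def convert_pages(pdf_path, txt_path, first_page, last_page):
    """Convert a page range; raises CalledProcessError if pdftotext fails."""
    cmd = build_pdftotext_command(pdf_path, txt_path, first_page, last_page)
    subprocess.run(cmd, check=True)
```

For example, convert_pages("book.pdf", "p11.txt", 11, 11) would write only page 11 to p11.txt. pdftotext can only read text already embedded in the PDF, so if that embedded text stores the Greek as Latin characters, no option will recover it.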

EDIT: oops, wrong link!

Re: Converting pdf to text

Postby Adrianslont » Sun Feb 26, 2017 1:47 am

A very good question for this forum, I think!

As already mentioned: Calibre. I don't know about a web version, but I downloaded the Windows app a while back and have converted a couple of documents with success.

Re: Converting pdf to text

Postby Cainntear » Sun Feb 26, 2017 8:04 pm

PDF is a nightmare. There really isn't a single tool on the planet that does a good job of converting PDFs to any other format, because that's not what the format was designed for -- it's just about making something printable, not editable.

Re: Converting pdf to text

Postby MorkTheFiddle » Mon Feb 27, 2017 12:40 am

Thanks to everyone who replied. I've decided to let this marinate for a while. I can report some findings.

1. Calibre simply failed. It took a while to work on the PDF and produced a log showing that all pages were converted, but apart from the log it produced no output.
2. I tried Adobe's export options one more time, but the results were all below par.
3. The Linux package (poppler) is intriguing, but my last Linux box perished and I don't trust Bash for Windows 10 to do the trick.
4. I could pay someone to key the PDF's text into a text file.
5. Or I could stop using the Iliad for this project and use instead the Latin of Tacitus, whom I prefer to Homer anyway (gasp!!! :twisted:)

Re: Converting pdf to text

Postby tommus » Mon Feb 27, 2017 1:44 am

I use Google Drive (GDr) + Google Docs (GDoc), both free.

1. Open GDr in a browser (I use Chrome. I don't know if it works in other browsers.)

2. Drag a pdf file into GDr.

3. Right click on that file in GDr and select "Open in Google Docs".

4. The converted text appears in a new Google Doc.

For good quality PDFs, it is very, very good.

Re: Converting pdf to text

Postby Adrianslont » Mon Feb 27, 2017 3:30 am

Unless I'm missing something, you have only provided a link to Google Books and not a PDF. If so, can you provide the PDF? I'm feeling like a challenge. No guarantees on results, of course!

Re: Converting pdf to text

Postby MorkTheFiddle » Mon Feb 27, 2017 7:07 pm

Adrianslont wrote:Unless I'm missing something you have only provided a link to google books and not a PDF. If so, can you provide the PDF? I'm feeling like a challenge. No guarantees on results of course!


Thanks for the offer of help. Here is a link to the pdf. https://drive.google.com/open?id=0ByymqjYSyIAJVjk4bk5CZElCelk

Re: Converting pdf to text

Postby MorkTheFiddle » Mon Feb 27, 2017 7:11 pm

tommus wrote:I use Google Drive (GDr) + Google Docs (GDoc), both free.


I appreciate the suggestion, Tommus. Unfortunately, when I try this, Google tells me, "Unable to Convert Document", without further explanation.