Converting pdf to text

tommus
Green Belt
Posts: 314
Joined: Sat Jul 04, 2015 3:59 pm
Location: Kingston, ON, Canada
Languages: English (N), French (B2), Dutch (B2), German (A1), Spanish (A1), Esperanto (A1)
x 531

Re: Converting pdf to text

Postby tommus » Thu Mar 02, 2017 8:47 pm

rdearman wrote: You can use ImageMagick for the conversion and do them en masse. http://www.imagemagick.org/script/index.php

I have ImageMagick but I'm not very "fluent" with it. The GUI version mainly just displays images. The command-line version is much more powerful, but it has so many commands and options that the learning curve is considerable. However, converting a bunch of TIFs to a single PDF is dead simple once you figure out what you have to do:

magick *.tif a.pdf

Then you can drop a.pdf into Google Drive and open it in Google Docs.

I thought that magick *.tif *.pdf would convert each TIF to its own PDF, but that won't run.
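
As far as I can tell, magick treats only its last argument as the output name and has no notion of an output wildcard, so one PDF per TIF needs one magick call per file. A minimal sketch of that loop, assuming a POSIX shell (Linux, macOS, or something like Git Bash on Windows) and the ImageMagick 7 magick command; the function name tif_to_pdfs is made up for illustration:

```shell
# Convert every .tif in the current directory to its own .pdf.
# tif_to_pdfs is a hypothetical helper name, not part of ImageMagick.
tif_to_pdfs() {
  for f in *.tif; do
    # Skip the literal pattern "*.tif" when no matching files exist.
    [ -e "$f" ] || continue
    # Strip the .tif suffix and add .pdf: page1.tif becomes page1.pdf.
    magick "$f" "${f%.tif}.pdf"
  done
}
```

Running tif_to_pdfs in the folder holding the scans should leave one PDF next to each TIF. The loop syntax is different in the plain Windows command prompt.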

And I thought that magick *.tif -append a.pdf would put all the TIFs into a single PDF. It does, but the text comes out extremely tiny (about 1/10th of its normal size). Strange.

Anyway, magick *.tif a.pdf works just fine. Thanks rdearman. That speeds up the process.

coldrainwater
Orange Belt
Posts: 127
Joined: Sun Jan 01, 2017 4:53 am
Location: Hugh's Town
Languages: English (N), Spanish (A2)
Language Log: viewtopic.php?p=65330#p65330
x 113

Re: Converting pdf to text

Postby coldrainwater » Fri Mar 17, 2017 4:09 am

For some time, we used PDF-to-Excel converters until we outgrew them. It may not help you with this project, but now when I need to parse PDF documents, I rely on a .NET C# library called iTextSharp. It is free and you can choose which 75 pages of text you wish to try. With higher-quality, more regular PDF documents, I have even had success with simple regular expression libraries (after converting via iTextSharp) to extract the patterns that I am most interested in.

For the record, you can also count me amongst the group that loves PDFTK. Very powerful tool. Most often I have used it to burst large PDF documents into single pages.
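
For anyone who has not tried it, the burst step can be sketched roughly like this, assuming pdftk is installed and on the PATH. The helper name burst_pdf and the page prefix are my own choices for illustration; the output pattern is printf-style:

```shell
# Split a PDF into one file per page with pdftk's burst operation.
# burst_pdf is a hypothetical helper name, not part of pdftk.
burst_pdf() {
  input=$1
  prefix=${2:-page}
  # Writes page_0001.pdf, page_0002.pdf, ... into the current directory.
  pdftk "$input" burst output "${prefix}_%04d.pdf"
}
```

Calling burst_pdf big.pdf should give one numbered PDF per page. pdftk also writes a doc_data.txt file with the document metadata next to the pages.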

If you can find a pattern, you can typically reduce the amount of manual work you need to do to a very manageable subset of the original task.
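
To make the pattern idea concrete without .NET, here is a rough command-line sketch. It swaps iTextSharp for pdftotext (from poppler-utils) plus grep, so it is not what I run myself, and it only helps with PDFs that already have a text layer, not raw scans. The helper name extract_matches is hypothetical:

```shell
# Print every match of an extended regex found in a PDF's text layer.
# extract_matches is a hypothetical helper; pdftotext ships with poppler-utils.
extract_matches() {
  pdf=$1
  pattern=$2
  # "-" sends the extracted text to stdout; -layout tries to keep columns aligned.
  # grep -E uses extended regex syntax, -o prints only the matching part.
  pdftotext -layout "$pdf" - | grep -Eo "$pattern"
}
```

For example, extract_matches report.pdf '[0-9]{4}-[0-9]{2}-[0-9]{2}' would list anything in report.pdf that looks like a yyyy-mm-dd date, one per line.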

MorkTheFiddle
Green Belt
Posts: 346
Joined: Sat Jul 18, 2015 8:59 pm
Location: Texas, USA
Languages: English (N), French (read fluently), Spanish (read fluently). Studying Ancient Greek and Latin. Once studied Old Norse.
Language Log: viewtopic.php?f=15&t=5680&p=70021#p70021
x 408

Re: Converting pdf to text

Postby MorkTheFiddle » Fri Mar 17, 2017 5:35 pm

coldrainwater wrote: For some time, we used PDF-to-Excel converters until we outgrew them. It may not help you with this project, but now when I need to parse PDF documents, I rely on a .NET C# library called iTextSharp. It is free and you can choose which 75 pages of text you wish to try. With higher-quality, more regular PDF documents, I have even had success with simple regular expression libraries (after converting via iTextSharp) to extract the patterns that I am most interested in.

For the record, you can also count me amongst the group that loves PDFTK. Very powerful tool. Most often I have used it to burst large PDF documents into single pages.

If you can find a pattern, you can typically reduce the amount of manual work you need to do to a very manageable subset of the original task.

Interesting suggestion. I'll give it a shot and let you know how it works out. Thanks.
ἐς Τροίαν πειρώμενοι ἦνθον ᾿Αχαιοί,
καλλίστα παίδων: πείρᾳ θην πάντα τελεῖται.
Theocritus, Idyll 15

