Converting pdf to text


Re: Converting pdf to text

Postby tommus » Thu Mar 02, 2017 8:47 pm

rdearman wrote: You can use ImageMagick for the conversion and do them en masse. http://www.imagemagick.org/script/index.php

I have imagemagick but I'm not very "fluent" with it. The GUI version mainly just displays images. The command-line version is much more powerful but has so many commands and options that it has a considerable learning curve. However, it is dead simple once you figure out what you have to do to convert a bunch of TIFs to a single PDF.

magick *.tif a.pdf

Then you can drop a.pdf into Google Drive and open it in Google Docs.

I thought that magick *.tif *.pdf would convert each tif to a single pdf, but that won't run.

And I thought that magick *.tif -append a.pdf would put all the TIFs into a single PDF. It does, but the size of the text/font is extremely tiny (about 1/10th). Strange.

Anyway, magick *.tif a.pdf works just fine. Thanks rdearman. That speeds up the process.
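
A minimal sketch of the per-file conversion that magick *.tif *.pdf does not do: loop over the files and call magick once per TIFF. The function name tifs_to_pdfs is made up for illustration, and the sketch assumes the ImageMagick 7 magick command is on your PATH.

```shell
# Convert every .tif in a directory to its own PDF (hypothetical helper).
# Usage: tifs_to_pdfs [directory]   (defaults to the current directory)
tifs_to_pdfs() {
    dir="${1:-.}"
    for f in "$dir"/*.tif; do
        [ -e "$f" ] || continue        # nothing to do if the glob matched no files
        magick "$f" "${f%.tif}.pdf"    # page1.tif -> page1.pdf
    done
}
```

Calling tifs_to_pdfs . in the folder with the scans should leave one PDF next to each TIFF. As for the tiny text: -append joins all the images top to bottom into one tall image, so the PDF ends up as a single very long page that the viewer scales down to fit.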

Re: Converting pdf to text

Postby coldrainwater » Fri Mar 17, 2017 4:09 am

For some time, we used pdf to excel converters until we outgrew them. It may not help you with this project, but now when I need to parse PDF documents, I rely on a .NET C# library called itextsharp. It is free and you can choose which 75 pages of text you wish to try. With higher quality and more regular PDF documents, I have even had success with simple regular expression libraries (after converting via itextsharp) to extract the patterns that I am most interested in.
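
To illustrate the regular-expression step, here is a rough sketch that works on text already extracted to a plain file. It swaps in grep for the .NET regex library, and both the function name extract_dates and the date pattern are hypothetical examples, not anything from the original workflow.

```shell
# Print every ISO-style date (YYYY-MM-DD) found in an extracted text file,
# one match per line. The pattern is only an example; substitute whatever
# regular pattern your documents actually contain.
extract_dates() {
    grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}' "$1"
}
```

The -o option prints only the matching part of each line, which is what makes this useful for pulling fields out of otherwise messy converted text.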

For the record, you can also count me amongst the group that loves PDFTK. Very powerful tool. Most often I have used it to burst large PDF documents into single pages.

If you can find a pattern, you can typically reduce the amount of manual work you need to do to a very manageable subset of the original task.
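
For anyone who has not used it, bursting a PDF into single pages with PDFTK looks roughly like this. The wrapper name burst_pdf and the output folder are assumptions for the example; the burst operation and the printf-style output pattern are standard pdftk usage.

```shell
# Split a PDF into one file per page with pdftk (hypothetical wrapper).
# Usage: burst_pdf input.pdf [output_directory]
burst_pdf() {
    src="$1"
    outdir="${2:-pages}"
    mkdir -p "$outdir"
    # %04d is replaced by the page number: page_0001.pdf, page_0002.pdf, ...
    pdftk "$src" burst output "$outdir/page_%04d.pdf"
}
```

Single pages are handy when only part of a document needs OCR or when a converter has a page limit.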

Re: Converting pdf to text

Postby MorkTheFiddle » Fri Mar 17, 2017 5:35 pm

coldrainwater wrote:For some time, we used pdf to excel converters until we outgrew them. It may not help you with this project, but now when I need to parse PDF documents, I rely on a .NET C# library called itextsharp. It is free and you can choose which 75 pages of text you wish to try. With higher quality and more regular PDF documents, I have even had success with simple regular expression libraries (after converting via itextsharp) to extract the patterns that I am most interested in.

For the record, you can also count me amongst the group that loves PDFTK. Very powerful tool. Most often I have used it to burst large PDF documents into single pages.

If you can find a pattern, you can typically reduce the amount of manual work you need to do to a very manageable subset of the original task.

Interesting suggestion. I'll give it a shot and let you know how it works out. Thanks.

Re: Converting pdf to text

Postby zenmonkey » Tue Jul 31, 2018 6:02 pm

Along with magick, ghostscript can be used to convert to images and then ocr those images.

If you want to convert a pdf to text then one possibility is to use tesseract, the open-source ocr program from Google. You need to first convert the pdf file to an image and then pass the image file through tesseract.

Install ghostscript (I use 'brew install ghostscript') and run:

Code: Select all

gs -dNOPAUSE -dBATCH -sDEVICE=tiffgray -r300 -sOutputFile=output.tif source.pdf

* Note: gs may have a different name on your system (on Windows it is typically 'gswin64c.exe').

Install tesseract ('brew install tesseract')

Code: Select all

tesseract output.tif outtext -l deu+eng


where the -l flag specifies the language codes of the languages contained in the document, joined with '+'. The recognized text is written to outtext.txt.

Ghostscript is incredibly fast - a few minutes to convert a 700 page pdf into a single tif file.
tesseract will then process a few pages a minute.
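
The two steps can also be chained in one small script. This is only a sketch under some assumptions: it writes one TIFF per page rather than a single multi-page file, it uses Ghostscript's tiffgray device, and the function name pdf_to_text and the working folder name are invented for the example.

```shell
# Render a PDF to one TIFF per page, OCR each page, and join the results
# (hypothetical helper).
# Usage: pdf_to_text source.pdf [languages] [work_directory]
pdf_to_text() {
    src="$1"
    langs="${2:-eng}"          # e.g. deu+eng
    work="${3:-ocr_work}"
    mkdir -p "$work"
    # %04d in the output name makes gs write page-0001.tif, page-0002.tif, ...
    gs -q -dNOPAUSE -dBATCH -sDEVICE=tiffgray -r300 \
       -sOutputFile="$work/page-%04d.tif" "$src"
    for img in "$work"/page-*.tif; do
        [ -e "$img" ] || continue
        # tesseract appends .txt to the output base name itself
        tesseract "$img" "${img%.tif}" -l "$langs"
    done
    # Zero-padded names keep the pages in order when concatenated
    cat "$work"/page-*.txt > "${src%.pdf}.txt"
}
```

For example, pdf_to_text source.pdf deu+eng should leave the combined text in source.txt. Working page by page also means a failed run can be resumed without rendering everything again.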

Here is the tesseract manual https://github.com/tesseract-ocr/tesser ... ract.1.asc

Note that it currently handles a whole bunch of languages:
afr (Afrikaans), amh (Amharic), ara (Arabic), asm (Assamese), aze (Azerbaijani), aze_cyrl (Azerbaijani - Cyrillic), bel (Belarusian), ben (Bengali), bod (Tibetan), bos (Bosnian), bre (Breton), bul (Bulgarian), cat (Catalan; Valencian), ceb (Cebuano), ces (Czech), chi_sim (Chinese - Simplified), chi_tra (Chinese - Traditional), chr (Cherokee), cym (Welsh), dan (Danish), deu (German), dzo (Dzongkha), ell (Greek, Modern (1453-)), eng (English), enm (English, Middle (1100-1500)), epo (Esperanto), equ (Math / equation detection module), est (Estonian), eus (Basque), fas (Persian), fin (Finnish), fra (French), frk (Frankish), frm (French, Middle (ca.1400-1600)), gle (Irish), glg (Galician), grc (Greek, Ancient (to 1453)), guj (Gujarati), hat (Haitian; Haitian Creole), heb (Hebrew), hin (Hindi), hrv (Croatian), hun (Hungarian), iku (Inuktitut), ind (Indonesian), isl (Icelandic), ita (Italian), ita_old (Italian - Old), jav (Javanese), jpn (Japanese), kan (Kannada), kat (Georgian), kat_old (Georgian - Old), kaz (Kazakh), khm (Central Khmer), kir (Kirghiz; Kyrgyz), kor (Korean), kor_vert (Korean (vertical)), kur (Kurdish), kur_ara (Kurdish (Arabic)), lao (Lao), lat (Latin), lav (Latvian), lit (Lithuanian), ltz (Luxembourgish), mal (Malayalam), mar (Marathi), mkd (Macedonian), mlt (Maltese), mon (Mongolian), mri (Maori), msa (Malay), mya (Burmese), nep (Nepali), nld (Dutch; Flemish), nor (Norwegian), oci (Occitan (post 1500)), ori (Oriya), osd (Orientation and script detection module), pan (Panjabi; Punjabi), pol (Polish), por (Portuguese), pus (Pushto; Pashto), que (Quechua), ron (Romanian; Moldavian; Moldovan), rus (Russian), san (Sanskrit), sin (Sinhala; Sinhalese), slk (Slovak), slv (Slovenian), snd (Sindhi), spa (Spanish; Castilian), spa_old (Spanish; Castilian - Old), sqi (Albanian), srp (Serbian), srp_latn (Serbian - Latin), sun (Sundanese), swa (Swahili), swe (Swedish), syr (Syriac), tam (Tamil), tat (Tatar), tel (Telugu), tgk (Tajik), tgl (Tagalog), tha (Thai), tir (Tigrinya), ton (Tonga), tur (Turkish), uig (Uighur; Uyghur), ukr (Ukrainian), urd (Urdu), uzb (Uzbek), uzb_cyrl (Uzbek - Cyrillic), vie (Vietnamese), yid (Yiddish), yor (Yoruba)
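
Supported is not the same as installed, so it can be worth checking which language packs your own tesseract actually has before a long run. A small sketch, assuming the --list-langs option of the tesseract command line; the function name has_tess_lang is made up.

```shell
# Succeed (exit status 0) if the given language code is installed
# (hypothetical helper).
# Usage: has_tess_lang deu && echo "German is available"
has_tess_lang() {
    # 2>&1 because some older releases print the list on stderr;
    # -x matches the whole line, -F treats the code as a literal string.
    tesseract --list-langs 2>&1 | grep -qxF "$1"
}
```

If a code is missing, the corresponding traineddata file needs to be installed before -l will accept it.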

Re: Converting pdf to text

Postby rdearman » Tue Jul 31, 2018 6:21 pm

If you upload the PDF to Google Drive and then open it with Google Docs, Google will do an automatic conversion for you using the state-of-the-art converter they use for books. It will also do OCR. Simple and quick, with no need to install any software.

Re: Converting pdf to text

Postby Teango » Tue Jul 31, 2018 7:58 pm

I've had success using ABBYY FineReader Express in the past with a variety of tricky fonts and less common languages (although that was about 7 years ago now!) You could always check to see if the company offers a trial version of their latest OCR conversion software and try it out on a sample of your pdf?

Re: Converting pdf to text

Postby cjareck » Tue Jul 31, 2018 8:13 pm

I am using tesseract and Google docs with success. If you wish to have a graphical frontend for Tesseract, look for yagf. I have it on Linux, but perhaps there is also a Windows version.

Re: Converting pdf to text

Postby zenmonkey » Wed Aug 01, 2018 7:28 am

rdearman wrote:If you load the PDF to Google drive, then try to edit it with google docs Google will do an automatic conversion for you using the state of the art converter they use for books. It will do ocr also. Simple and quick with no need to install any software.


Very nice. Unfortunately, the results are a mixed bag with Setswana.

cjareck wrote:I am using tesseract and Google docs with success. If you wish to have a graphical frontend for Tesseract, look for yagf. I have it on Linux, but perhaps there is also a Windows version.


I've only had terrible output so far and the front end doesn't work with OS so that's out for me.

Teango wrote:I've had success using ABBYY FineReader Express in the past with a variety of tricky fonts and less common languages (although that was about 7 years ago now!) You could always check to see if the company offers a trial version of their latest OCR conversion software and try it out on a sample of your pdf?


Very impressive. They have a 3-page trial exporter with good results (not perfect with old typewriter scans), and they do Setswana. But it costs 99€, and in my eyes the results are not good enough to justify the purchase.