Converting pdf to text

tommus · Postby **tommus** » Thu Mar 02, 2017 8:47 pm

rdearman wrote: You can use ImageMagick for the conversion and do them en masse. http://www.imagemagick.org/script/index.php

I have imagemagick but I'm not very "fluent" with it. The GUI version mainly just displays images. The command-line version is much more powerful but has so many commands and options that it has a considerable learning curve. However, it is dead simple once you figure out what you have to do to convert a bunch of TIFs to a single PDF.

magick *.tif a.pdf

Then you can drop a.pdf into Google Drive and open it in Google Docs.

I thought that magick *.tif *.pdf would convert each tif to a single pdf, but that won't run.

And I thought that magick *.tif -append a.pdf would put all the TIFs into a single PDF. It does, but the size of the text/font is extremely tiny (about 1/10th). Strange.

Anyway, magick *.tif a.pdf works just fine. Thanks rdearman. That speeds up the process.
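For the one-PDF-per-TIF case mentioned above, a loop is probably the simplest route: magick takes a single output name as its last argument, so a wildcard there has nothing to map onto. This is a minimal sketch, assuming a POSIX shell; the helper name pdf_name is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: swap the extension of a TIF name for .pdf.
pdf_name() {
  printf '%s\n' "${1%.*}.pdf"
}

# Convert each TIF in the current directory to its own PDF.
for tif in *.tif; do
  [ -e "$tif" ] || continue   # skip the literal pattern when nothing matches
  magick "$tif" "$(pdf_name "$tif")"
done
```

As for the tiny text, -append joins all the images top to bottom into one tall image, so the whole stack most likely ends up on a single page.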

coldrainwater · Postby **coldrainwater** » Fri Mar 17, 2017 4:09 am

For some time, we used pdf to excel converters until we outgrew them. It may not help you with this project, but now when I need to parse PDF documents, I rely on a .NET C# library called itextsharp. It is free and you can choose which 75 pages of text you wish to try. With higher quality and more regular PDF documents, I have even had success with simple regular expression libraries (after converting via itextsharp) to extract the patterns that I am most interested in.

For the record, you can also count me amongst the group that loves PDFTK. Very powerful tool. Most often I have used it to burst large PDF documents into single pages.
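Bursting with PDFTK can be sketched as below. The wrapper name burst_pdf and the default prefix are assumptions for this example; the burst operation and the printf-style output pattern are pdftk's own.

```shell
#!/bin/sh
# Split a PDF into one file per page with pdftk's burst operation.
# burst_pdf is a hypothetical wrapper written for this example.
burst_pdf() {
  src=$1
  prefix=${2:-page}
  # %04d is filled in by pdftk with the page number (0001, 0002, ...).
  pdftk "$src" burst output "${prefix}_%04d.pdf"
}

# Example call (uncomment once big.pdf exists):
# burst_pdf big.pdf chapter
```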

If you can find a pattern, you can typically reduce the amount of manual work you need to do to a very manageable subset of the original task.
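Once the text is out of the PDF, the pattern step can be as small as one regular expression. Here is a rough sketch using grep rather than a .NET regex library; the helper name extract_dates, the date pattern, and the file name extracted.txt are all assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print every ISO-style date (YYYY-MM-DD) found on stdin,
# one match per line.
extract_dates() {
  grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}'
}

# Demo on inline sample text; for real use: extract_dates < extracted.txt
printf 'Invoice 17 issued 2017-03-02, due 2017-04-01.\n' | extract_dates
```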

MorkTheFiddle · Postby **MorkTheFiddle** » Fri Mar 17, 2017 5:35 pm

coldrainwater wrote:For some time, we used pdf to excel converters until we outgrew them. It may not help you with this project, but now when I need to parse PDF documents, I rely on a .NET C# library called itextsharp. It is free and you can choose which 75 pages of text you wish to try. With higher quality and more regular PDF documents, I have even had success with simple regular expression libraries (after converting via itextsharp) to extract the patterns that I am most interested in.

For the record, you can also count me amongst the group that loves PDFTK. Very powerful tool. Most often I have used it to burst large PDF documents into single pages.

If you can find a pattern, you can typically reduce the amount of manual work you need to do to a very manageable subset of the original task.

Interesting suggestion. I'll give it a shot and let you know how it works out. Thanks.

zenmonkey · Postby **zenmonkey** » Tue Jul 31, 2018 6:02 pm

Along with magick, ghostscript can be used to convert the PDF to images, which you can then OCR.

If you want to convert a PDF to text, one possibility is tesseract, the open-source OCR program from Google. You first need to convert the PDF file to an image and then pass the image file through tesseract.

Install ghostscript (I use 'brew install ghostscript') and run:

gs -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -r300 -sOutputFile=output.tif source.pdf

* note gs might be named 'gswin64c.exe' or something else on your system.

Install tesseract ('brew install tesseract')

tesseract output.tif outtext -l deu+eng

where the -l flag specifies the language codes used in the document, joined with '+' (here German and English). The text is written to outtext.txt.
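A mistyped or uninstalled language code makes tesseract fail, so it may be worth checking the codes first. Here is a small sketch built on tesseract's --list-langs option; the helper name langs_available is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if every code in a "deu+eng" style spec
# appears as a whole line in the list of installed languages read from stdin.
langs_available() {
  installed=$(cat)
  for code in $(printf '%s' "$1" | tr '+' ' '); do
    printf '%s\n' "$installed" | grep -qx "$code" || return 1
  done
}

# Usage (requires tesseract on PATH):
# tesseract --list-langs 2>&1 | langs_available deu+eng && echo "all installed"
```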

Ghostscript is incredibly fast - a few minutes to convert a 700 page pdf into a single tif file.
tesseract will then process a few pages a minute.
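Since the OCR step is the slow part, one option is to write one TIF per page and OCR the pages one at a time, so an interrupted run loses little. This is a sketch under assumptions: the function name ocr_pdf and the page-%04d naming are invented here, and tiffg4 is a black-and-white device chosen to keep the files small.

```shell
#!/bin/sh
# Sketch: rasterize a PDF to one TIF per page, OCR each page, join the text.
# ocr_pdf and the page-%04d naming are assumptions made for this example.
ocr_pdf() {
  src=$1
  langs=${2:-eng}
  # %04d in the output name makes Ghostscript write one file per page.
  gs -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -r300 \
     -sOutputFile=page-%04d.tif "$src" || return 1
  for tif in page-*.tif; do
    [ -e "$tif" ] || continue
    # tesseract appends .txt to the output base name itself.
    tesseract "$tif" "${tif%.tif}" -l "$langs" || return 1
  done
  # Zero-padded names keep the pages in order when the glob is sorted.
  cat page-*.txt > "${src%.pdf}.txt"
}

# Example call (uncomment once source.pdf exists):
# ocr_pdf source.pdf deu+eng
```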

Here is the tesseract manual https://github.com/tesseract-ocr/tesser ... ract.1.asc

Note that it currently handles a whole bunch of languages:

afr (Afrikaans), amh (Amharic), ara (Arabic), asm (Assamese), aze (Azerbaijani), aze_cyrl (Azerbaijani - Cyrillic), bel (Belarusian), ben (Bengali), bod (Tibetan), bos (Bosnian), bre (Breton), bul (Bulgarian), cat (Catalan; Valencian), ceb (Cebuano), ces (Czech), chi_sim (Chinese - Simplified), chi_tra (Chinese - Traditional), chr (Cherokee), cym (Welsh), dan (Danish), deu (German), dzo (Dzongkha), ell (Greek, Modern (1453-)), eng (English), enm (English, Middle (1100-1500)), epo (Esperanto), equ (Math / equation detection module), est (Estonian), eus (Basque), fas (Persian), fin (Finnish), fra (French), frk (Frankish), frm (French, Middle (ca.1400-1600)), gle (Irish), glg (Galician), grc (Greek, Ancient (to 1453)), guj (Gujarati), hat (Haitian; Haitian Creole), heb (Hebrew), hin (Hindi), hrv (Croatian), hun (Hungarian), iku (Inuktitut), ind (Indonesian), isl (Icelandic), ita (Italian), ita_old (Italian - Old), jav (Javanese), jpn (Japanese), kan (Kannada), kat (Georgian), kat_old (Georgian - Old), kaz (Kazakh), khm (Central Khmer), kir (Kirghiz; Kyrgyz), kor (Korean), kor_vert (Korean (vertical)), kur (Kurdish), kur_ara (Kurdish (Arabic)), lao (Lao), lat (Latin), lav (Latvian), lit (Lithuanian), ltz (Luxembourgish), mal (Malayalam), mar (Marathi), mkd (Macedonian), mlt (Maltese), mon (Mongolian), mri (Maori), msa (Malay), mya (Burmese), nep (Nepali), nld (Dutch; Flemish), nor (Norwegian), oci (Occitan (post 1500)), ori (Oriya), osd (Orientation and script detection module), pan (Panjabi; Punjabi), pol (Polish), por (Portuguese), pus (Pushto; Pashto), que (Quechua), ron (Romanian; Moldavian; Moldovan), rus (Russian), san (Sanskrit), sin (Sinhala; Sinhalese), slk (Slovak), slv (Slovenian), snd (Sindhi), spa (Spanish; Castilian), spa_old (Spanish; Castilian - Old), sqi (Albanian), srp (Serbian), srp_latn (Serbian - Latin), sun (Sundanese), swa (Swahili), swe (Swedish), syr (Syriac), tam (Tamil), tat (Tatar), tel (Telugu), tgk (Tajik), tgl (Tagalog), tha (Thai), tir (Tigrinya), ton (Tonga), tur (Turkish), uig (Uighur; Uyghur), ukr (Ukrainian), urd (Urdu), uzb (Uzbek), uzb_cyrl (Uzbek - Cyrillic), vie (Vietnamese), yid (Yiddish), yor (Yoruba)

rdearman · Postby **rdearman** » Tue Jul 31, 2018 6:21 pm

If you upload the PDF to Google Drive and then open it with Google Docs, Google will convert it automatically using the state-of-the-art converter they use for books. It also does OCR. Simple and quick, with no need to install any software.

Teango · Postby **Teango** » Tue Jul 31, 2018 7:58 pm

I've had success using ABBYY FineReader Express in the past with a variety of tricky fonts and less common languages (although that was about 7 years ago now!) You could always check to see if the company offers a trial version of their latest OCR conversion software and try it out on a sample of your pdf?

cjareck · Postby **cjareck** » Tue Jul 31, 2018 8:13 pm

I am using tesseract and Google docs with success. If you wish to have a graphical frontend for Tesseract, look for yagf. I have it on Linux, but perhaps there is also a Windows version.

zenmonkey · Postby **zenmonkey** » Wed Aug 01, 2018 7:28 am

rdearman wrote: If you upload the PDF to Google Drive and then open it with Google Docs, Google will convert it automatically using the state-of-the-art converter they use for books. It also does OCR. Simple and quick, with no need to install any software.

Very nice. Unfortunately, it's a mixed bag with Setswana.

cjareck wrote:I am using tesseract and Google docs with success. If you wish to have a graphical frontend for Tesseract, look for yagf. I have it on Linux, but perhaps there is also a Windows version.

I've only had terrible output so far, and the front end doesn't work with my OS, so that's out for me.

Teango wrote:I've had success using ABBYY FineReader Express in the past with a variety of tricky fonts and less common languages (although that was about 7 years ago now!) You could always check to see if the company offers a trial version of their latest OCR conversion software and try it out on a sample of your pdf?

Very impressive. They have a 3-page trial exporter with good results (not perfect with old typewriter scans), and they do Setswana. But the cost is 99€, and the results are not good enough in my eyes to justify the purchase.
