ScanSnap, OCR and non-latin scripts

Postby Notorious MIBG » Tue Feb 13, 2018 1:06 am

I'm using the ScanSnap manager with my Fujitsu ScanSnap to OCR (Edit: It uses the ABBYY FineReader engine) a Korean language book I've unbound in order to be able to copy and paste Korean sentences into Anki or other electronic documents, which would otherwise take me an enormous amount of time to type out.

I did the scan today and set the language under the "File Option" tab to Korean and target pages to "All Pages", and of course checked "Convert to Searchable PDF". I turned up the resolution and turned down the compression. In the end I got a beautiful PDF, but when I copied Korean text that was easily readable and on a single line, it came out as complete gibberish when pasted into a text file.

Has anybody had success with OCR on non-Latin scripts?

I'm at a loss.

Best,
MIBG
Last edited by Notorious MIBG on Tue Feb 13, 2018 10:51 am, edited 1 time in total.

Postby mcthulhu » Tue Feb 13, 2018 5:25 am

PDF is almost certainly the problem. This is a very common issue with PDF files, and I've run into it myself before. I think PDF is supposed to be a final product for viewing, and PDF is usually not a very good input format for follow-on processing. For some background, see https://stackoverflow.com/questions/121 ... arbled-pdf, specifically:

"Some PDF files are produced without special information that is crucial for successful extraction of text from them. Even by the Adobe tools. Basically, such files do not contain glyph-to-character mapping information.

"Such files will be displayed and printed just fine (because shapes of the characters are properly defined), but text from them can't be properly copied / extracted (because there is no information about meaning of used glyphs/shapes)."

That sounds like what you've got going on. You might try some utility designed to extract text from PDFs. There are many options out there. One is the pdftotext utility - see https://en.wikipedia.org/wiki/Pdftotext.
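
A minimal sketch of the pdftotext suggestion, written in Python and assuming the Poppler or Xpdf pdftotext binary is installed and on the PATH. The helper name pdftotext_command and the file names are made up for illustration; -enc, -f and -l are existing pdftotext options. If the PDF really lacks the glyph-to-character mapping, the output will probably be just as garbled as a manual copy-and-paste.

```python
import subprocess


def pdftotext_command(pdf_path, txt_path, first_page=None, last_page=None):
    """Build the argument list for a pdftotext run (hypothetical helper)."""
    cmd = ["pdftotext", "-enc", "UTF-8"]   # ask for UTF-8 output explicitly
    if first_page is not None:
        cmd += ["-f", str(first_page)]     # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]      # last page to convert
    return cmd + [pdf_path, txt_path]


def extract_text(pdf_path, txt_path, **page_range):
    """Run pdftotext; raises CalledProcessError if the tool reports a failure."""
    subprocess.run(pdftotext_command(pdf_path, txt_path, **page_range), check=True)
```

Trying it on a single page first, for example extract_text("scan.pdf", "page1.txt", first_page=1, last_page=1), shows quickly whether the text layer is usable at all.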

Otherwise - Does ScanSnap offer any better file save options other than PDF? Ideally you'd want plain text, or Word. If you can at least save page images, you could run those through another OCR program, though that might be more work. Depending on the program, you might be able to OCR your PDF as a single file and save some time. ABBYY FineReader is good. There are lots of options for Korean OCR, if you Google for it, including some smartphone apps. http://www.i2ocr.com/free-online-korean-ocr could help; I haven't tried it, but it's free.

I've used the open source Capture2Text before, and it apparently supports Korean too: http://capture2text.sourceforge.net/. It runs under Windows, and I think it's a wrapper around Tesseract. If you're on Linux, OCRFeeder might work (another front end for Tesseract).

BTW, typing out a book would do wonders for your typing speed...

Postby rdearman » Tue Feb 13, 2018 9:03 am

I remember that another member, EMK, used Anki with ancient Egyptian hieroglyphics. He simply took a photo of the relevant sections and used the image as the prompt, since Anki supports photos as well as other media. It would be quicker, I think.

Postby Notorious MIBG » Tue Feb 13, 2018 10:51 am

Thanks for the information. Unfortunately, ScanSnap uses the ABBYY FineReader engine. Normally that would be a fortunate thing, but since it's considered "the best" and isn't working, it seems particularly unfortunate...

I'm going to try tweaking some more settings to see if I can get a better outcome. Otherwise, I might do what rdearman suggested and just use images.

Postby dampingwire » Tue Feb 13, 2018 5:56 pm

I don't know why ABBYY isn't working for you, but the Stack Overflow answer refers to PDF files that were produced from other document formats, not to scans. For those documents, the answer says that if the PDF refers to standard fonts, copy-and-paste works perfectly well. If it uses embedded fonts and no translation table is provided, copy-and-paste won't work, since all that's known is that this character is #47 in embedded font #2.
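
To make the "character #47 in embedded font #2" point concrete, here is a toy Python sketch. The glyph numbers and the table are invented for the example; in a real PDF the table is the optional ToUnicode CMap attached to the font, and real viewers guess in more elaborate ways than this.

```python
# A text run in a PDF is a sequence of glyph IDs in an embedded font,
# not a sequence of Unicode characters. These IDs are made up.
glyph_ids = [47, 12, 3]

# The optional glyph-to-character table (invented entries).
to_unicode = {47: "한", 12: "국", 3: "어"}


def copy_text(glyphs, mapping=None):
    """Return what a copy-and-paste would yield for the given glyph IDs."""
    if mapping is None:
        # No table: the viewer can only guess, here by treating each glyph ID
        # as if it were a character code. The page still renders correctly.
        return "".join(chr(g) for g in glyphs)
    # With the table, each glyph resolves to its intended character;
    # unmapped glyphs become U+FFFD REPLACEMENT CHARACTER.
    return "".join(mapping.get(g, "\ufffd") for g in glyphs)


with_table = copy_text(glyph_ids, to_unicode)   # readable Korean
without_table = copy_text(glyph_ids)            # a slash plus control characters
```

Both versions would display identically on screen, because display only needs the glyph shapes; only the copied text differs.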

For a scanned document you have dots and their positions. The OCR software then tries to make sense of the dots, and for that to work it has to have some knowledge of the characters it should be looking for. For Korean, it would need to know that it should be looking for characters from some range of Korean fonts. I don't know whether it can be given a range of Korean fonts, but if it can, you may need to train it a little better.

I'd suggest starting with a single page of text and working with that. You should probably scan in 8-bit gray scale rather than the default (which is likely to be JPEG, which is probably the worst possible choice for text).

FWIW I tried ABBYY about 5 years ago on some scans of (English) computer manuals and, while it was 80% correct, that was basically unusable (at 500 letters per page that would mean 100 corrections per page ... not practical for a 50-page manual). I've noticed that Google does index publicly available scans and seems to do a good job, so perhaps serve up a page on the web, persuade Google to index it and see how good that is?
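
The arithmetic behind that estimate, as a small Python sketch. The function name is made up, and it assumes every wrong character needs exactly one manual correction.

```python
def corrections_per_page(accuracy_pct, chars_per_page):
    """Characters per page that still need fixing at a given OCR accuracy."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return chars_per_page * (100 - accuracy_pct) // 100


per_page = corrections_per_page(80, 500)   # 80% correct, 500 letters per page
whole_manual = per_page * 50               # 50-page manual
```

With these numbers, per_page is 100 and whole_manual is 5000. At 99% accuracy the same page would need only 5 corrections, which is why a few points of accuracy matter so much in practice.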

Postby mcthulhu » Tue Feb 13, 2018 11:30 pm

MIBG, the question is not OCR accuracy, but rather getting to an output format more usable than PDF; accuracy is not relevant to the file types used. FineReader has 15 possible output formats, only two of which are PDF. ScanSnap, as far as I can tell (I don't have it), basically gives you a choice of PDF or JPEG.

If you can legally post your scanned book somewhere, I could try to take a look at it.

But if just leaving the Korean as image files and not bothering with OCR works for you, go for it. File sizes will be much larger, but that may not matter to you.

Postby Cainntear » Wed Feb 14, 2018 8:41 pm

My approach to scanning odd scripts/languages would be to just use images initially, and then plan to (but probably never get round to) manually transcribing them later. So if I was making Anki cards, I'd maybe aim to switch 5-10% of the day's cards from JPG to text every day.

Postby Notorious MIBG » Tue Feb 20, 2018 2:17 am

Just an update, since all of you were so kind as to help in the first place: after fiddling with the settings and scanning this textbook 10 or so times, I think I came up with a workflow that will work for me. It's not perfect, but I figure it will help me make flashcards significantly faster than I can now, as the bottleneck has always been, and will likely be long into the future, my typing speed in Korean.

For the person who asked, here's an example page of the textbook that I'm OCR'ing, The Sounds of Korean:

[Attached image: pl9OgRo.png]

With the scan I have now, the Korean sentences on the left-hand side can be copied and pasted almost without flaw.
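
One rough way to check whether text copied out of the PDF is real Korean rather than gibberish is to measure how much of it falls in the Unicode Hangul blocks. A minimal Python sketch; the function names and the 0.5 threshold are arbitrary choices for illustration, and punctuation, digits and Latin letters all count against the ratio.

```python
def is_hangul(ch):
    """True if ch lies in one of the main Unicode Hangul blocks."""
    cp = ord(ch)
    return (
        0xAC00 <= cp <= 0xD7A3       # Hangul Syllables
        or 0x1100 <= cp <= 0x11FF    # Hangul Jamo
        or 0x3130 <= cp <= 0x318F    # Hangul Compatibility Jamo
    )


def hangul_ratio(text):
    """Fraction of non-whitespace characters that are Hangul."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(is_hangul(c) for c in chars) / len(chars)


def looks_korean(text, threshold=0.5):
    """Heuristic: mostly-Hangul text passes, mojibake does not."""
    return hangul_ratio(text) >= threshold
```

Running looks_korean on each sentence before it goes into Anki would flag lines where the text layer came out wrong, so only those need retyping.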

Postby zenmonkey » Tue Jul 31, 2018 5:57 pm

This post is a few months old, but I thought I'd throw in my two cents to help out on the PDF-to-OCR front.

If you want to convert a PDF to text, one possibility is to use Tesseract, the open-source OCR program from Google. You first need to convert the PDF file to an image and then pass the image file through Tesseract.

Install Ghostscript (I use 'brew install ghostscript') and run:

Code:

gs -dNOPAUSE -dBATCH -sDEVICE=tiff24nc -r300 -sOutputFile=output.tif source.pdf

* Note: gs might be named 'gswin64c.exe' or something else on your system.

Install Tesseract ('brew install tesseract') and run:

Code:

tesseract output.tif outtext -l deu+eng


where the -l flag lists the language codes contained in the document (deu+eng here means German plus English; Korean is kor).
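
The two steps can be chained from a script. A minimal Python sketch, assuming gs and tesseract are on the PATH and the Korean language data for Tesseract is installed. The function names, the default file names and the choice of the tiff24nc Ghostscript device are my own assumptions for illustration.

```python
import shutil
import subprocess


def build_commands(pdf_path, langs=("kor",), dpi=300,
                   tiff_path="output.tif", out_base="outtext"):
    """Return the Ghostscript and Tesseract argument lists for one PDF."""
    gs_cmd = [
        "gs", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=tiff24nc",              # 24-bit colour TIFF output
        f"-r{dpi}",                       # rendering resolution in dpi
        f"-sOutputFile={tiff_path}",
        pdf_path,
    ]
    # Tesseract appends ".txt" to out_base; languages are joined with "+".
    ocr_cmd = ["tesseract", tiff_path, out_base, "-l", "+".join(langs)]
    return gs_cmd, ocr_cmd


def run_pipeline(pdf_path, **options):
    """Render the PDF to TIFF, then OCR it. Stops at the first failing step."""
    for tool in ("gs", "tesseract"):
        if shutil.which(tool) is None:
            raise FileNotFoundError(f"{tool} not found on PATH")
    for cmd in build_commands(pdf_path, **options):
        subprocess.run(cmd, check=True)
```

For a Korean textbook with English glosses, run_pipeline("source.pdf", langs=("kor", "eng")) would write the recognized text to outtext.txt.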