ScanSnap, OCR and non-latin scripts

Posted: Tue Feb 13, 2018 1:06 am
by Notorious MIBG
I'm using the ScanSnap manager with my Fujitsu ScanSnap to OCR (Edit: It uses the ABBYY FineReader engine) a Korean language book I've unbound in order to be able to copy and paste Korean sentences into Anki or other electronic documents, which would otherwise take me an enormous amount of time to type out.

I did the scan today and set the language under the "File Option" tab to Korean and target pages to "All Pages", and of course checked "Convert to Searchable PDF". I turned up the resolution and turned down the compression. In the end I got a beautiful PDF, but when I copied Korean text from it, even text that was easily readable and on a single line, it came out as complete gibberish when pasted into a text file.

Has anybody had success with non-latin foreign language scripts and OCR?

I'm at a loss.

Best,
MIBG

Re: ScanSnap, OCR and non-latin scripts

Posted: Tue Feb 13, 2018 5:25 am
by mcthulhu
PDF is almost certainly the problem. This is a very common issue with PDF files, and I've run into it myself before. I think PDF is supposed to be a final product for viewing, and PDF is usually not a very good input format for follow-on processing. For some background, see https://stackoverflow.com/questions/121 ... arbled-pdf, specifically:

"Some PDF files are produced without special information that is crucial for successful extraction of text from them. Even by the Adobe tools. Basically, such files do not contain glyph-to-character mapping information.

"Such files will be displayed and printed just fine (because shapes of the characters are properly defined), but text from them can't be properly copied / extracted (because there is no information about meaning of used glyphs/shapes)."

That sounds like what you've got going on. You might try some utility designed to extract text from PDFs. There are many options out there. One is the pdftotext utility - see https://en.wikipedia.org/wiki/Pdftotext.
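
For anyone who wants to script the pdftotext suggestion, here is a minimal sketch in Python. The file names are made up, and it assumes pdftotext (from Poppler or Xpdf) is on your PATH. Keep in mind that if the PDF's text layer really lacks the glyph-to-character mapping, pdftotext will probably give you the same gibberish as copy-and-paste.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftotext_cmd(pdf_path, txt_path):
    """Return the argument list for pdftotext, forcing UTF-8 output."""
    return ["pdftotext", "-enc", "UTF-8", str(pdf_path), str(txt_path)]


def extract_text(pdf_path, txt_path):
    """Run pdftotext and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    subprocess.run(build_pdftotext_cmd(pdf_path, txt_path), check=True)
    return Path(txt_path).read_text(encoding="utf-8")


# Hypothetical usage:
# text = extract_text("korean_book.pdf", "korean_book.txt")
```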

Otherwise - Does ScanSnap offer any better file save options other than PDF? Ideally you'd want plain text, or Word. If you can at least save page images, you could run those through another OCR program, though that might be more work. Depending on the program, you might be able to OCR your PDF as a single file and save some time. ABBYY FineReader is good. There are lots of options for Korean OCR, if you Google for it, including some smartphone apps. http://www.i2ocr.com/free-online-korean-ocr could help; I haven't tried it, but it's free.

I've used the open source Capture2Text before, and it apparently supports Korean too: http://capture2text.sourceforge.net/. It runs under Windows, and I think it's a wrapper around Tesseract. If you're on Linux, OCRFeeder might work (another front end for Tesseract).

BTW, typing out a book would do wonders for your typing speed...

Re: ScanSnap, OCR and non-latin scripts

Posted: Tue Feb 13, 2018 9:03 am
by rdearman
I remember another member, EMK, used Anki with ancient Egyptian hieroglyphics, and he simply took a photo of the relevant sections and used the image as the prompt, since Anki supports photos as well as other media. It would be quicker, I think.

Re: ScanSnap, OCR and non-latin scripts

Posted: Tue Feb 13, 2018 10:51 am
by Notorious MIBG
mcthulhu wrote: PDF is almost certainly the problem. [...]

Thanks for the information. Unfortunately, ScanSnap uses the ABBYY FineReader engine. Normally that would be a fortunate thing, but since it's considered "the best" and isn't working, it seems particularly unfortunate...

I'm going to try to tweak some more settings to see if I can get a better outcome. Otherwise, I might do what rdearman suggested and just use images.

Re: ScanSnap, OCR and non-latin scripts

Posted: Tue Feb 13, 2018 5:56 pm
by dampingwire
I don't know why ABBYY isn't working for you, but the Stack Overflow answer refers to PDF files that were produced from other document formats, not to scans. For those documents, the answer says the PDF can either refer to standard fonts, in which case copy-and-paste works perfectly well, or use embedded fonts, in which case copy-and-paste won't work if no translation table is provided (since all that's known is that this character is #47 in embedded font #2).
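
To make the "character #47 in embedded font #2" point concrete, here is a toy model in Python. It is only a sketch: the glyph IDs and the table are invented, and real PDFs store this mapping in a ToUnicode CMap rather than a plain dictionary. It shows why the same text run copies correctly with a translation table and turns into junk without one.

```python
# A PDF text run stores glyph IDs in a font, not Unicode characters.
GLYPH_RUN = [47, 50, 47]  # hypothetical glyph IDs in an embedded font

# Optional translation table (the job a ToUnicode map does in a real PDF).
TO_UNICODE = {47: "한", 50: "국"}


def extract(glyph_ids, to_unicode=None):
    """Turn glyph IDs into text, with or without a translation table."""
    if to_unicode is None:
        # No table: all the viewer can do is guess, for example by
        # treating each glyph ID as if it were a character code.
        return "".join(chr(g) for g in glyph_ids)
    # With a table: look each glyph up, marking unknown ones with U+FFFD.
    return "".join(to_unicode.get(g, "\ufffd") for g in glyph_ids)


print(extract(GLYPH_RUN, TO_UNICODE))  # 한국한
print(extract(GLYPH_RUN))              # /2/
```

The page renders identically in both cases because the glyph shapes are the same; only the copied text differs.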

For a scanned document you have dots and their positions. The OCR software then tries to make sense of the dots. For that to work it has to have some knowledge of the characters it should be looking for. So for Korean it would need to know that it should be looking for Hangul, as drawn in some range of Korean fonts. I don't know whether it can be set up for a range of Korean fonts, but if it can, then you may need to train it a little better.
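
A quick way to tell whether the OCR actually looked for Korean is to measure how much of the pasted text falls inside the Hangul Unicode blocks. The sketch below is an assumption about how one might check this, not anything ScanSnap or ABBYY provides; the function names are made up. A ratio near 1.0 suggests a Korean text layer, while a ratio near 0.0 on a Korean page suggests the text layer is wrong.

```python
# Unicode blocks that cover modern Korean text.
HANGUL_RANGES = (
    (0xAC00, 0xD7A3),  # Hangul Syllables
    (0x1100, 0x11FF),  # Hangul Jamo
    (0x3130, 0x318F),  # Hangul Compatibility Jamo
)


def is_hangul(ch):
    """Return True if the character lies in one of the Hangul blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in HANGUL_RANGES)


def hangul_ratio(text):
    """Fraction of non-whitespace characters that are Hangul."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(is_hangul(c) for c in chars) / len(chars)


print(hangul_ratio("한국어 공부"))  # 1.0
print(hangul_ratio("ÇÑ±¹¾î"))      # 0.0
```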

I'd suggest starting with a single page of text and working with that. You should probably scan in 8-bit grayscale rather than the default (which is likely to be JPEG, probably the worst possible choice for text).

FWIW I tried ABBYY about 5 years ago on some scans of (English) computer manuals and, while it was 80% correct, that was basically unusable (at 500 letters per page that would mean 100 corrections per page ... not practical for a 50 page manual). I've noticed that Google does index publicly available scans and seems to do a good job, so perhaps serve up a page on the web, persuade Google to index it and see how good that is?
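
The arithmetic behind that estimate, as a small Python sketch. It assumes errors are spread evenly and that every wrong letter needs exactly one correction, which is a simplification.

```python
def corrections_needed(chars_per_page, accuracy_pct, pages=1):
    """Estimate manual corrections given OCR accuracy in whole percent."""
    errors_per_page = chars_per_page * (100 - accuracy_pct) // 100
    return errors_per_page * pages


print(corrections_needed(500, 80))      # 100 per page
print(corrections_needed(500, 80, 50))  # 5000 for a 50 page manual
print(corrections_needed(500, 99, 50))  # 250 at 99% accuracy
```

Even at 99% accuracy a 50 page manual still needs a few hundred fixes, so whether OCR beats typing depends heavily on the accuracy you actually get.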

Re: ScanSnap, OCR and non-latin scripts

Posted: Tue Feb 13, 2018 11:30 pm
by mcthulhu
MIBG, the question is not OCR accuracy, but rather getting to an output format more usable than PDF; accuracy is not relevant to the file types used. FineReader has 15 possible output formats, only two of which are PDF. ScanSnap, as far as I can tell (I don't have it), basically gives you a choice of PDF or JPEG.

If you can legally post your scanned book somewhere, I could try to take a look at it.

But if just leaving the Korean as image files and not bothering with OCR works for you, go for it. File sizes will be much larger, but that may not matter to you.

Re: ScanSnap, OCR and non-latin scripts

Posted: Wed Feb 14, 2018 8:41 pm
by Cainntear
My approach to scanning odd scripts/languages would be to just use images initially, and then plan to (but probably never get round to) manually transcribing them later. So if I were making Anki cards, I'd maybe aim to switch 5-10% of the day's cards from JPG to text every day.

Re: ScanSnap, OCR and non-latin scripts

Posted: Tue Feb 20, 2018 2:17 am
by Notorious MIBG
Just an update, since all of you were so kind to help in the first place: after fiddling with the settings and scanning this textbook 10 or so times, I think I came up with a workflow that will work for me. It's not perfect, but I figure it will help me make flashcards significantly faster than I can now, as the bottleneck has always been, and will likely be long into the future, my typing speed in Korean.

For the person who asked, here's an example page of the textbook that I'm OCR'ing, The Sounds of Korean:

[Image: pl9OgRo.png]

With the job I've done now the sentences on the left-hand-side in Korean are able to be copied and pasted almost without flaw.

Re: ScanSnap, OCR and non-latin scripts

Posted: Tue Jul 31, 2018 5:57 pm
by zenmonkey
This post is a few months old but I thought I'd throw in my two cents to help out on the PDF-to-OCR front.

If you want to convert a PDF to text, then one possibility is to use Tesseract, the open-source OCR program from Google. You need to first convert the PDF file to an image and then pass the image file through Tesseract.

Install Ghostscript (I use 'brew install ghostscript') and run:

Code:

gs -dNOPAUSE -dBATCH -sDEVICE=tiffgray -r300 -sOutputFile=output.tif source.pdf

* Note: gs might be named 'gswin64c.exe' or something else on your system.

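Install Tesseract ('brew install tesseract') and run:

Code:

tesseract output.tif outtext -l deu+eng


where the -l flag gives the codes of the languages contained in the document, joined with '+' (deu+eng is German plus English; for Korean the code is kor). The recognized text is written to outtext.txt.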
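
If you do this more than once, the two steps can be chained from a small script. This is only a sketch in Python under a few assumptions: both tools are on your PATH, Ghostscript's tiffgray device is used for the image, the Korean language data for Tesseract is installed, and the file names are placeholders.

```python
import subprocess


def gs_cmd(pdf, tif, dpi=300):
    """Ghostscript arguments to render a PDF to a grayscale TIFF."""
    return ["gs", "-dNOPAUSE", "-dBATCH", "-sDEVICE=tiffgray",
            f"-r{dpi}", f"-sOutputFile={tif}", pdf]


def tesseract_cmd(image, out_base, langs=("kor", "eng")):
    """Tesseract arguments; language codes are joined with '+'."""
    return ["tesseract", image, out_base, "-l", "+".join(langs)]


def pdf_to_text(pdf, out_base="outtext"):
    """Render the PDF, OCR it, and return the name of the text file."""
    tif = out_base + ".tif"
    subprocess.run(gs_cmd(pdf, tif), check=True)
    subprocess.run(tesseract_cmd(tif, out_base), check=True)
    return out_base + ".txt"  # Tesseract appends .txt to the output base


# Hypothetical usage:
# print(pdf_to_text("source.pdf"))
```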