I'm using ScanSnap Manager with my Fujitsu ScanSnap to OCR a Korean-language book that I've unbound (Edit: it uses the ABBYY FineReader engine). The goal is to be able to copy and paste Korean sentences into Anki or other electronic documents, which would otherwise take me an enormous amount of time to type out.
I did the scan today and set the language under the "File Option" tab to Korean and target pages to "All Pages", and of course checked "Convert to Searchable PDF". I turned up the resolution and turned down the compression. In the end I got a beautiful PDF, but when I copied Korean text from it, even text that was easily readable and on a single line, it came out as complete gibberish when pasted into a text file.
Has anybody had success with OCR on non-Latin scripts?
I'm at a loss.
Best,
MIBG
ScanSnap, OCR and non-latin scripts
-
- White Belt
- Posts: 21
- Joined: Tue Oct 10, 2017 1:50 am
- Languages: Native: English
Beginner: Korean - Language Log: https://forum.language-learners.org/vie ... =15&t=7752
- x 28
Last edited by Notorious MIBG on Tue Feb 13, 2018 10:51 am, edited 1 time in total.
0 x
Pimsleur Korean 1 + 2:
Building KOLUP 1000 Sentence Anki Deck:
-
- Orange Belt
- Posts: 228
- Joined: Sun Feb 26, 2017 4:01 pm
- Languages: English (native); strong reading skills - Russian, Spanish, French, Italian, German, Serbo-Croatian, Macedonian, Bulgarian, Slovene, Farsi; fair reading skills - Polish, Czech, Dutch, Esperanto, Portuguese; beginner/rusty - Swedish, Norwegian, Danish
- x 590
Re: ScanSnap, OCR and non-latin scripts
PDF is almost certainly the problem. This is a very common issue with PDF files, and I've run into it myself before. I think PDF is supposed to be a final product for viewing, and PDF is usually not a very good input format for follow-on processing. For some background, see https://stackoverflow.com/questions/121 ... arbled-pdf, specifically:
"Some PDF files are produced without special information that is crucial for successful extraction of text from them. Even by the Adobe tools. Basically, such files do not contain glyph-to-character mapping information.
"Such files will be displayed and printed just fine (because shapes of the characters are properly defined), but text from them can't be properly copied / extracted (because there is no information about meaning of used glyphs/shapes)."
That sounds like what you've got going on. You might try some utility designed to extract text from PDFs. There are many options out there. One is the pdftotext utility - see https://en.wikipedia.org/wiki/Pdftotext.
Otherwise - Does ScanSnap offer any better file save options other than PDF? Ideally you'd want plain text, or Word. If you can at least save page images, you could run those through another OCR program, though that might be more work. Depending on the program, you might be able to OCR your PDF as a single file and save some time. ABBYY FineReader is good. There are lots of options for Korean OCR, if you Google for it, including some smartphone apps. http://www.i2ocr.com/free-online-korean-ocr could help; I haven't tried it, but it's free.
I've used the open source Capture2Text before, and it apparently supports Korean too: http://capture2text.sourceforge.net/. It runs under Windows, and I think it's a wrapper around Tesseract. If you're on Linux, OCRFeeder might work (another front end for Tesseract).
BTW, typing out a book would do wonders for your typing speed...
3 x
- rdearman
- Site Admin
- Posts: 7231
- Joined: Thu May 14, 2015 4:18 pm
- Location: United Kingdom
- Languages: English (N)
- Language Log: viewtopic.php?f=15&t=1836
- x 23128
- Contact:
Re: ScanSnap, OCR and non-latin scripts
I remember that another member, EMK, used Anki with ancient Egyptian hieroglyphics: he simply took a photo of the relevant sections and used the image as the prompt, since Anki supports photos as well as other media. It would be quicker, I think.
1 x
: Read 150 books in 2024
My YouTube Channel
The Autodidactic Podcast
My Author's Newsletter
I post on this forum with mobile devices, so excuse short msgs and typos.
-
- White Belt
- Posts: 21
- Joined: Tue Oct 10, 2017 1:50 am
- Languages: Native: English
Beginner: Korean - Language Log: https://forum.language-learners.org/vie ... =15&t=7752
- x 28
Re: ScanSnap, OCR and non-latin scripts
mcthulhu wrote: PDF is almost certainly the problem. This is a very common issue with PDF files, and I've run into it myself before. I think PDF is supposed to be a final product for viewing, and PDF is usually not a very good input format for follow-on processing. For some background, see https://stackoverflow.com/questions/121 ... arbled-pdf, specifically:
"Some PDF files are produced without special information that is crucial for successful extraction of text from them. Even by the Adobe tools. Basically, such files do not contain glyph-to-character mapping information.
"Such files will be displayed and printed just fine (because shapes of the characters are properly defined), but text from them can't be properly copied / extracted (because there is no information about meaning of used glyphs/shapes)."
That sounds like what you've got going on. You might try some utility designed to extract text from PDFs. There are many options out there. One is the pdftotext utility - see https://en.wikipedia.org/wiki/Pdftotext.
Otherwise - Does ScanSnap offer any better file save options other than PDF? Ideally you'd want plain text, or Word. If you can at least save page images, you could run those through another OCR program, though that might be more work. Depending on the program, you might be able to OCR your PDF as a single file and save some time. ABBYY FineReader is good. There are lots of options for Korean OCR, if you Google for it, including some smartphone apps. http://www.i2ocr.com/free-online-korean-ocr could help; I haven't tried it, but it's free.
I've used the open source Capture2Text before, and it apparently supports Korean too: http://capture2text.sourceforge.net/. It runs under Windows, and I think it's a wrapper around Tesseract. If you're on Linux, OCRFeeder might work (another front end for Tesseract).
BTW, typing out a book would do wonders for your typing speed...
Thanks for the information. Unfortunately, ScanSnap uses the ABBYY FineReader engine. Normally that would be a fortunate thing, but since it's considered "the best" and isn't working, it seems particularly unfortunate...
I'm going to try tweaking some more settings to see if I can get a better outcome. Otherwise, I might do what rdearman suggested and just use images.
0 x
-
- Blue Belt
- Posts: 559
- Joined: Tue Aug 04, 2015 8:11 pm
- Location: Abingdon, UK
- Languages: Italian (N), English (N), French (poor, not studying), Japanese (studying, JLPT N3)
- x 609
Re: ScanSnap, OCR and non-latin scripts
I don't know why ABBYY isn't working for you, but the Stack Overflow answer refers to PDF files that are produced from other document formats, not to scans. For those documents, the answer is saying that the file can refer to standard fonts, in which case copy-and-paste works perfectly well, or it can use embedded fonts, in which case copy-and-paste won't work if no translation table is provided (since all that's known is that this character is #47 in embedded font #2).
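To make the embedded-font case concrete, here is a toy Python sketch. It is not real PDF parsing: the glyph numbers, `PAGE_GLYPHS`, `TO_UNICODE` and `extract_text` are all made up for illustration. The idea is only that a page stores glyph IDs from an embedded font, and copying text works when a translation table maps those IDs back to Unicode.

```python
# Toy model: a page stores glyph IDs from an embedded font, not characters.
# The IDs and the mapping below are invented for illustration only.
PAGE_GLYPHS = [47, 12, 3]

# Optional translation table (glyph ID -> Unicode character).
TO_UNICODE = {47: "한", 12: "국", 3: "어"}


def extract_text(glyph_ids, to_unicode=None):
    """Return the text a copy-and-paste would produce for these glyphs."""
    if to_unicode is None:
        # Without a table the viewer can only guess; the guess modelled here
        # is to treat each glyph ID as a character code, which is gibberish.
        return "".join(chr(g) for g in glyph_ids)
    # U+FFFD marks any glyph that the table does not cover.
    return "".join(to_unicode.get(g, "\ufffd") for g in glyph_ids)


with_table = extract_text(PAGE_GLYPHS, TO_UNICODE)
without_table = extract_text(PAGE_GLYPHS)
print(with_table)           # readable Korean
print(repr(without_table))  # stray symbols and control characters
```

Both calls describe the same page that displays and prints correctly; only the presence of the table decides whether the pasted text is usable.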
For a scanned document you have dots and their positions. The OCR software then tries to make sense of the dots. For that to work it has to have some knowledge of the characters it should be looking for. So for Korean text it would need to know that it should be looking for characters from some range of Korean fonts. I don't know if a range of Korean fonts is possible, but if it is then you may need to train it a little better.
I'd suggest starting with a single page of text and working with that. You should probably scan in 8-bit gray scale rather than the default (which is likely to be JPEG, which is probably the worst possible choice for text).
FWIW I tried ABBYY about 5 years ago on some scans of (English) computer manuals and, while it was 80% correct, that was basically unusable (at 500 letters per page that would mean 100 corrections per page ... not practical for a 50-page manual). I've noticed that Google does index publicly available scans and seems to do a good job, so perhaps serve up a page on the web, persuade Google to index it and see how good that is?
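The arithmetic behind that estimate can be sketched in a few lines of Python. The helper name `corrections_needed` is hypothetical, and the 99% figure is only an assumed comparison point, not a measurement of any OCR product.

```python
def corrections_needed(chars_per_page, accuracy, pages=1):
    """Expected number of wrong characters at a given per-character accuracy."""
    return round(chars_per_page * (1.0 - accuracy) * pages)


per_page = corrections_needed(500, 0.80)               # one page at 80%
whole_manual = corrections_needed(500, 0.80, pages=50)  # the 50-page manual
at_99 = corrections_needed(500, 0.99)                   # assumed comparison

print(per_page, whole_manual, at_99)
```

Per-character accuracy has to get very close to 100% before hand-correcting a whole book becomes realistic.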
1 x
新完全マスター N2聴解 | : | 新完全マスター N2読解 | : |
新完全マスター N2文法 | : | TY Comp. German | : |
-
- Orange Belt
- Posts: 228
- Joined: Sun Feb 26, 2017 4:01 pm
- Languages: English (native); strong reading skills - Russian, Spanish, French, Italian, German, Serbo-Croatian, Macedonian, Bulgarian, Slovene, Farsi; fair reading skills - Polish, Czech, Dutch, Esperanto, Portuguese; beginner/rusty - Swedish, Norwegian, Danish
- x 590
Re: ScanSnap, OCR and non-latin scripts
MIBG, the question is not OCR accuracy, but rather getting to an output format more usable than PDF; accuracy is not relevant to the file types used. FineReader has 15 possible output formats, only two of which are PDF. ScanSnap, as far as I can tell (I don't have it), basically gives you a choice of PDF or JPEG.
If you can legally post your scanned book somewhere, I could try to take a look at it.
But if just leaving the Korean as image files and not bothering with OCR works for you, go for it. File sizes will be much larger, but that may not matter to you.
0 x
-
- Black Belt - 3rd Dan
- Posts: 3472
- Joined: Thu Jul 30, 2015 11:04 am
- Location: Scotland
- Languages: English(N)
Advanced: French,Spanish, Scottish Gaelic
Intermediate: Italian, Catalan, Corsican
Basic: Welsh
Dabbling: Polish, Russian etc - x 8672
- Contact:
Re: ScanSnap, OCR and non-latin scripts
My approach to scanning odd scripts/languages would be to just use images initially, and then plan to (but probably never get round to) manually transcribe them later. So if I were making Anki cards, I'd maybe aim to switch 5-10% of the day's cards from jpg to text every day.
1 x
-
- White Belt
- Posts: 21
- Joined: Tue Oct 10, 2017 1:50 am
- Languages: Native: English
Beginner: Korean - Language Log: https://forum.language-learners.org/vie ... =15&t=7752
- x 28
Re: ScanSnap, OCR and non-latin scripts
Just an update, since all of you were kind enough to help in the first place: after fiddling with the settings and scanning this textbook 10 or so times, I think I've come up with a workflow that will work for me. It's not perfect, but I figure it will help me make flashcards significantly faster than I can now, as the bottleneck has always been, and will likely remain long into the future, my typing speed in Korean.
For the person who asked, here's an example page of the textbook that I'm OCR'ing, The Sounds of Korean:
With the job I've done now, the Korean sentences on the left-hand side can be copied and pasted almost without flaw.
0 x
- zenmonkey
- Black Belt - 2nd Dan
- Posts: 2528
- Joined: Sun Jul 26, 2015 7:21 pm
- Location: California, Germany and France
- Languages: Spanish, English, French trilingual - German (B2/C1) on/off study: Persian, Hebrew, Tibetan, Setswana.
Some knowledge of Italian, Portuguese, Ladino, Yiddish ...
Want to tackle Tzotzil, Nahuatl - Language Log: viewtopic.php?f=15&t=859
- x 7030
- Contact:
Re: ScanSnap, OCR and non-latin scripts
This post is a few months old, but I thought I'd throw in my two cents to help out on the PDF-to-OCR front.
If you want to convert a PDF to text, one possibility is to use Tesseract, the open-source OCR engine whose development Google sponsored for years. You first need to convert the PDF file to an image and then pass the image file through Tesseract.
Install Ghostscript (I use 'brew install ghostscript') and run:
Code:
gs -dNOPAUSE -dBATCH -sDEVICE=tiffgray -r300 -sOutputFile=output.tif source.pdf
* Note: gs might have a different name on your system (on Windows it is typically 'gswin64c.exe').
Install Tesseract ('brew install tesseract') along with the data for the languages you need, then run:
Code:
tesseract output.tif outtext -l deu+eng
where the -l flag gives the language codes contained in the document, here German plus English; for Korean the code is kor.
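For anyone who wants to script the two steps for a Korean book, here is a minimal Python sketch. The helper names `build_commands` and `pdf_to_text` and the file names are hypothetical. It assumes `gs` and `tesseract` are on your PATH and that Tesseract's Korean data (`kor`) is installed. It uses `tiffgray`, one of Ghostscript's TIFF output devices, to get an 8-bit grayscale image.

```python
import shutil
import subprocess


def build_commands(pdf_path, out_base, lang="kor", dpi=300):
    """Return the Ghostscript and Tesseract argument lists for one PDF."""
    tif_path = out_base + ".tif"
    gs_cmd = [
        "gs", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=tiffgray",          # 8-bit grayscale TIFF output
        "-r" + str(dpi),              # rendering resolution in dpi
        "-sOutputFile=" + tif_path,
        pdf_path,
    ]
    # Tesseract appends ".txt" to the output base name itself.
    ocr_cmd = ["tesseract", tif_path, out_base, "-l", lang]
    return gs_cmd, ocr_cmd


def pdf_to_text(pdf_path, out_base, lang="kor"):
    """Run both steps; raises if either tool is missing or fails."""
    for cmd in build_commands(pdf_path, out_base, lang):
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(cmd[0] + " is not on PATH")
        subprocess.run(cmd, check=True)
    return out_base + ".txt"


# Show the commands without running them.
gs_cmd, ocr_cmd = build_commands("source.pdf", "outtext")
print(" ".join(gs_cmd))
print(" ".join(ocr_cmd))
```

Calling `pdf_to_text("source.pdf", "outtext")` would then leave the recognized text in `outtext.txt`, if both tools run cleanly.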
0 x
I am a leaf on the wind, watch how I soar