progress report on reading application
Posted: Sat Apr 08, 2017 3:42 pm
I've mentioned this application I've been developing once or twice, and thought I'd post a few notes on how it's going. I've been using it for a while and will post it on GitHub as open source once it's a little more polished. Its main focus is on supporting reading, since that represents most of my language study, though it will also be a front end to a number of other language-related tools. I'm quite aware that its functionality overlaps with a number of existing tools, but there were things I wanted to add for my own use.
It runs under NW.js (previously Node-Webkit), which allows developing desktop applications with JavaScript, HTML, and CSS; the internal database uses the WebSQL version of SQLite. There's also a Python interface of sorts (more later). To some extent this is a rewrite of a Firefox XUL extension I'd written for myself earlier.
The book formats supported for now are .epub and .txt; I have large numbers of books in both formats and can convert anything else that's not copy-protected to them. To do: run Calibre's conversion tool in the background to import other formats transparently, and just display the output (this would require that Calibre be installed). Ideally the book format should be more or less irrelevant. The tool can also open a URL and safely display its text content for reading.
The languages in the language menu are hard-coded for now, with English hard-coded as the user's language. To do: make this configurable by the user.
The user interface has changed a few times, but the main part of the window currently has separate tabs for the text of the current section or chapter of the book, vocabulary, sentences, and text analysis. The Text tab has a toolbar and a table-of-contents drop-down menu above the text display. The toolbar includes buttons for dictionary searches, chapter navigation, zooming the font size in or out, enabling a virtual keyboard, etc. The Vocabulary and Sentences tabs are automatically populated when a chapter is displayed: the Vocabulary tab shows a table of each word (token) found in the chapter, in order of frequency, and the Sentences tab shows a table of sentences, split with a hard-coded regular expression. Both have a filter-as-you-type search field that restricts the table to matching entries, so that you can see, for example, all sentences containing a given expression for which you want to make a flashcard. I currently have this set to fuzzy matches rather than exact matches; I haven't quite decided which is better. To do: add SRX (Segmentation Rules eXchange) support for better sentence segmentation; populate additional columns of the vocabulary and sentence tables with matches from the dictionary and translation memory databases, respectively (the latter would allow a parallel text display where possible); and display a mini-glossary for the selected sentence.
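The chapter tables above boil down to tokenizing, counting, and splitting. A rough Python sketch (the function names and the exact regular expressions are mine, not the tool's, and "fuzzy" here just means substring matching on query words, which may not be how the real filter works):

```python
import re
from collections import Counter

def chapter_tables(text):
    """Build the vocabulary and sentence tables for one chapter.

    A sketch: tokenize on word characters, count frequencies, and split
    sentences with a simple hard-coded regex (as the tool currently does).
    """
    tokens = re.findall(r"\w+", text.lower())
    vocabulary = Counter(tokens).most_common()   # (word, count), by frequency
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return vocabulary, sentences

def fuzzy_filter(sentences, query):
    """Filter-as-you-type: keep sentences that contain every query word as a
    substring -- a loose stand-in for the tool's fuzzy matching."""
    words = query.lower().split()
    return [s for s in sentences if all(w in s.lower() for w in words)]
```

The `most_common()` ordering gives exactly the frequency-sorted vocabulary table; the sentence regex is deliberately naive, which is why SRX support is on the to-do list.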
Dictionary search is currently done by highlighting a word or phrase and clicking a button. (To do: provide mouse-over definitions, with different colored highlighting for known words.) The default online dictionary is Glosbe; the tool extracts definitions and displays them in a more compact definition area. There are menu options to search Google Images (which is surprisingly effective for foreign-language words), WordReference, and Linguee instead, with those sites shown in a popup window for now, as the easiest approach; I may decide later to parse those results too and show all the definitions in the same format. To do: add support for many more online dictionaries, along with a way for users to add URL search templates without editing the source code; allow searching multiple dictionaries in parallel and collating the results; etc.
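User-added URL search templates could be as simple as a table of template strings with placeholders filled in per query. A minimal sketch; the {from}/{to}/{term} placeholder convention is my own, and the URLs are illustrative examples, not the sites' documented formats:

```python
from urllib.parse import quote

# Hypothetical templates -- illustrative URL shapes, not verified site APIs.
SEARCH_TEMPLATES = {
    "Glosbe":  "https://glosbe.com/{from}/{to}/{term}",
    "Linguee": "https://www.linguee.com/{from}-{to}/search?query={term}",
}

def build_search_url(site, term, lang_from, lang_to):
    """Expand a URL search template for a dictionary query, percent-encoding
    the search term so phrases survive the trip."""
    template = SEARCH_TEMPLATES[site]
    return (template
            .replace("{from}", lang_from)
            .replace("{to}", lang_to)
            .replace("{term}", quote(term)))
```

Letting users edit that table (from a settings file, say) is all "without editing the source code" would require.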
The tool has a simple internal dictionary or glossary database to which you can add a term and definition; there are options to import and export dictionaries using tab-delimited text files (a format that should work with Anki). When a search is done, results from the internal dictionary are shown if they exist; otherwise the tool goes out to an online dictionary, which is only slightly slower. To do: expand the internal dictionary to support term base (TBX) fields, and add TBX import and export options. Not an immediate priority; simple is fine for now.
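The tab-delimited round trip is straightforward; a sketch using Python's csv module with a tab delimiter, assuming plain term/definition pairs, one per line (the simple format Anki's importer accepts):

```python
import csv
import io

def export_glossary(entries):
    """Write (term, definition) pairs as tab-delimited text, one per line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerows(entries)
    return buf.getvalue()

def import_glossary(text):
    """Read tab-delimited term/definition pairs back into a list of tuples,
    skipping any malformed short lines."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [(row[0], row[1]) for row in reader if len(row) >= 2]
```

Using csv rather than naive `split("\t")` means definitions containing tabs or newlines survive the round trip via quoting.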
You can also, of course, enter a dictionary query from the keyboard, an option I use when I'm reading a hard-copy book and I'm just using this tool as a front end for dictionaries. Since my touch-typing is rudimentary outside a handful of languages, I've added an option to display a virtual keyboard attached to the search field. To do: create keyboard layouts for languages not already supported by the virtual-keyboard package. Esperanto is missing, for example.
The tool keeps track of what words you've looked up in the current section and can display them. To do: export just these words and their translations in a format that Anki can import; track how many times I've looked up the same word. I will add an option to review random flashcards, with examples automatically taken from the current text, and possibly with multiple choice options, but I am not planning to recreate an SRS tool. Anki is more than enough.
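The per-section bookkeeping could look something like this; the class and its fields are my invention for illustration, not the tool's actual schema:

```python
from collections import defaultdict

class LookupTracker:
    """Track which words were looked up in the current section, how often,
    and the definition found -- enough to drive an Anki export."""

    def __init__(self):
        self.counts = defaultdict(int)   # word -> number of lookups
        self.definitions = {}            # word -> last definition shown

    def record(self, word, definition):
        self.counts[word] += 1
        self.definitions[word] = definition

    def export_tab_delimited(self):
        """Emit just the looked-up words and definitions, one pair per line,
        in a form Anki can import."""
        return "\n".join(f"{w}\t{d}" for w, d in self.definitions.items())
```

The count field is what "track how many times I've looked up the same word" needs; words looked up repeatedly are obvious flashcard candidates.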
When a chapter is loaded, the status bar automatically displays some simple statistics: total words, number of unique words (types), number of sentences, and average sentence length. For other text analysis (collocation detection, part-of-speech annotation and syntax chunking, and possibly a WordNet and corpus interface where available), I'll be using the Python Natural Language Toolkit (NLTK) and possibly other Python libraries. I've sent text from this tool to NLTK, run a proof-of-concept test script in the background, and displayed the output in the Analysis tab, and I'm quite pleased with how smoothly the integration works. To do: add more interesting sample scripts that can be run from the tool's menu by default, and add options to create and edit Python scripts from within the tool. There are some Java packages I'd like to experiment with too (e.g., MALLET). These options will require that any other user also have Python and Java installed. I will probably also add an option for user scripts in JavaScript. It's an article of faith for me that tools need to be scriptable and customizable.
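The background-script integration amounts to piping chapter text to a script's stdin and capturing its stdout for the Analysis tab. In the app itself this is done from NW.js (e.g., via Node's child_process); here is the same round trip sketched from Python, with a hypothetical helper name:

```python
import subprocess
import sys

def run_analysis_script(script_path, text):
    """Feed chapter text to an external Python script on stdin and return
    its stdout, which the app would display in the Analysis tab.

    check=True raises if the script exits nonzero, so broken user scripts
    fail loudly instead of silently showing empty output.
    """
    result = subprocess.run(
        [sys.executable, script_path],
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

The script itself needs no special interface beyond reading stdin and printing, which keeps user scripts (NLTK or otherwise) trivially pluggable.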
There are Tools menu options for searching Verbix for verb infinitives and conjugations, getting a word's pronunciation from Forvo, getting a machine translation of selected text from Google Translate, searching the Leipzig corpora databases, etc. Thesaurus searches might be useful to add when WordNet is available for a language.
Although listening is not a high priority for me, I've been experimenting with the text-to-speech options available from NW.js (read: Chromium) for selected passages. It doesn't work as well as Firefox does with locally installed voices. Google TTS was working for a while through a Node.js package and then broke. I may look into AWS Polly, which I think has high-quality voices for 24 languages and a 1,500-character limit per request; other users would need to have their own AWS credentials set up locally. I could always fall back on eSpeak, I suppose, as I've done in the past.
The tool includes a translation memory database (bilingual corpus) to store parallel segments/sentences. Translation memory is a technology very familiar to me, and supporting it is a priority. I want to allow bilingual concordance searches against the local TM for reference purposes, to show how a word or expression is used in context, with translations. I'm currently retrieving sample sentence pairs online through the Glosbe TM API, and will be adding similar searches for Linguee, TAUS, Tatoeba, etc. I'll be able to import and export TMX files compatible with those produced by LF Aligner; I don't plan to add any internal text alignment capabilities, at least for now. I don't need much TM editing capability either, since this is not intended as a CAT tool for extensive translation work, even though it mirrors the typical database structure of a CAT tool; but it will have an option to enter user translations to add to the database. Another application of a TM, for self-testing, would be to let a user enter translations of sentences, either random or in order, and in either direction, and compare them to the "official" translations in the TM. I might add a secondary database to store user translations and their "corrections" for later searching while translating. I've also experimented in the past with deducing translations of words from a translation memory, and word alignment to help populate a glossary would be another potential TM application.
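TMX is XML underneath, so a minimal importer plus a concordance search over the resulting pairs can be sketched with the standard library. This handles only the bare tu/tuv/seg structure; real LF Aligner output carries a header and metadata that this ignores:

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"

def import_tmx(tmx_text):
    """Parse translation units from a minimal TMX document into a list of
    {lang: segment} dicts -- enough to drive concordance searches."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        pair = {}
        for tuv in tu.iter("tuv"):
            lang = tuv.get(XML_NS + "lang") or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                pair[lang] = seg.text or ""
        units.append(pair)
    return units

def concordance(units, lang, query):
    """Bilingual concordance: return the units whose segment in `lang`
    contains the query string (case-insensitive); each hit carries its
    translations along for display."""
    q = query.lower()
    return [u for u in units if q in u.get(lang, "").lower()]
```

Export is the same structure in reverse; storing the pairs in the SQLite database rather than a list is the obvious next step.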
At some point I'll start storing some statistics. I've tracked reading speed, in words per hour, in previous tools and will add a timer to do so again here; it would be useful to see averages for specific languages, and how much they improve over time. The percentage of words searched, and how often individual words have been looked up for a given amount of text, might also be nice to track.
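The words-per-hour bookkeeping is just elapsed time and a word count. A sketch with an injectable clock (my design, chosen so the timer can be tested without waiting):

```python
import time

class ReadingTimer:
    """Track reading speed in words per hour for one session.

    The clock function defaults to time.monotonic but can be swapped out,
    which makes the arithmetic testable with a fake clock.
    """

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.start = clock()
        self.words_read = 0

    def add_words(self, n):
        """Credit n words, e.g. when a chapter or page is finished."""
        self.words_read += n

    def words_per_hour(self):
        elapsed = self.clock() - self.start
        return self.words_read / (elapsed / 3600) if elapsed > 0 else 0.0
```

Persisting one (language, date, words_per_hour) row per session in the existing SQLite database would be enough to chart per-language averages over time.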
My actual to-do list is somewhat longer, but this should give an idea of what I'm up to.