progress report on reading application
Posted: Sat Apr 08, 2017 3:42 pm
I've mentioned this application I've been developing once or twice, and thought I'd post a few notes on how it's going. I've been using it for a while and will post it on GitHub as open source once it's a little more polished. Its main focus is on supporting reading, since that represents most of my language study, though it will also be a front end to a number of other language-related tools. I'm quite aware that its functionality overlaps with a number of existing tools, but there were things I wanted to add for my own use.
It runs under NW.js (previously Node-Webkit), which allows developing desktop applications with JavaScript, HTML, and CSS; the internal database uses the WebSQL version of SQLite. There's also a Python interface of sorts (more later). To some extent this is a rewrite of a Firefox XUL extension I'd written for myself earlier.
The book formats supported for now are .epub and .txt; I have large numbers of books in both formats and can convert anything else that's not copy-protected to them. To do: run Calibre's conversion tool in the background to import other formats transparently, and just display the output (this would require that Calibre be installed). Ideally the book format should be more or less irrelevant. The tool can also open a URL and safely display its text content for reading.
The languages in the language menu are hard-coded for now, with English hard-coded as the user's language. To do: make this configurable by the user.
The user interface has changed a few times, but the main part of the window currently has separate tabs for the text of the current section or chapter of the book, vocabulary, sentences, and text analysis. The Text tab has a toolbar and a table-of-contents drop-down menu above the text display. The toolbar includes buttons for dictionary searches, chapter navigation, zooming the font size in or out, enabling a virtual keyboard, etc. The Vocabulary and Sentences tabs are automatically populated when a chapter is displayed: the Vocabulary tab shows a table of each word (token) found in the chapter, in order of frequency, and the Sentences tab shows a table of sentences, split with a hard-coded regular expression. Both have a filter-as-you-type search field that restricts the table to matching entries, so that you can see, for example, all sentences containing a given expression for which you want to make a flashcard. I currently have this set to fuzzy matches rather than exact matches; I haven't quite decided which is better. To do: add SRX (Segmentation Rules eXchange) support for better sentence segmentation; populate additional columns of the vocabulary and sentence tables with matches from the dictionary and translation memory databases, respectively (the latter would allow a parallel text display where possible); and display a mini-glossary for the selected sentence.
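The chapter tables above boil down to tokenizing, counting, and splitting. A rough Python sketch (the function names and the exact regular expressions are mine, not the tool's, and "fuzzy" here just means substring matching on query words, which may not be how the real filter works):

```python
import re
from collections import Counter

def chapter_tables(text):
    """Build the vocabulary and sentence tables for one chapter.

    A sketch: tokenize on word characters, count frequencies, and split
    sentences with a simple hard-coded regex (as the tool currently does).
    """
    tokens = re.findall(r"\w+", text.lower())
    vocabulary = Counter(tokens).most_common()   # (word, count), by frequency
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return vocabulary, sentences

def fuzzy_filter(sentences, query):
    """Filter-as-you-type: keep sentences that contain every query word as a
    substring -- a loose stand-in for the tool's fuzzy matching."""
    words = query.lower().split()
    return [s for s in sentences if all(w in s.lower() for w in words)]
```

The `most_common()` ordering gives exactly the frequency-sorted vocabulary table; the sentence regex is deliberately naive, which is why SRX support is on the to-do list.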
Dictionary search is currently done by highlighting a word or phrase and clicking a button. (To do: provide mouse-over definitions, with different colored highlighting for known words.) The default online dictionary is Glosbe; the tool extracts definitions and displays them in a more compact definition area. There are menu options to search Google Images (which is surprisingly effective for foreign-language words), WordReference, and Linguee instead, with those sites shown in a popup window for now, as the easiest approach; I may decide later to parse those results too and show all the definitions in the same format. To do: add support for many more online dictionaries, along with a way for users to add URL search templates without editing the source code; allow searching multiple dictionaries in parallel and collating the results; etc.
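User-added URL search templates could be as simple as a table of template strings with placeholders filled in per query. A minimal sketch; the {from}/{to}/{term} placeholder convention is my own, and the URLs are illustrative examples, not the sites' documented formats:

```python
from urllib.parse import quote

# Hypothetical templates -- illustrative URL shapes, not verified site APIs.
SEARCH_TEMPLATES = {
    "Glosbe":  "https://glosbe.com/{from}/{to}/{term}",
    "Linguee": "https://www.linguee.com/{from}-{to}/search?query={term}",
}

def build_search_url(site, term, lang_from, lang_to):
    """Expand a URL search template for a dictionary query, percent-encoding
    the search term so phrases survive the trip."""
    template = SEARCH_TEMPLATES[site]
    return (template
            .replace("{from}", lang_from)
            .replace("{to}", lang_to)
            .replace("{term}", quote(term)))
```

Letting users edit that table (from a settings file, say) is all "without editing the source code" would require.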
The tool has a simple internal dictionary or glossary database to which you can add a term and definition; there are options to import and export dictionaries using tab-delimited text files (a format that should work with Anki). When a search is done, results from the internal dictionary are shown if they exist; otherwise the tool goes out to an online dictionary, which is only slightly slower. To do: expand the internal dictionary to support term base (TBX) fields, and add TBX import and export options. Not an immediate priority; simple is fine for now.
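The tab-delimited round trip is straightforward; a sketch using Python's csv module with a tab delimiter, assuming plain term/definition pairs, one per line (the simple format Anki's importer accepts):

```python
import csv
import io

def export_glossary(entries):
    """Write (term, definition) pairs as tab-delimited text, one per line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerows(entries)
    return buf.getvalue()

def import_glossary(text):
    """Read tab-delimited term/definition pairs back into a list of tuples,
    skipping any malformed short lines."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [(row[0], row[1]) for row in reader if len(row) >= 2]
```

Using csv rather than naive `split("\t")` means definitions containing tabs or newlines survive the round trip via quoting.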
You can also, of course, enter a dictionary query from the keyboard, an option I use when I'm reading a hard-copy book and I'm just using this tool as a front end for dictionaries. Since my touch-typing is rudimentary outside a handful of languages, I've added an option to display a virtual keyboard attached to the search field. To do: create keyboard layouts for languages not already supported by the virtual-keyboard package. Esperanto is missing, for example.
The tool keeps track of what words you've looked up in the current section and can display them. To do: export just these words and their translations in a format that Anki can import; track how many times I've looked up the same word. I will add an option to review random flashcards, with examples automatically taken from the current text, and possibly with multiple choice options, but I am not planning to recreate an SRS tool. Anki is more than enough.
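The per-section bookkeeping could look something like this; the class and its fields are my invention for illustration, not the tool's actual schema:

```python
from collections import defaultdict

class LookupTracker:
    """Track which words were looked up in the current section, how often,
    and the definition found -- enough to drive an Anki export."""

    def __init__(self):
        self.counts = defaultdict(int)   # word -> number of lookups
        self.definitions = {}            # word -> last definition shown

    def record(self, word, definition):
        self.counts[word] += 1
        self.definitions[word] = definition

    def export_tab_delimited(self):
        """Emit just the looked-up words and definitions, one pair per line,
        in a form Anki can import."""
        return "\n".join(f"{w}\t{d}" for w, d in self.definitions.items())
```

The count field is what "track how many times I've looked up the same word" needs; words looked up repeatedly are obvious flashcard candidates.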
When a chapter is loaded, the status bar automatically displays some simple statistics: total words, number of unique words (types), number of sentences, and average sentence length. For other text analysis (collocation detection, part-of-speech annotation and syntax chunking, and possibly a WordNet and corpus interface where available), I'll be using the Python Natural Language Toolkit (NLTK) and possibly other Python libraries. I've sent text from this tool to NLTK, run a proof-of-concept test script in the background, and displayed the output in the Analysis tab, and I'm quite pleased with how smoothly the integration works. To do: add more interesting sample scripts that can be run from the tool's menu by default, and add options to create and edit Python scripts from within the tool. There are some Java packages I'd like to experiment with too (e.g., MALLET). These options will require that any other user also have Python and Java installed. I will probably also add an option for user scripts in JavaScript. It's an article of faith for me that tools need to be scriptable and customizable.
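The background-script integration amounts to piping chapter text to a script's stdin and capturing its stdout for the Analysis tab. In the app itself this is done from NW.js (e.g., via Node's child_process); here is the same round trip sketched from Python, with a hypothetical helper name:

```python
import subprocess
import sys

def run_analysis_script(script_path, text):
    """Feed chapter text to an external Python script on stdin and return
    its stdout, which the app would display in the Analysis tab.

    check=True raises if the script exits nonzero, so broken user scripts
    fail loudly instead of silently showing empty output.
    """
    result = subprocess.run(
        [sys.executable, script_path],
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

The script itself needs no special interface beyond reading stdin and printing, which keeps user scripts (NLTK or otherwise) trivially pluggable.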
There are Tools menu options for searching Verbix for verb infinitives and conjugations, getting a word's pronunciation from Forvo, getting a machine translation of selected text from Google Translate, searching the Leipzig corpora databases, etc. Thesaurus searches might be useful to add when WordNet is available for a language.
Although listening is not a high priority for me, I've been experimenting with the text-to-speech options available from NW.js (read: Chromium) for selected passages. It doesn't work as well as Firefox does with locally installed voices. Google TTS was working for a while through a Node.js package and then broke. I may look into AWS Polly, which I think has high-quality voices for 24 languages and a 1,500-character limit per request; other users would need to have their own AWS credentials set up locally. I could always fall back on eSpeak, I suppose, as I've done in the past.
The tool includes a translation memory database (bilingual corpus) to store parallel segments/sentences. Translation memory is a technology very familiar to me, and supporting it is a priority. I want to allow bilingual concordance searches against the local TM for reference purposes, to show how a word or expression is used in context, with translations. I'm currently retrieving sample sentence pairs online through the Glosbe TM API, and will be adding similar searches for Linguee, TAUS, Tatoeba, etc. I'll be able to import and export TMX files compatible with those produced by LF Aligner; I don't plan to add any internal text alignment capabilities, at least for now. I don't need much TM editing capability either, since this is not intended as a CAT tool for extensive translation work, even though it mirrors the typical database structure of a CAT tool; but it will have an option to enter user translations to add to the database. Another application of a TM, for self-testing, would be to let a user enter translations of sentences, either random or in order, and in either direction, and compare them to the "official" translations in the TM. I might add a secondary database to store user translations and their "corrections" for later searching while translating. I've also experimented in the past with deducing translations of words from a translation memory, and word alignment to help populate a glossary would be another potential TM application.
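TMX is XML underneath, so a minimal importer plus a concordance search over the resulting pairs can be sketched with the standard library. This handles only the bare tu/tuv/seg structure; real LF Aligner output carries a header and metadata that this ignores:

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"

def import_tmx(tmx_text):
    """Parse translation units from a minimal TMX document into a list of
    {lang: segment} dicts -- enough to drive concordance searches."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        pair = {}
        for tuv in tu.iter("tuv"):
            lang = tuv.get(XML_NS + "lang") or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                pair[lang] = seg.text or ""
        units.append(pair)
    return units

def concordance(units, lang, query):
    """Bilingual concordance: return the units whose segment in `lang`
    contains the query string (case-insensitive); each hit carries its
    translations along for display."""
    q = query.lower()
    return [u for u in units if q in u.get(lang, "").lower()]
```

Export is the same structure in reverse; storing the pairs in the SQLite database rather than a list is the obvious next step.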
At some point I'll start storing some statistics. I've tracked reading speed, in words per hour, in previous tools and will add a timer to do so again here; it would be useful to see averages for specific languages, and how much they improve over time. The percentage of words searched, and how often individual words have been looked up for a given amount of text, might also be nice to track.
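The words-per-hour bookkeeping is just elapsed time and a word count. A sketch with an injectable clock (my design, chosen so the timer can be tested without waiting):

```python
import time

class ReadingTimer:
    """Track reading speed in words per hour for one session.

    The clock function defaults to time.monotonic but can be swapped out,
    which makes the arithmetic testable with a fake clock.
    """

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.start = clock()
        self.words_read = 0

    def add_words(self, n):
        """Credit n words, e.g. when a chapter or page is finished."""
        self.words_read += n

    def words_per_hour(self):
        elapsed = self.clock() - self.start
        return self.words_read / (elapsed / 3600) if elapsed > 0 else 0.0
```

Persisting one (language, date, words_per_hour) row per session in the existing SQLite database would be enough to chart per-language averages over time.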
My actual to-do list is somewhat longer, but this should give an idea of what I'm up to.