paper on use of FSTs to expand coverage of electronic dictionaries

mcthulhu
Orange Belt
Posts: 228
Joined: Sun Feb 26, 2017 4:01 pm
Languages: English (native); strong reading skills - Russian, Spanish, French, Italian, German, Serbo-Croatian, Macedonian, Bulgarian, Slovene, Farsi; fair reading skills - Polish, Czech, Dutch, Esperanto, Portuguese; beginner/rusty - Swedish, Norwegian, Danish
x 590

paper on use of FSTs to expand coverage of electronic dictionaries

Postby mcthulhu » Tue Aug 08, 2017 3:31 am

See http://www.ep.liu.se/ecp/085/010/ecp1385010.pdf, Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries. The abstract notes that for morphologically complex languages, as little as 10% of the wordforms in running text might actually be lemmas that could be looked up directly in a bilingual dictionary (the authors later give a figure of 7.9% for North Saami), and that using an FST dramatically expands the dictionary's coverage. The case studies described are for North Saami and South Saami, which sound like very complex languages (nouns are described as having about 80 different wordforms, including possessive suffixes). The authors got up to over 90% dictionary coverage by using FSTs, a dramatic improvement over 10%. The paper also describes the applications supporting this.

Combining FSTs and dictionaries seems like the right way to approach a foreign language reading environment, and would make it a lot easier for a language learner to use an electronic dictionary. I hope to explore this further.
3 x

User avatar
tommus
Blue Belt
Posts: 957
Joined: Sat Jul 04, 2015 3:59 pm
Location: Kingston, ON, Canada
Languages: English (N), French (B2), Dutch (B2)
x 1937

Re: paper on use of FSTs to expand coverage of electronic dictionaries

Postby tommus » Tue Aug 08, 2017 11:03 am

Very interesting article. I am not very familiar with morphologically complex languages, and not at all familiar with the Saami languages. But I am a bit familiar with the recently-created Atlas language being discussed here on this Forum. It uses suffixes, prefixes and even interfixes to produce numerous versions of a root word.

As I understand the article, the authors are promoting analytical algorithms to parse complex words to derive their meanings. Is that correct? Would a different approach not be easier: generating all the possible words in advance and making that the dictionary? I know that such an approach would produce a dictionary of more than a million words, but with today's technology that is not a problem. So why not generate all the words in advance and then simply look them up when you need to? That is the approach I am using with Atlas. It is easy and fast. It does not produce the actual translations of the complex words; rather, it produces the translation of the root word or compound root words, plus the meaning of the suffixes, prefixes and interfixes.
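
To make the idea concrete, here is a rough Python sketch of the generate-everything-in-advance approach. The roots and affixes below are invented placeholders, not real Atlas vocabulary:

Code: Select all

# Sketch: generate a full-form lexicon up front, then do plain lookups.
# Roots and affixes are invented placeholders, not real Atlas data.
from itertools import product

roots = {"kan": "water", "mel": "house"}
prefixes = {"": "", "re-": "again"}
suffixes = {"": "", "-o": "plural", "-im": "diminutive"}

full_form_dictionary = {}
for (pre, pre_gloss), (root, root_gloss), (suf, suf_gloss) in product(
        prefixes.items(), roots.items(), suffixes.items()):
    word = pre.rstrip("-") + root + suf.lstrip("-")
    gloss = " + ".join(g for g in (pre_gloss, root_gloss, suf_gloss) if g)
    full_form_dictionary[word] = gloss

print(full_form_dictionary["rekanim"])   # -> again + water + diminutive

Once the table is built, each lookup is a single dictionary access, which is exactly the "easy and fast" part.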
2 x
Dutch: 01 September -> 31 December 2020
Watch 1000 Dutch TV Series Videos : 40 / 1000

mcthulhu
Orange Belt
Posts: 228
Joined: Sun Feb 26, 2017 4:01 pm
Languages: English (native); strong reading skills - Russian, Spanish, French, Italian, German, Serbo-Croatian, Macedonian, Bulgarian, Slovene, Farsi; fair reading skills - Polish, Czech, Dutch, Esperanto, Portuguese; beginner/rusty - Swedish, Norwegian, Danish
x 590

Re: paper on use of FSTs to expand coverage of electronic dictionaries

Postby mcthulhu » Wed Aug 09, 2017 3:06 am

I think you're asking the question "Why Compute When Storage Is Cheap?," the title of one of the lecture slides in http://demo.clab.cs.cmu.edu/fa2015-1171 ... hology.pdf, about the use of FSTs in natural language processing. The slides might help to answer your question, but the answer may also depend on what you're doing and what the requirements of the task are.

"analytical algorithms to parse complex words to derive their meanings" - no, not really. The FSTs described for Saami are data structures used to parse inflected words to get back to their lemma (headword) forms, and then the corresponding dictionaries are used to look up the meanings. FSTs encode words in a very compact structure, sort of like a tree, where words with common substrings share parts of their paths through the tree. As an example, one German FST is described as containing "more than 830000 German full forms (with over 6.5 million different analyses) in only 1.2 MB." That many word forms and analyses would probably take up far more space as a lookup table. That example was taken from http://www.lrec-conf.org/proceedings/lr ... df/567.pdf, another paper which discusses the conversion of a "fuill-form lexicon" (the kind of data structure you are describing) for Arabic into an FST. FSTs are not only very compact, but also blazingly fast. Those two characteristics make them ideal for incorporation into larger language processing systems (morphological analysis tends to be part of a pipeline) where you might have large volumes of data and need to be able to support a high rate of throughput. That's why FSTs are pretty much the standard approach for morphology, and are also used in other NLP applications like real-time speech processing. If you''re only doing occasional single lookups for a single language, efficiency is irrelevant and a lookup table or database would probably work fine (assuming that you're not storing your million-entry databases on a platform like a smartphone, for instance - sometimes space does matter).

It's also possible that the generated lookup table approach might be easier with an artificial language with a regular structure and a predefined vocabulary than with a real language with numerous irregularities and exceptions, phonological variations when morphemes are put together, etc. The paper I mentioned in the first paragraph here gives some examples of complex language phenomena that FSTs can handle.
2 x

User avatar
Chung
Blue Belt
Posts: 531
Joined: Mon Jul 06, 2015 9:39 pm
Languages: SPEAKS: English*, French
STUDIES: Hungarian, Italian
OTHER: Czech, German, Polish, Slovak, Ukrainian
STUDIED: Azeri, BCMS/SC, Estonian, Finnish, Korean, Latin, Northern Saami, Russian, Slovenian, Turkish
DABBLED: Bashkir, Chuvash, Crimean Tatar, Inari Saami, Kazakh, Kyrgyz, Latvian, Lithuanian, Meadow Mari, Mongolian, Romanian, Tatar, Turkmen, Tuvan, Uzbek
x 2315

Re: paper on use of FSTs to expand coverage of electronic dictionaries

Postby Chung » Wed Aug 09, 2017 3:52 am

mcthulhu wrote:See http://www.ep.liu.se/ecp/085/010/ecp1385010.pdf, Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries. The abstract notes that for morphologically complex languages, as little as 10% of the wordforms in running text might actually be lemmas that could be looked up directly in a bilingual dictionary (the authors later give a figure of 7.9% for North Saami), and that using an FST dramatically expands the dictionary's coverage. The case studies described are for North Saami and South Saami, which sound like very complex languages (nouns are described as having about 80 different wordforms, including possessive suffixes). The authors got up to over 90% dictionary coverage by using FSTs, a dramatic improvement over 10%. The paper also describes the applications supporting this.

Combining FSTs and dictionaries seems like the right way to approach a foreign language reading environment, and would make it a lot easier for a language learner to use an electronic dictionary. I hope to explore this further.


Interesting.

I guess for the authors, who I suspect are into computational linguistics, the pride of place accorded to developing an electronic dictionary is understandable. As someone who likes hard copies of things, though, I bristle a bit at the phrase "An effective digital dictionary is a necessity for language learners, but it is also important for Saami speakers" in the paper's introduction. An effective dictionary of any kind is a necessity for language learners (and for speakers of Saamic languages). Not everyone uses a tablet, smartphone or laptop with an internet connection to learn a language or look up words. Dare I say that otherwise language learning starts to edge close to being a "privilege".

Anyway, the best dictionary for a Saamic language that I've seen is at Giellatekno, although for now it alternates only between Northern Saami on one side and Norwegian or Finnish on the other. What I actually find most valuable is its morphological analyzer, which explicitly shows the declension of a nominal (including forms showing number, case and possession in one go) and the conjugation of a verb (including verbal nouns). As examples, here are the analyzer's full declension for guolli "fish" and full conjugation for boahtit "to come".

Indeed, the Saamic languages are inflectional nightmares (Southern Saami is a little easier than the others, though, since it doesn't use consonant gradation; that means one fewer set of alternations to worry about). The rules that one must grasp to inflect are akin to wading through tax accounting. It's not just the volume of endings available for marking number, case, tense or mood; the stems that take on these endings are often subject to vowel alternations and/or consonant gradation depending on whether the stem has an odd or even number of syllables (let's not get into the "contracting stems"). Somehow I don't think that comparing it to Atlas does Saamic justice. Hell, I had no idea inflection could become this intricate until I delved into Inari Saami and Northern Saami. My background in Estonian and Finnish got me used to some of the principles of fusion and agglutination as well as consonant gradation, but I was left bewildered by how deeply Saamic applies them (maybe a background in Welsh might have slightly softened the blow, because its initial consonant mutations are very vaguely like Saamic consonant gradation to me).

The best that a paper dictionary of a Saamic language could do without running out of space is to indicate the stem forms plus a few inflected forms beside the headwords. I don't think that it's necessary to show every form. Going back to guolli "fish", a paper dictionary could show at minimum the forms for genitive singular (i.e. guole/guoli) and genitive plural (i.e. guliid) in addition to the citation/dictionary form of nominative singular (i.e. guolli), since those genitive forms show the consonant gradation and vowel alternations used in the other inflected forms. The possessive endings are attached to forms marked for case and number in a "regular" way (that way becomes predictable or regular with practice :lol:)
3 x

User avatar
tommus
Blue Belt
Posts: 957
Joined: Sat Jul 04, 2015 3:59 pm
Location: Kingston, ON, Canada
Languages: English (N), French (B2), Dutch (B2)
x 1937

Re: paper on use of FSTs to expand coverage of electronic dictionaries

Postby tommus » Wed Aug 09, 2017 6:24 am

With my target languages, I obviously live in a very simplified world compared with what Chung describes as the "inflectional nightmares" of Saami. I am now going to stop complaining about the Dutch and the Germans running a few root words together.

Having never seen a bilingual Saami-to-other-language dictionary or grammar, and hearing about the complexities, I shudder at just how difficult it could be. One indication is that Verbix does not have Saami conjugations; rather, it has a very brief wiki: Verbix - Saami North. And here is a Sami - English vocabulary example:

addalit to give to someone else; to give away
addálas generous
addi giver
addilit to give quickly
addin giving
addit to give
addit [+A/G+] [+ILL] to give s.t. to s.o.
addit ándagassii to forgive; to pardon

Source

I can perhaps understand why Verbix doesn't yet have conjugation for North Saami.

So my question (which relates to what I am working on in Atlas) is: how does one go about doing machine translation without first generating all the possible (or even all the likely) forms of words and manually translating all of them? For a pop-up dictionary, for example, an application could show and translate the root word(s) and characterise all the prefixes, suffixes and interfixes, but that is going to impose a huge challenge on a language learner. I guess a rather comprehensive grammar course would be an absolute necessity.
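
To show what I mean by the pop-up, here is a naive Python affix-stripping sketch. The affixes and roots are invented placeholders, and a real language would need a proper morphological analyzer to handle stem changes:

Code: Select all

# Naive affix stripping for a pop-up gloss: peel off a known prefix
# and suffix, then gloss the remaining root. Invented placeholder data.
prefixes = {"re": "again"}
suffixes = {"im": "diminutive", "o": "plural"}
roots = {"kan": "water", "mel": "house"}

def gloss(word):
    pieces = []
    for pre, meaning in prefixes.items():
        if word.startswith(pre):
            word = word[len(pre):]
            pieces.append(("prefix", pre, meaning))
            break
    for suf, meaning in suffixes.items():
        if word.endswith(suf) and word[:-len(suf)] in roots:
            root = word[:-len(suf)]
            pieces.append(("root", root, roots[root]))
            pieces.append(("suffix", suf, meaning))
            return pieces
    if word in roots:
        pieces.append(("root", word, roots[word]))
    return pieces

print(gloss("rekanim"))
# [('prefix', 're', 'again'), ('root', 'kan', 'water'),
#  ('suffix', 'im', 'diminutive')]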
0 x
Dutch: 01 September -> 31 December 2020
Watch 1000 Dutch TV Series Videos : 40 / 1000

User avatar
tommus
Blue Belt
Posts: 957
Joined: Sat Jul 04, 2015 3:59 pm
Location: Kingston, ON, Canada
Languages: English (N), French (B2), Dutch (B2)
x 1937

Re: paper on use of FSTs to expand coverage of electronic dictionaries

Postby tommus » Wed Aug 09, 2017 6:50 am

mcthulhu wrote:I think you're asking the question "Why Compute When Storage Is Cheap?," ... The slides might help to answer your question


Those slides are excellent. But I wonder just how simple is "simple": quote - "Finite state methods provide a simple and powerful means of generating and analyzing words (as well as the phonological alternations that accompany word formation/inflection)". At first glance, FSTs seem to have a steep learning curve. But I am encouraged by another statement in the slides: "One brilliant aspect of using FSTs for morphology: the same code can handle both analysis and generation." I'll have a look at some of the tools listed in the slides.
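
If I understand that statement, the point is that an FST encodes a relation between surface forms and analyses, and the relation can be read in either direction. Something like this toy Python stand-in (a real FST stores the relation far more compactly and walks it with the same machinery both ways):

Code: Select all

# Toy stand-in for FST bidirectionality: one set of (surface, analysis)
# pairs usable in both directions.
pairs = [
    ("walks",  "walk+V+3Sg"),
    ("walked", "walk+V+Past"),
]

def analyze(surface):
    return [a for s, a in pairs if s == surface]

def generate(analysis):
    return [s for s, a in pairs if a == analysis]

print(analyze("walked"))        # ['walk+V+Past']
print(generate("walk+V+3Sg"))   # ['walks']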

A thought that I have been expressing about Atlas is that, as much as possible, the complexities of grammar should be avoided in an invented language, minimising the difficulty that people and computers have in understanding and using it. But of course an invented language must have gateways into natural languages, so I can clearly understand the power of FSTs.
0 x
Dutch: 01 September -> 31 December 2020
Watch 1000 Dutch TV Series Videos : 40 / 1000

mcthulhu
Orange Belt
Posts: 228
Joined: Sun Feb 26, 2017 4:01 pm
Languages: English (native); strong reading skills - Russian, Spanish, French, Italian, German, Serbo-Croatian, Macedonian, Bulgarian, Slovene, Farsi; fair reading skills - Polish, Czech, Dutch, Esperanto, Portuguese; beginner/rusty - Swedish, Norwegian, Danish
x 590

Re: paper on use of FSTs to expand coverage of electronic dictionaries

Postby mcthulhu » Wed Aug 09, 2017 2:21 pm

Using an FST that someone else has already compiled is definitely a lot simpler than writing your own, but writing one is something I want to look into as well at some point. The XFST/SFST/HFST etc. languages are supersets of regular expressions, which you're probably familiar with. Of course, knowing how to describe the linguistic patterns you're matching is a prerequisite...
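
For a taste, an xfst-style replace rule looks like a -> b || l _ r, meaning "rewrite a as b when it occurs between l and r". Here is a rough Python approximation of one made-up rule of that shape, just to show the idea; a real toolkit compiles such rules into transducers and composes many of them:

Code: Select all

import re

# Approximation of the made-up rule  i -> e || _ d
# ("rewrite i as e when immediately followed by d").
def apply_rule(text):
    # the lookahead keeps the right context 'd' in place
    return re.sub(r"i(?=d)", "e", text)

print(apply_rule("bidding"))   # -> bedding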

The XFST book is available online, and I think that there's a pretty good tutorial for SFST. HFST is well documented too.

There's a Web interface to Xerox FST tools for nine languages at https://open.xerox.com/Services/fst-nlp-tools, including German, but unfortunately not Dutch, sorry. It's nice to play with, anyway.
2 x

User avatar
Iversen
Black Belt - 4th Dan
Posts: 4787
Joined: Sun Jul 19, 2015 7:36 pm
Location: Denmark
Languages: Monolingual travels in Danish, English, German, Dutch, Swedish, French, Portuguese, Spanish, Catalan, Italian, Romanian and (part time) Esperanto
Ahem, not yet: Norwegian, Afrikaans, Platt, Scots, Russian, Serbian, Bulgarian, Albanian, Greek, Latin, Irish, Indonesian and a few more...
Language Log: viewtopic.php?f=15&t=1027
x 15034

Re: paper on use of FSTs to expand coverage of electronic dictionaries

Postby Iversen » Wed Aug 09, 2017 9:25 pm

tommus wrote:(...) why not generate all the words in advance and then simply look them up when you need to?


I have actually proposed something similar as a way to improve Google Translate: take a thick dictionary (or seven) per language, run all the words in it (/them) through a morphological machine like Verbix (for verbs) to get ALL forms of ALL the words safely into the database, add the relevant morphological markers and a bit of word-bound syntax (like the choice of preposition after English verbs, or case in other languages) .. and then you avoid a lot of cases where words are ignored because they can't be recognized by the system, or are presented in totally wrong morphological forms.

You could first assume that all words are regular, generate their forms, and then run all those forms through a mechanism where irregularities overwrite the falsely formed regular forms.
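
In Python, that overwrite mechanism could be as simple as this sketch (English placeholders, not a real morphological engine):

Code: Select all

# Generate every paradigm as if the word were regular, then let a table
# of known irregular forms overwrite the falsely formed ones.
def regular_forms(verb):
    return {"inf": verb, "3sg": verb + "s", "past": verb + "ed"}

irregular = {
    "go": {"3sg": "goes", "past": "went"},
}

def paradigm(verb):
    forms = regular_forms(verb)
    forms.update(irregular.get(verb, {}))   # irregulars overwrite
    return forms

print(paradigm("walk"))   # {'inf': 'walk', '3sg': 'walks', 'past': 'walked'}
print(paradigm("go"))     # {'inf': 'go', '3sg': 'goes', 'past': 'went'}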
2 x

User avatar
tommus
Blue Belt
Posts: 957
Joined: Sat Jul 04, 2015 3:59 pm
Location: Kingston, ON, Canada
Languages: English (N), French (B2), Dutch (B2)
x 1937

Re: paper on use of FSTs to expand coverage of electronic dictionaries

Postby tommus » Wed Aug 09, 2017 11:38 pm

Iversen wrote:You could first assume that all words are regular, generate their forms, and then run all those forms through a mechanism where irregularities overwrite the falsely formed regular forms.

And Google Translate could use its user feedback mechanism to produce and/or confirm the translation that fits the forms and to correct or improve on translations that are not quite right.
0 x
Dutch: 01 September -> 31 December 2020
Watch 1000 Dutch TV Series Videos : 40 / 1000

