Lemmatization tools for Goldlist

Obogrew
White Belt
Posts: 34
Joined: Fri Jan 29, 2016 4:00 am
Location: Spain
Languages: Russian(N), German(C2), English(C1), Hebrew(B1), Serbian(A2), French(A1), Turkish(A1), Spanish(B1)
x 21

Postby Obogrew » Sat Jan 30, 2016 6:42 am

I've recently discovered GLM, Anki, HTLAL etc. and started trying all the methods to find out which one works best.
I came up with the idea of creating a word list based on the text that I am going to read.

For example, I would like to read a book that I have in digital format. I don't want to be distracted while reading by having to look up new words in a dictionary, mark unknown words and read the text again. Therefore I would like to create a list of words in their base form and then divide them into two groups: "known words" and "unknown words". Then I will "goldlist" or "SRS" all the unknown words for 1-2 months and, when ready, read the book.

For the second book it will be easier: using the dictionary from the first book, I can easily filter out all the words I have already processed.
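The known/unknown split described above can be sketched with plain set operations. This is only a minimal illustration: the function names `tokenize` and `unknown_words` are made up for this example, the tokenizer is deliberately crude, and the `lemmatize` argument defaults to doing nothing, so a real lemmatizer would have to be plugged in.

```python
import re

def tokenize(text):
    """Lowercase the text and return its word tokens (letters only, Unicode-aware)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def unknown_words(text, known, lemmatize=lambda word: word):
    """Return the sorted lemmas found in `text` that are not in the `known` set."""
    lemmas = {lemmatize(token) for token in tokenize(text)}
    return sorted(lemmas - known)

# Hypothetical starting vocabulary and a one-sentence "book".
known = {"i", "my", "and", "the", "wife"}
first_book = "I look after my children and my wife."

todo = unknown_words(first_book, known)
print(todo)  # words to goldlist / SRS before reading

# Once studied, the words join the known set and are filtered out of the next book.
known.update(todo)
second_todo = unknown_words("My children look after the dog.", known)
print(second_todo)
```

Keeping `known` as a set that grows after every book is what makes the second book cheaper than the first.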

For that reason I have to convert every single word into its base form. I found out that the base form is called a "lemma" and the conversion process is called "lemmatization".

The question is: what is the easiest way to do lemmatization? I was able to find a service for Romance languages (google: meaningcloud), but I would have to develop a lemmatization tool using their APIs, which would cost time and effort. And I am afraid that I am trying to reinvent the wheel.

My expectations of a lemmatizing tool are:

Out of the sentence "I will be looking carefully after my children and take it over from my wife that she takes advantage"

it creates a wordlist = {I, to look after, careful, child, to take over, wife}. Recognizing 'to take advantage' as well would be a real advantage.

and in German: "Ich hatte vor dir in Stille etwas zu sagen und mach' mein Türchen schnell zu."

wordlist={ich, vorhaben, Stille, etwas, sagen, zumachen, Tür, schnell}

Questions:
1. Are my expectations of lemmatization tools and libraries too optimistic, or are there tools with that much intelligence?
2. Am I trying to reinvent the wheel? Is there any other way to list all the words?
1 x

Hork
White Belt
Posts: 11
Joined: Sun Jan 10, 2016 6:57 pm
x 8

Re: Lemmatization tools for Goldlist

Postby Hork » Mon Feb 01, 2016 10:51 am

Back in the day I too was psyched about the idea of counting AND lemmatizing words in my corpora. In reality the existing programs are not able to spot either phrasal verbs (English) or separable verbs (German) and join them into one headword even if texts are POS-tagged. It seems easier to put people on Mars than to do just that.
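To make the separable-verb problem concrete, here is a naive sketch of the kind of rule a tool would need: if a clause ends in a separable prefix, glue it onto the first known finite verb. Everything here is an assumption for illustration only. The two lookup tables are tiny hand-made samples, `join_separable` is a made-up name, and the rule is not taken from any existing program.

```python
# Toy data: both tables are hypothetical and far too small for real use.
SEPARABLE_PREFIXES = {"ab", "an", "auf", "aus", "ein", "mit", "vor", "zu"}
VERB_LEMMAS = {
    "mach": "machen", "mache": "machen", "macht": "machen",
    "kündige": "kündigen", "kündigt": "kündigen",
}

def join_separable(clause_tokens):
    """Return prefix + verb lemma if the clause ends in a separable prefix, else None."""
    tokens = [token.lower().strip("'.,") for token in clause_tokens]
    if not tokens or tokens[-1] not in SEPARABLE_PREFIXES:
        return None
    for token in tokens[:-1]:
        if token in VERB_LEMMAS:
            # Attach the clause-final prefix to the first known verb form.
            return tokens[-1] + VERB_LEMMAS[token]
    return None

print(join_separable("mach' mein Türchen schnell zu".split()))  # → zumachen
print(join_separable("Er kündigt den Termin an".split()))       # → ankündigen
```

The rule only looks at the last word of a clause, so it finds nothing when the prefix sits elsewhere (as in "hatte vor ... zu sagen"), and it cannot tell a prefix from a preposition that happens to share its spelling. That is roughly why this is hard to get right.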
1 x

Bakunin
Orange Belt
Posts: 245
Joined: Sun Jul 19, 2015 5:11 pm
Location: Zürich
Languages: German (N), English, Thai, Swiss-German (adv.), Khmer, Isaan (studying); dormant: French, Polish
x 660

Re: Lemmatization tools for Goldlist

Postby Bakunin » Mon Feb 01, 2016 12:10 pm

I can't comment on IE languages - and Hork might be right alluding to space travel - but for Turkish, which I see is one of your target languages, there's TRmorph, a relatively good morphological analyzer. While you're at it, check out TS Corpus as well.

I built my own corpus of curated texts for Thai as an intermediate student, and I'm in the process of doing the same for Khmer. Both languages are isolating which makes lemmatization trivial 8-)
0 x

Hork
White Belt
Posts: 11
Joined: Sun Jan 10, 2016 6:57 pm
x 8

Re: Lemmatization tools for Goldlist

Postby Hork » Mon Feb 01, 2016 3:22 pm

I stand corrected. Das Institut für Deutsche Sprache http://www.ids-mannheim.de/kl/projekte/methoden/derewo.html seems to have managed to accurately count separable verbs in their 2009 lemma frequency list (but oddly not in their newer versions). For example, they listed "ankündigen" and "hinzufügen" as more frequent than "kündigen" and "fügen" which holds true for nonfiction (their main corpus source). So if they had simply lemmatized a word list these verbs would have been listed the other way around.
0 x

Obogrew
White Belt
Posts: 34
Joined: Fri Jan 29, 2016 4:00 am
Location: Spain
Languages: Russian(N), German(C2), English(C1), Hebrew(B1), Serbian(A2), French(A1), Turkish(A1), Spanish(B1)
x 21

Re: Lemmatization tools for Goldlist

Postby Obogrew » Sun Feb 07, 2016 8:38 am

Thanks.
Looking through the IDS page I could not find any software. What I would need is software or a library to build a tool. Currently I am not focused on Turkish, but I will probably use the link later.

Any other ideas on how to prepare for reading a book so as not to stumble over unknown words? Or how to get a distilled list of the unknown words I will encounter in a particular text? I am definitely not the first, or even the tenth, to come up with this idea.

Or nothing exists that is better than adapted text with a dictionary at the end?
0 x

Doitsujin
Green Belt
Posts: 402
Joined: Sat Jul 18, 2015 6:21 pm
Languages: German (N)
x 801

Re: Lemmatization tools for Goldlist

Postby Doitsujin » Sun Feb 07, 2016 11:00 am

Obogrew wrote:Or nothing exists that is better than adapted text with a dictionary at the end?

NLTK comes with a couple of lemmatizers and stemmers, but the results are pretty much useless for most languages other than English.

IMHO, you'll usually get much better results with the pattern Python library, which supports Dutch, English, Spanish, German, French and Italian.

You also might find this website with inflection lists for German, English, Spanish, French, Italian, Portuguese and Russian useful.
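Inflection lists like these can be turned into a simple lookup-table lemmatizer. The sketch below assumes a plain-text layout of one `lemma: form1, form2, ...` entry per line. That layout is an assumption made for this example, not the format of any particular site, and `build_lookup` and `lemmatize` are hypothetical helper names.

```python
def build_lookup(lines):
    """Map each inflected form to its lemma (assumed format: 'lemma: form1, form2')."""
    lookup = {}
    for line in lines:
        if ":" not in line:
            continue  # skip lines that do not match the assumed format
        lemma, forms = line.split(":", 1)
        lemma = lemma.strip().lower()
        lookup[lemma] = lemma
        for form in forms.split(","):
            form = form.strip().lower()
            if form:
                # For an ambiguous form, the first lemma seen wins.
                lookup.setdefault(form, lemma)
    return lookup

def lemmatize(tokens, lookup):
    """Replace each token by its lemma; unknown tokens are returned lowercased."""
    return [lookup.get(token.lower(), token.lower()) for token in tokens]

# Hypothetical sample entries in the assumed format.
sample = [
    "child: children, child's",
    "take: takes, took, taken, taking",
    "look: looks, looked, looking",
]
lookup = build_lookup(sample)
print(lemmatize("She takes the children".split(), lookup))
# → ['she', 'take', 'the', 'child']
```

A table like this handles single words only. Phrasal verbs and separable verbs would still need extra logic on top.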
3 x

