I've recently discovered GLM, Anki, HTLAL etc. and started trying all the methods to find out
which one works best.
I came up with the idea of creating a word list based on the text I am going to read.
For example, say I would like to read a book and I have it in digital format. I don't want to be distracted while reading by having to look up new words in a dictionary, mark unknown words and read the text again. Therefore I would like to create a list of all words in their base form, then divide them into two groups: "known words" and "unknown words". Then I will "goldlist" or "SRS" all the unknown words for 1-2 months and, when ready, read the book.
For the second book it will be easier: using the dictionary from the first book, I can easily filter out all the words I have already processed.
For that reason I have to convert every single word into its base form. I found out that the base form is
called a "lemma" and the conversion process is called "lemmatization".
The question is: what is the easiest way to do lemmatization? I was able to find a service
for Romance languages (google: meaningcloud), but I would have to develop a lemmatization
tool using their APIs, which will cost time and effort. And I am afraid that I am trying to reinvent the wheel.
My expectations of a lemmatizing tool are as follows.
Out of the sentence "I will be looking carefully after my children and take it over
from my wife that she takes advantage"
it creates wordlist = {I, to look after, careful, child, to take over, wife}; also getting
'to take advantage' would be a real advantage.
And in German, out of "Ich hatte vor dir in Stille etwas zu sagen und mach' mein Türchen schnell
zu."
it creates wordlist = {ich, vorhaben, Stille, etwas, sagen, zumachen, Tür, schnell}
Questions:
1. Are my expectations of lemmatization tools and libraries too optimistic, or are there tools with that much intelligence?
2. Am I trying to reinvent the wheel? Is there any other way to list all the words?
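For what it's worth, the known/unknown split and the "filter the second book with the dictionary from the first" step described above can be sketched in a few lines of Python once the lemmas exist. This is only an illustration under assumptions: the function name `split_known_unknown` and all the sample word lists are made up, and the lemma lists are typed in by hand, which is exactly the part a real lemmatizer would have to supply.

```python
def split_known_unknown(lemmas, known):
    """Split lemmas into (known, unknown), dropping repeats, keeping first-seen order."""
    seen = set()
    known_in_text, unknown_in_text = [], []
    for lemma in lemmas:
        if lemma in seen:
            continue  # each lemma is listed only once
        seen.add(lemma)
        if lemma in known:
            known_in_text.append(lemma)
        else:
            unknown_in_text.append(lemma)
    return known_in_text, unknown_in_text


# Hypothetical, hand-made lemma list for the first book.
book1_lemmas = ["ich", "vorhaben", "Stille", "etwas", "sagen", "ich", "zumachen"]
known = {"ich", "etwas", "sagen"}

_, to_study = split_known_unknown(book1_lemmas, known)

# After 1-2 months of goldlisting, the studied words count as processed.
known |= set(to_study)

# The second book is filtered against the enlarged dictionary.
book2_lemmas = ["ich", "Tür", "schnell", "zumachen"]
_, to_study_book2 = split_known_unknown(book2_lemmas, known)

print(to_study)        # words to study before book 1
print(to_study_book2)  # only the words that are new in book 2
```

The bookkeeping itself is the easy part; the open question in this thread is where the lemma lists come from.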
Lemmatization tools for Goldlist
- Obogrew
- White Belt
- Posts: 34
- Joined: Fri Jan 29, 2016 4:00 am
- Location: Spain
- Languages: Russian(N), German(C2), English(C1), Hebrew(B1), Serbian(A2), French(A1), Turkish(A1), Spanish(B1)
- x 21
-
- White Belt
- Posts: 11
- Joined: Sun Jan 10, 2016 6:57 pm
- x 8
Re: Lemmatization tools for Goldlist
Back in the day I too was psyched about the idea of counting AND lemmatizing words in my corpora. In reality the existing programs are not able to spot either phrasal verbs (English) or separable verbs (German) and join them into one headword even if texts are POS-tagged. It seems easier to put people on Mars than to do just that.
1 x
- Bakunin
- Orange Belt
- Posts: 245
- Joined: Sun Jul 19, 2015 5:11 pm
- Location: Zürich
- Languages: German (N), English, Thai, Swiss-German (adv.), Khmer, Isaan (studying); dormant: French, Polish
- x 660
Re: Lemmatization tools for Goldlist
I can't comment on IE languages - and Hork might be right alluding to space travel - but for Turkish, which I see is one of your target languages, there's TRmorph, a relatively good morphological analyzer. While you're at it, check out TS Corpus as well.
I built my own corpus of curated texts for Thai as an intermediate student, and I'm in the process of doing the same for Khmer. Both languages are isolating, which makes lemmatization trivial.
0 x
-
- White Belt
- Posts: 11
- Joined: Sun Jan 10, 2016 6:57 pm
- x 8
Re: Lemmatization tools for Goldlist
I stand corrected. The Institut für Deutsche Sprache (http://www.ids-mannheim.de/kl/projekte/methoden/derewo.html) seems to have managed to accurately count separable verbs in their 2009 lemma frequency list (but, oddly, not in their newer versions). For example, they listed "ankündigen" and "hinzufügen" as more frequent than "kündigen" and "fügen", which holds true for nonfiction (their main corpus source). If they had simply lemmatized a word list, these verbs would have been listed the other way around.
0 x
- Obogrew
- White Belt
- Posts: 34
- Joined: Fri Jan 29, 2016 4:00 am
- Location: Spain
- Languages: Russian(N), German(C2), English(C1), Hebrew(B1), Serbian(A2), French(A1), Turkish(A1), Spanish(B1)
- x 21
Re: Lemmatization tools for Goldlist
Thanks.
Looking through the IDS page I could not find any software. What I would need is software or a library to build a tool. I am not focused on Turkish at the moment, but I will probably use the link later.
Any other ideas on how to prepare for reading a book so as not to stumble over unknown words? Or how to get a distilled list of the unknown words I will encounter in a particular text? I am definitely not the first, or even the tenth, to come up with this idea.
Or does nothing exist that is better than an adapted text with a dictionary at the end?
0 x
-
- Green Belt
- Posts: 402
- Joined: Sat Jul 18, 2015 6:21 pm
- Languages: German (N)
- x 801
Re: Lemmatization tools for Goldlist
Obogrew wrote:Or nothing exists that is better than adapted text with a dictionary at the end?
NLTK comes with a couple of lemmatizers and stemmers, but the results are pretty much useless for most languages other than English.
IMHO, you'll usually get much better results with the pattern Python library, which supports Dutch, English, Spanish, German, French and Italian.
You also might find this website with inflection lists for German, English, Spanish, French, Italian, Portuguese and Russian useful.
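If you go the inflection-list route, a plain lookup table is probably the simplest thing to try. Below is a rough Python sketch, not a finished tool. The "form, TAB, lemma" file layout, the `SAMPLE` data and the function names `load_lemma_table` and `lemmatize` are all assumptions made up for illustration; real inflection lists will need their own parsing. It also lowercases everything (so German noun capitalization is lost) and looks at one token at a time, so it will not join phrasal or separable verbs.

```python
import re

# Hypothetical sample in an assumed "inflected form<TAB>lemma" layout.
SAMPLE = "children\tchild\ntakes\ttake\nlooking\tlook\ncarefully\tcareful\n"


def load_lemma_table(text):
    """Build a dict mapping lowercase inflected forms to their lemma."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        form, lemma = line.split("\t")
        table[form.lower()] = lemma
    return table


def lemmatize(text, table):
    """Lowercase the text, split it into letter-only tokens, look each one up.

    Tokens missing from the table are returned unchanged.
    """
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [table.get(token, token) for token in tokens]


table = load_lemma_table(SAMPLE)
lemmas = lemmatize("She takes the children", table)
print(lemmas)
```

The resulting lemma list can then be compared against a set of known words to get the "unknown" list for a given book.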
3 x