I apologize in advance if this counts as a double post - yesterday, for some reason I had the bright idea to post this exact question in the Spanish forum (in Spanish, obviously). I now realize that that was not the smartest move, as there was no reason to limit my audience to only those who can understand Spanish.
Anyway, for a while now I've been trying to figure out how to compare a dialog's content against a frequency list. Figuratively speaking, I want to lay the first list (the unique words contained in the dialog) on top of the frequency list, in order to come up with a third list that shows the holes in the dialog: which frequent words it covers and which it doesn't.
I have 9 dialogs that correspond to the transcriptions of 9 episodes of a TV show (Bela a feia, the Brazilian version of Betty la fea / Ugly Betty). So I want to figure out 2 things: what the first episode alone covers, and what the first 9 episodes cover together.
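A minimal sketch of that comparison in Python, assuming the dialog has already been reduced to a set of unique words and the frequency list is ordered from most to least frequent. The function name `coverage_holes` and the sample words are made up for illustration; they are not taken from the real transcripts.

```python
def coverage_holes(dialog_words, frequency_list):
    """Split a ranked frequency list into words the dialog covers and words it misses.

    dialog_words: iterable of unique words found in the dialog.
    frequency_list: list of words ordered from most to least frequent.
    """
    seen = set(dialog_words)
    covered = [w for w in frequency_list if w in seen]
    missing = [w for w in frequency_list if w not in seen]
    # Words in the dialog that the frequency list does not contain at all.
    unlisted = sorted(seen - set(frequency_list))
    return covered, missing, unlisted


# Toy data (hypothetical).
frequency_list = ["de", "que", "não", "você", "casa", "trabalho"]
episode_1 = {"que", "você", "não", "feia"}

covered, missing, unlisted = coverage_holes(episode_1, frequency_list)
print(covered)
print(missing)
print(unlisted)
```

Both `covered` and `missing` keep the order of the frequency list, so the top of `missing` is the most common vocabulary the dialog never uses.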
Also, I'll include the warning/apology I put on the Spanish post: I'm sorry if my username is offensive to anyone, I'll change it as soon as I figure out how --> "chingón" means something like "badass", so "el más chingón" would be a slightly vulgar way to say "the greatest". However, it is indeed an expletive, albeit a very popular one, so I apologize in advance if this offends anyone.
Thanks!
Word analysis - dialog vs. frequency list
-
- Yellow Belt
- Posts: 64
- Joined: Wed Aug 24, 2016 6:57 am
- Location: Mérida, YUC, México
- Languages: Spanish [C2], Portuguese [B2? C1 writing? no idea], Chinese [high intermediate level, no idea specifically]
- Language Log: viewtopic.php?f=15&t=3906
- x 33
Word analysis - dialog vs. frequency list
1 x
Michael King (seudónimos: Miguel Rey / Miguel Rei / 金一迈)
-
- Orange Belt
- Posts: 214
- Joined: Sun Feb 14, 2016 5:10 am
- Languages: gibberish (N)
- x 291
Re: Word analysis - dialog vs. frequency list
I'd do in Python something like this:
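dialogueText = u'Olá como vai\n[...]\nFIM'
# Unique lemmas from the dialogue; tokenize(), lemmatize() and wordFreq are still to be defined.
dialogueWords = set(lemmatize(tokenize(dialogueText)))
# Dialogue words first, most frequent first (words missing from wordFreq count as 0).
myWordList = sorted(dialogueWords, key=lambda word: wordFreq.get(word, 0), reverse=True)
# Then the 5000 most frequent words that the dialogue does not cover.
myWordList += sorted([a for a in wordFreq if a not in dialogueWords], key=lambda word: wordFreq[word], reverse=True)[:5000]
with open('minhaListaDePalavras.txt', 'wb') as f:
    f.write(u'\n'.join(myWordList).encode('utf8', 'ignore'))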
I'd still need to write the tokenize() and the lemmatize() functions, as well as load the frequency dictionary.
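For what it's worth, here is one way those missing pieces could look, as a rough sketch only. The names `tokenize`, `lemmatize` and `wordFreq` come from the snippet above; `load_word_freq`, the tab-separated `word<TAB>count` file layout and the tiny lemma table are assumptions made for the example, not a description of any particular frequency list.

```python
import io
import re

# Runs of letters (including accented ones); digits and underscores are excluded.
WORD_RE = re.compile(r"[^\W\d_]+", re.UNICODE)


def tokenize(text):
    """Lowercase the text and return its words in order of appearance."""
    return WORD_RE.findall(text.lower())


def lemmatize(tokens, lemma_of=None):
    """Map each token to its lemma via a lookup table; unknown tokens pass through."""
    lemma_of = lemma_of or {}
    return [lemma_of.get(tok, tok) for tok in tokens]


def load_word_freq(lines):
    """Build {word: count} from lines shaped like 'word<TAB>count' (assumed format)."""
    word_freq = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        word, count = line.split("\t")
        word_freq[word] = int(count)
    return word_freq


# Toy stand-ins for a real frequency file and a real lemma table (hypothetical data).
freq_file = io.StringIO(u"ser\t900\ncomo\t500\nir\t400\nolá\t50\n")
wordFreq = load_word_freq(freq_file)
lemma_table = {u"vai": u"ir", u"é": u"ser"}

tokens = tokenize(u"Olá, como vai? Como é?")
lemmas = lemmatize(tokens, lemma_table)
print(lemmas)
```

With a real file you would pass an open file object to `load_word_freq` instead of the `io.StringIO` stand-in. The lookup-table lemmatizer is deliberately naive: it knows nothing about context, so a form that belongs to two lemmas gets whichever one the table lists.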
2 x
-
- Orange Belt
- Posts: 121
- Joined: Thu Feb 25, 2016 12:42 am
- Location: Texas
- Languages: English, Chinese. Spanish, French, Italian, German, reading comprehension only.
- Language Log: http://yong321.freeshell.org/misc.html#lang
- x 140
- Contact:
Re: Word analysis - dialog vs. frequency list
I think you're saying: extract all unique words from the dialog, look each of them up in the frequency list, and see which words the dialog does NOT use. If you use a database (Access, MySQL, SQL Server, Oracle etc.), it's easy.
-- MINUS is Oracle's keyword; SQL Server, PostgreSQL and SQLite spell it EXCEPT
select word from frequency_table
minus
select word from dialog_word_list;
Of course you need to load the frequency words into frequency_table and the dialog words into dialog_word_list first. If the frequency list only contains lemmas, you need to lemmatize the words in the dialog. Some databases have built-in functionality for that. (In Oracle, lemma is called stem.) Depending on the size of the frequency list, the query shown above may give you a very long list of words, unless your dialog contains millions of words covering a wide range of topics.
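If you don't have a database server at hand, the same set difference can be tried with the SQLite engine that ships with Python. This is only a sketch under assumptions: the table name `dialog_words`, the `freq` column and the sample rows are invented for the example, and SQLite uses EXCEPT where the query above uses Oracle's keyword.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE frequency_table (word TEXT PRIMARY KEY, freq INTEGER)")
conn.execute("CREATE TABLE dialog_words (word TEXT PRIMARY KEY)")

# Hypothetical sample data standing in for the real lists.
conn.executemany(
    "INSERT INTO frequency_table VALUES (?, ?)",
    [("de", 1000), ("que", 900), ("casa", 300), ("trabalho", 200)],
)
conn.executemany(
    "INSERT INTO dialog_words VALUES (?)",
    [("que",), ("casa",), ("feia",)],
)

# Frequency-list words that never occur in the dialog.
rows = conn.execute(
    """
    SELECT word FROM frequency_table
    EXCEPT
    SELECT word FROM dialog_words
    ORDER BY word
    """
).fetchall()
missing = [row[0] for row in rows]
print(missing)
conn.close()
```

Note that "feia" is in the dialog but not in the frequency table, so it does not show up in the result; swapping the two SELECTs would list such words instead.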
2 x
-
- Yellow Belt
- Posts: 64
- Joined: Wed Aug 24, 2016 6:57 am
- Location: Mérida, YUC, México
- Languages: Spanish [C2], Portuguese [B2? C1 writing? no idea], Chinese [high intermediate level, no idea specifically]
- Language Log: viewtopic.php?f=15&t=3906
- x 33
Re: Word analysis - dialog vs. frequency list
To yong321 and DangerDave: First of all, sick names, both of you. Second of all, incredible advice. And yes, yong321, you are correct in my goal. I have no idea how to code, but I've been meaning to learn and this is a great chance to figure it out. THANK YOU both of you!
The system doesn't inform you when a forum you created gets posted on, only when someone cites you. This should be changed, no?
1 x
Michael King (seudónimos: Miguel Rey / Miguel Rei / 金一迈)
- Serpent
- Black Belt - 3rd Dan
- Posts: 3657
- Joined: Sat Jul 18, 2015 10:54 am
- Location: Moskova
- Languages: heritage
Russian (native); Belarusian, Polish
fluent or close: Finnish (certified C1), English; Portuguese, Spanish, German, Italian
learning: Croatian+, Ukrainian; Romanian, Galician; Danish, Swedish; Estonian
exploring: Latin, Karelian, Catalan, Dutch, Czech, Latvian
- x 5181
- Contact:
Re: Word analysis - dialog vs. frequency list
I think you'll get notifications if you click "follow"
1 x
- Adrianslont
- Blue Belt
- Posts: 827
- Joined: Sun Aug 16, 2015 10:39 am
- Location: Australia
- Languages: English (N), Learning Indonesian and French
- x 1936
Re: Word analysis - dialog vs. frequency list
maschingon wrote: The system doesn't inform you when a forum you created gets posted on, only when someone cites you. This should be changed, no?
Drop down menu next to the spanner/wrench at the bottom of page - choose "subscribe topic".
2 x
-
- Orange Belt
- Posts: 121
- Joined: Thu Feb 25, 2016 12:42 am
- Location: Texas
- Languages: English, Chinese. Spanish, French, Italian, German, reading comprehension only.
- Language Log: http://yong321.freeshell.org/misc.html#lang
- x 140
- Contact:
Re: Word analysis - dialog vs. frequency list
maschingon:
Looks like I didn't misunderstand your intention. But I'm curious what your real goal is and how you can really achieve it. I'm interested in any type of work involving word frequency. I've done some work on French word frequency based on the data from lexique.org, and also created frequency lists of several languages using my own method.
> To yong321 and DangerDave: First of all, sick names, both of you.
Haha! Indeed I've got a sick username. Almost 20 years ago I chose it as my yahoo.com username (yong123 was already taken by a Korean guy). The name contains my first name followed by sequential numbers. Both parts of it can be said clearly on the phone if a person calls and asks for my email. Back then, there was no instant messenger.
DangerDave is an even sicker name, don't you think?
0 x
-
- Yellow Belt
- Posts: 64
- Joined: Wed Aug 24, 2016 6:57 am
- Location: Mérida, YUC, México
- Languages: Spanish [C2], Portuguese [B2? C1 writing? no idea], Chinese [high intermediate level, no idea specifically]
- Language Log: viewtopic.php?f=15&t=3906
- x 33
Re: Word analysis - dialog vs. frequency list
yong321 wrote: maschingon:
Looks like I didn't misunderstand your intention. But I'm curious what your real goal is and how you can really achieve it. I'm interested in any type of work involving word frequency. I've done some work on French word frequency based on the data from lexique.org, and also created frequency lists of several languages using my own method.
Interesting, I'll have to check lexique.org out tomorrow. My real goal is to figure out how effective a TV series could actually be if used as a language learning curriculum. I'd like to see what is covered in just the first episode, what's covered in the first 10 episodes, etc. I assume that a show with many, many episodes will cover basically everything that you would ever need to know conversationally, and will miss out on a lot of specific nouns, which I could make up for by going down the frequency range corresponding to the learner's current level. It'll be interesting to see what comes out of it.
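The episode-by-episode question could be sketched roughly like this in Python. It assumes each episode has already been turned into a set of lemmatized words and that the frequency list is ordered from most to least frequent; `cumulative_coverage`, `top_n` and the sample data are hypothetical names and values chosen for the example.

```python
def cumulative_coverage(episodes, frequency_list, top_n):
    """Track how much of the top_n frequency words has been seen after each episode.

    episodes: list of sets of words, one set per episode, in viewing order.
    frequency_list: words ordered from most to least frequent.
    Returns (shares, still_missing): shares[i] is the fraction of the top_n
    words seen in episodes 0..i; still_missing lists the uncovered words,
    most frequent first.
    """
    target = set(frequency_list[:top_n])
    if not target:
        raise ValueError("top_n must select at least one word")
    rank = {word: i for i, word in enumerate(frequency_list)}
    seen = set()
    shares = []
    for words in episodes:
        # Only words inside the top_n band count toward coverage.
        seen |= set(words) & target
        shares.append(len(seen) / len(target))
    still_missing = sorted(target - seen, key=rank.__getitem__)
    return shares, still_missing


# Toy data (hypothetical), three short "episodes".
frequency_list = ["de", "que", "não", "você", "casa", "trabalho"]
episodes = [{"que", "você"}, {"não", "que", "feia"}, {"casa"}]

shares, still_missing = cumulative_coverage(episodes, frequency_list, top_n=5)
print(shares)
print(still_missing)
```

Running it with different `top_n` values (say the top 1000, then the top 5000) would show where the series stops adding common vocabulary and the learner has to go down the frequency list instead.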
I signed up for Udacity's intro to programming course and did my first 4 hours of training today. I can already tell this is going to be perfect for me and my goals (I hope...). I've always been a natural at math, so I'm hoping coding will come easily as well.
DangerDave is indeed a sick name, but you can't really compare DangerDave and Yong321. They taste completely different, it's like comparing oranges and apples.
1 x
Michael King (seudónimos: Miguel Rey / Miguel Rei / 金一迈)
-
- Green Belt
- Posts: 404
- Joined: Sat Jul 18, 2015 6:21 pm
- Languages: German (N)
- x 806
Re: Word analysis - dialog vs. frequency list
DangerDave2010 wrote: I'd still need to write the tokenize() and the lemmatize() functions, as well as load the frequency dictionary.
AFAIK, NLTK and TextBlob already have ready-made functions for this. For some languages (es, de, fr, it, nl), however, you'll get better lemmatization results with the Pattern web mining library.
You might also find the dsl2mobi GitHub repository helpful. It hosts English, French, Italian, German, Spanish, Portuguese, Polish and Russian inflection lists, which make reducing inflected forms to their canonical forms relatively easy. (The lists aren't perfect, though; some rarer forms are not listed.)
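To illustrate the inflection-list idea, here is a small Python sketch. The line layout `lemma: form1, form2, ...` is purely an assumption for the example (I have not checked how the dsl2mobi lists are actually laid out), and `load_inflections` is a made-up helper name.

```python
def load_inflections(lines):
    """Build {inflected form: lemma} from lines like 'lemma: form1, form2' (assumed layout)."""
    lemma_of = {}
    for line in lines:
        line = line.strip()
        if not line or ":" not in line:
            continue
        lemma, forms = line.split(":", 1)
        lemma = lemma.strip()
        # A lemma is also its own canonical form.
        lemma_of.setdefault(lemma, lemma)
        for form in forms.split(","):
            form = form.strip()
            if form:
                # Keep the first lemma seen when a form is ambiguous.
                lemma_of.setdefault(form, lemma)
    return lemma_of


# Hypothetical sample lines; a real list would be read from a file.
sample = ["ir: vou, vai, vamos, foi", "ser: sou, é, somos, foi", "casa: casas"]
lemma_of = load_inflections(sample)

tokens = ["vai", "casas", "foi", "feia"]
# Forms missing from the list are left unchanged.
lemmas = [lemma_of.get(tok, tok) for tok in tokens]
print(lemmas)
```

The "foi" entry shows the main limitation: it is a form of both "ir" and "ser", and a plain lookup can only pick one, so counts for such words will be slightly off.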
2 x
- MorkTheFiddle
- Black Belt - 2nd Dan
- Posts: 2132
- Joined: Sat Jul 18, 2015 8:59 pm
- Location: North Texas USA
- Languages: English (N). Read (only) French and Spanish. Studying Ancient Greek. Studying a bit of Latin. Once studied Old Norse. Dabbled in Catalan, Provençal and Italian.
- Language Log: https://forum.language-learners.org/vie ... 11#p133911
- x 4869
Re: Word analysis - dialog vs. frequency list
Bump. Whatever happened to this word analysis experiment?
0 x
Many things which are false are transmitted from book to book, and gain credit in the world. -- attributed to Samuel Johnson