Word analysis - dialog vs. frequency list

maschingon · Postby **maschingon** » Mon Sep 12, 2016 6:05 pm

I apologize in advance if this counts as a double post - yesterday, for some reason I had the bright idea to post this exact question in the Spanish forum (in Spanish, obviously). I now realize that that was not the smartest move, as there was no reason to limit my audience to only those who can understand Spanish.

Anyway, for a while now I've been trying to figure out how to compare a dialog's content with a frequency list. Put figuratively, I'm trying to lay the first list (the unique words contained in the dialog) on top of the frequency list to come up with a third list that shows the holes in the dialog, meaning what is and what is not covered by it.

I have 9 dialogs that correspond to the transcriptions of 9 episodes of a TV show (Bela a feia, the Brazilian version of Betty la fea / Ugly Betty). So I want to figure out 2 things: what the first episode alone covers, and what is covered in the first 9 episodes.

Also, I'll include the warning/apology I put on the Spanish post: I'm sorry if my username is offensive to anyone, I'll change it as soon as I figure out how --> "chingón" means something like "badass", so "el más chingón" would be a slightly vulgar way to say "the greatest". However, it is indeed an expletive, albeit a very popular one, so I apologize in advance if this offends anyone.

¡Gracias!

DangerDave2010 · Postby **DangerDave2010** » Mon Sep 12, 2016 6:39 pm

I'd do something like this in Python:

dialogueText = u'Olá como vai\n[...]\nFIM'
# Unique lemmas in the dialogue, most frequent first
dialogueWords = set(lemmatize(tokenize(dialogueText)))
myWordList = sorted(dialogueWords, key=lambda word: wordFreq.get(word, 0), reverse=True)
# Then the 5000 most frequent words that the dialogue does not cover
myWordList += sorted([a for a in wordFreq if a not in dialogueWords], key=lambda word: wordFreq[word], reverse=True)[:5000]
with open('minhaListaDePalavras.txt', 'wb') as f:
    f.write(u'\n'.join(myWordList).encode('utf8', 'ignore'))

I'd still need to write the tokenize() and the lemmatize() functions, as well as load the frequency dictionary.
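
A minimal sketch of those three missing pieces, using only the standard library. The regex tokenizer, the lookup-table lemmatizer, and the tab-separated "word, tab, count" file format are all assumptions made for illustration; none of them is specified above, and a real lemmatizer would more likely come from an NLP library or an inflection list.

```python
import io
import re

def tokenize(text):
    """Lowercase the text and split it into runs of letters (Unicode-aware)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def lemmatize(tokens, lemma_table=None):
    """Map each token to its lemma via a lookup table (hypothetical helper).

    Tokens missing from the table are returned unchanged.
    """
    lemma_table = lemma_table or {}
    return [lemma_table.get(tok, tok) for tok in tokens]

def load_word_freq(lines):
    """Build {word: count} from lines of 'word<TAB>count' (assumed format)."""
    word_freq = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        word, count = line.split("\t")
        word_freq[word] = int(count)
    return word_freq

# Tiny made-up data, for illustration only.
sample = io.StringIO("de\t1000\ncomo\t400\nir\t350\n")
wordFreq = load_word_freq(sample)
words = lemmatize(tokenize("Olá, como vai?"), {"vai": "ir"})
print(words)
```

With a real frequency file, `load_word_freq(open(path, encoding="utf8"))` would replace the `io.StringIO` stand-in, where `path` is wherever the list happens to be stored.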

yong321 · Postby **yong321** » Thu Sep 15, 2016 5:04 pm

I think you're saying: extract all the unique words from the dialog, look each of them up in the frequency list, and see which frequency-list words the dialog does NOT use. If you use a database (Access, MySQL, SQL Server, Oracle etc.), it's easy.

-- MINUS is Oracle's spelling; SQL Server calls the same operator EXCEPT
select word from frequency_table
minus
select word from dialog_word_list;

Of course you need to load the frequency words into frequency_table and the dialog words into dialog_word_list first. If the frequency list only contains lemmas, you need to lemmatize the words in the dialog as well. Some databases have built-in functionality for that. (In Oracle, lemma is called stem.) Depending on the size of the frequency list, the query shown above may give you a very long list of words, unless your dialog contains millions of words covering a wide range of topics.
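
For anyone without a database server at hand, the same set difference can be tried with the SQLite module that ships with Python. This is only a sketch: the rows are made up, the dialog table is named `dialog_word_list` here, and SQLite spells the operator EXCEPT rather than MINUS.

```python
import sqlite3

# Made-up sample data, for illustration only.
frequency_words = ["de", "que", "ser", "ir", "como", "casa"]
dialog_words = ["como", "ir", "oi"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE frequency_table (word TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE dialog_word_list (word TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO frequency_table VALUES (?)",
                 [(w,) for w in frequency_words])
conn.executemany("INSERT INTO dialog_word_list VALUES (?)",
                 [(w,) for w in dialog_words])

# Frequency-list words that never occur in the dialog.
rows = conn.execute(
    "SELECT word FROM frequency_table "
    "EXCEPT "
    "SELECT word FROM dialog_word_list "
    "ORDER BY word"
).fetchall()
missing = [word for (word,) in rows]
conn.close()
print(missing)
```

Swapping the two SELECTs gives the opposite list: dialog words that the frequency list does not contain.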

maschingon · Postby **maschingon** » Fri Sep 16, 2016 4:06 pm

To yong321 and DangerDave: First of all, sick names, both of you. Second of all, incredible advice. And yes, yong321, you understood my goal correctly. I have no idea how to code, but I've been meaning to learn and this is a great chance to figure it out. THANK YOU, both of you!

The system doesn't notify you when a thread you created gets a reply, only when someone quotes you. This should be changed, no?

Serpent · Postby **Serpent** » Fri Sep 16, 2016 4:20 pm

I think you'll get notifications if you click "follow"

Adrianslont · Postby **Adrianslont** » Fri Sep 16, 2016 11:28 pm

maschingon wrote:
The system doesn't notify you when a thread you created gets a reply, only when someone quotes you. This should be changed, no?

Use the drop-down menu next to the spanner/wrench at the bottom of the page and choose "subscribe topic".

yong321 · Postby **yong321** » Sat Sep 17, 2016 2:16 am

maschingon:

Looks like I didn't misunderstand your intention. But I'm curious what your real goal is and how you can really achieve it. I'm interested in any type of work involving word frequency. I've done some work on French word frequency based on the data from lexique.org, and also created frequency lists of several languages using my own method.

> To yong321 and DangerDave: First of all, sick names, both of you.

Haha! Indeed I've got a sick username. Almost 20 years ago I chose it as my yahoo.com username (yong123 was already taken by a Korean guy). The name contains my first name followed by sequential numbers. Both parts can be said clearly on the phone if a person calls and asks for my email. Back then, there was no instant messenger.

DangerDave is an even sicker name, don't you think?

maschingon · Postby **maschingon** » Sat Sep 17, 2016 7:29 am

yong321 wrote:maschingon:

Looks like I didn't misunderstand your intention. But I'm curious what your real goal is and how you can really achieve it. I'm interested in any type of work involving word frequency. I've done some work on French word frequency based on the data from lexique.org, and also created frequency lists of several languages using my own method.

Interesting, I'll have to check out lexique.org tomorrow. My real goal is to figure out how effective a TV series could be as a language-learning curriculum. I'd like to see what is covered in just the first episode, what's covered in the first 10 episodes, etc. I assume that a show with many, many episodes will cover basically everything you would ever need to know conversationally but will miss a lot of specific nouns, which I could make up for by going down the frequency list to the range that matches the learner's current level. It'll be interesting to see what comes out of it.
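
One way to put numbers on that idea, as a rough sketch: treat each episode as a set of lemmas, take the running union over the episodes, and measure what share of the top-N frequency words that union contains. The function name `cumulative_coverage` and the toy data are invented for illustration; real input would be the lemmatized transcripts and a real frequency list.

```python
def cumulative_coverage(episodes, ranked_words, top_n):
    """Return, per episode, the share of the top_n ranked words seen so far.

    episodes: list of sets of lemmas, one set per episode, in viewing order.
    ranked_words: frequency list, most frequent word first.
    """
    target = set(ranked_words[:top_n])
    if not target:
        raise ValueError("top_n must select at least one word")
    seen = set()
    shares = []
    for episode_words in episodes:
        seen |= episode_words  # everything heard up to this episode
        shares.append(len(seen & target) / len(target))
    return shares

# Toy data, for illustration only.
ranked = ["de", "que", "ser", "ir", "como", "casa", "ter", "bom"]
episodes = [{"como", "ir", "oi"}, {"ser", "casa", "ir"}, {"de", "que"}]

shares = cumulative_coverage(episodes, ranked, top_n=8)
all_seen = set().union(*episodes)
holes = [w for w in ranked if w not in all_seen]
print(shares)  # coverage after episode 1, episodes 1-2, episodes 1-3
print(holes)   # frequency-list words no episode has used yet
```

The first entry of `shares` answers "what does the first episode alone cover", the last entry answers the same question for all episodes together, and `holes` is the list to study separately.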

I signed up for Udacity's intro to programming course and did my first 4 hours of training today. I can already tell this is going to be perfect for me and my goals (I hope...). I've always been a natural at math, so I'm hoping coding will come easily as well.

DangerDave is indeed a sick name, but you can't really compare DangerDave and Yong321. They taste completely different, it's like comparing oranges and apples.

Doitsujin · Postby **Doitsujin** » Sat Sep 17, 2016 9:53 am

DangerDave2010 wrote: I'd still need to write the tokenize() and the lemmatize() functions, as well as load the frequency dictionary.

AFAIK, NLTK and TextBlob already have ready-made functions for this. For some languages (es, de, fr, it, nl), however, you'll get better lemmatization results with the Pattern web mining library.

You might also find the dsl2mobi GitHub site helpful. It hosts English, French, Italian, German, Spanish, Portuguese, Polish and Russian inflection lists, which make reducing inflected forms to their canonical forms relatively easy. (The lists aren't perfect, though; some rarer forms are not listed.)
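
As a rough illustration of how such a list can be used: invert it into a form-to-lemma dictionary and look each token up. The line format assumed here (lemma, colon, comma-separated forms) is a guess made for the example and may not match the actual files, and `build_lemma_table` and `to_lemmas` are hypothetical helper names.

```python
def build_lemma_table(lines):
    """Invert 'lemma: form1, form2, ...' lines into a {form: lemma} dict."""
    table = {}
    for line in lines:
        if ":" not in line:
            continue  # skip blank or malformed lines
        lemma, forms = line.split(":", 1)
        lemma = lemma.strip()
        table[lemma] = lemma
        for form in forms.split(","):
            form = form.strip()
            if form:
                # An ambiguous form keeps the first lemma it was listed under.
                table.setdefault(form, lemma)
    return table

def to_lemmas(tokens, table):
    """Replace each token by its lemma; unlisted forms pass through unchanged."""
    return [table.get(tok, tok) for tok in tokens]

# Made-up sample lines in the assumed format.
sample_lines = ["ir: vou, vai, vamos, foi", "ser: sou, somos, foi"]
table = build_lemma_table(sample_lines)
lemmas = to_lemmas(["como", "vai", "foi"], table)
print(lemmas)
```

Note that "foi" belongs to both verbs in the sample, so a plain lookup has to pick one; that ambiguity, along with the rarer forms missing from the lists, is where this approach loses accuracy.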

MorkTheFiddle · Postby **MorkTheFiddle** » Wed Jul 12, 2017 6:44 pm

Bump. What ever happened to this word analysis experiment?
