Word analysis - dialog vs. frequency list

maschingon
Yellow Belt
Posts: 64
Joined: Wed Aug 24, 2016 6:57 am
Location: Mérida, YUC, México
Languages: Spanish [C2], Portuguese [B2? C1 writing? no idea], Chinese [high intermediate level, no idea specifically]
Language Log: viewtopic.php?f=15&t=3906
x 33

Word analysis - dialog vs. frequency list

Postby maschingon » Mon Sep 12, 2016 6:05 pm

I apologize in advance if this counts as a double post - yesterday, for some reason I had the bright idea to post this exact question in the Spanish forum (in Spanish, obviously). I now realize that that was not the smartest move, as there was no reason to limit my audience to only those who can understand Spanish.

Anyways, for a while now I've been trying to figure out how to compare/contrast a dialog's content with a frequency list. In more figurative speech, I'm trying to put the 1st list of words (unique words contained in dialogue) on top of the frequency list, in order to come up with a third list that shows the holes in the dialog. Meaning, what is and what is not covered by the dialog.

I have 9 dialogs that correspond to the transcriptions of 9 episodes of a TV show (Bela a feia, the Brazilian version of Betty la fea / Ugly Betty). So I want to figure out 2 things: what the first episode alone covers, and what is covered in the first 9 episodes.

Also, I'll include the warning/apology I put on the Spanish post: I'm sorry if my username is offensive to anyone, I'll change it as soon as I figure out how --> "chingón" means something like "badass", so "el más chingón" would be a slightly vulgar way to say "the greatest". However, it is indeed an expletive, albeit a very popular one, so I apologize in advance if this offends anyone.

¡Gracias!
1 x
Michael King (seudónimos: Miguel Rey / Miguel Rei / 金一迈)

DangerDave2010
Orange Belt
Posts: 214
Joined: Sun Feb 14, 2016 5:10 am
Languages: gibberish (N)
x 291

Re: Word analysis - dialog vs. frequency list

Postby DangerDave2010 » Mon Sep 12, 2016 6:39 pm

I'd do in Python something like this:


# tokenize(), lemmatize() and the wordFreq dict (word -> frequency) still have to be provided
dialogueText = u'Olá como vai\n[...]\nFIM'
dialogueWords = set(lemmatize(tokenize(dialogueText)))
# words from the dialogue, most frequent first
myWordList = sorted(dialogueWords, key=lambda word: wordFreq.get(word, 0), reverse=True)
# then the 5000 most frequent words that the dialogue does not cover
myWordList = myWordList + sorted([a for a in wordFreq if a not in dialogueWords], key=lambda word: wordFreq[word], reverse=True)[:5000]
with open('minhaListaDePalavras.txt', 'wb') as f:
    f.write(u'\n'.join(myWordList).encode('utf8', 'ignore'))

I'd still need to write the tokenize() and the lemmatize() functions, as well as load the frequency dictionary.
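
As a rough illustration of the missing pieces, here is a minimal Python 3 sketch of a tokenizer and a frequency-dictionary loader. The function name load_word_freq and the "word, then count" line format are assumptions made for this example, not something given in the thread; real frequency lists come in different layouts. Lemmatization is left out here.

```python
import re

def tokenize(text):
    """Lowercase the text and return its words (accented letters included)."""
    # [^\W\d_]+ matches runs of letters; a hyphen or apostrophe may join two runs
    return re.findall(r"[^\W\d_]+(?:[-'][^\W\d_]+)*", text.lower())

def load_word_freq(lines):
    """Build a {word: count} dict from lines shaped like 'word<whitespace>count'.

    The line format is an assumption for this sketch; adapt it to the real list.
    """
    word_freq = {}
    for line in lines:
        parts = line.split()
        # skip headers or malformed lines that do not end in an integer count
        if len(parts) >= 2 and parts[-1].isdigit():
            word_freq[parts[0].lower()] = int(parts[-1])
    return word_freq

# Made-up sample data, only to show the shapes involved
sample_freq = load_word_freq(["de\t1000", "como\t300", "vai\t120", "olá\t40"])
sample_tokens = tokenize(u"Olá, como vai? Guarda-chuva!")
print(sample_tokens)
print(sample_freq)
```

With a real file, something like load_word_freq(open(path, encoding='utf8')) should work, since the loader only needs an iterable of lines.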
2 x

yong321
Orange Belt
Posts: 121
Joined: Thu Feb 25, 2016 12:42 am
Location: Texas
Languages: English, Chinese. Spanish, French, Italian, German, reading comprehension only.
Language Log: http://yong321.freeshell.org/misc.html#lang
x 140

Re: Word analysis - dialog vs. frequency list

Postby yong321 » Thu Sep 15, 2016 5:04 pm

I think you're saying: extract all the unique words from the dialog, look each of them up in the frequency list, and see which frequency-list words the dialog does NOT use. If you use a database (Access, MySQL, SQL Server, Oracle etc.), it's easy.

select word from frequency_table
minus
select word from dialog_word_list;

(MINUS is Oracle syntax; SQL Server, PostgreSQL and SQLite spell this operator EXCEPT.) Of course you need to load the frequency words into frequency_table and the dialog words into dialog_word_list first. If the frequency list contains only lemmas, you need to lemmatize the words in the dialog before loading them. Some databases have built-in functionality for that. (In Oracle, lemma is called stem.) Depending on the size of the frequency list, the query shown above may give you a very long list of words, unless your dialog contains millions of words covering a wide range of topics.
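
For anyone without a database server at hand, the same set-difference query can be tried with the SQLite engine bundled with Python. This is only a sketch: the table names mirror the query above (with the second one spelled dialog_word_list), the freq column and the sample rows are invented for illustration, and SQLite uses EXCEPT where Oracle uses MINUS.

```python
import sqlite3

# In-memory database, so nothing is written to disk
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE frequency_table (word TEXT PRIMARY KEY, freq INTEGER)")
conn.execute("CREATE TABLE dialog_word_list (word TEXT PRIMARY KEY)")

# Made-up sample rows; real data would come from the frequency list and the transcripts
conn.executemany(
    "INSERT INTO frequency_table VALUES (?, ?)",
    [("de", 1000), ("como", 300), ("vai", 120), ("casa", 90)],
)
conn.executemany(
    "INSERT INTO dialog_word_list VALUES (?)",
    [("olá",), ("como",), ("vai",)],
)

# Words in the frequency list that the dialog never uses
missing = [
    row[0]
    for row in conn.execute(
        "SELECT word FROM frequency_table "
        "EXCEPT SELECT word FROM dialog_word_list "
        "ORDER BY word"
    )
]
print(missing)
conn.close()
```

Note that "olá" is in the dialog but not in the sample frequency list, so it simply does not appear in the result; the query only reports frequency-list words the dialog lacks.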
2 x

maschingon
Yellow Belt
Posts: 64
Joined: Wed Aug 24, 2016 6:57 am
Location: Mérida, YUC, México
Languages: Spanish [C2], Portuguese [B2? C1 writing? no idea], Chinese [high intermediate level, no idea specifically]
Language Log: viewtopic.php?f=15&t=3906
x 33

Re: Word analysis - dialog vs. frequency list

Postby maschingon » Fri Sep 16, 2016 4:06 pm

To yong321 and DangerDave: First of all, sick names, both of you. Second of all, incredible advice. And yes, yong321, you are correct in my goal. I have no idea how to code, but I've been meaning to learn and this is a great chance to figure it out. THANK YOU both of you!

The system doesn't notify you when a thread you created gets a reply, only when someone quotes you. This should be changed, no?
1 x
Michael King (seudónimos: Miguel Rey / Miguel Rei / 金一迈)

Serpent
Black Belt - 3rd Dan
Posts: 3657
Joined: Sat Jul 18, 2015 10:54 am
Location: Moskova
Languages: heritage
Russian (native); Belarusian, Polish

fluent or close: Finnish (certified C1), English; Portuguese, Spanish, German, Italian
learning: Croatian+, Ukrainian; Romanian, Galician; Danish, Swedish; Estonian
exploring: Latin, Karelian, Catalan, Dutch, Czech, Latvian
x 5181

Re: Word analysis - dialog vs. frequency list

Postby Serpent » Fri Sep 16, 2016 4:20 pm

I think you'll get notifications if you click "follow" :)
1 x
LyricsTraining now has Finnish and Polish :)
Corrections welcome

Adrianslont
Blue Belt
Posts: 827
Joined: Sun Aug 16, 2015 10:39 am
Location: Australia
Languages: English (N), Learning Indonesian and French
x 1936

Re: Word analysis - dialog vs. frequency list

Postby Adrianslont » Fri Sep 16, 2016 11:28 pm

maschingon wrote:
The system doesn't notify you when a thread you created gets a reply, only when someone quotes you. This should be changed, no?

Drop down menu next to the spanner/wrench at the bottom of page - choose "subscribe topic".
2 x

yong321
Orange Belt
Posts: 121
Joined: Thu Feb 25, 2016 12:42 am
Location: Texas
Languages: English, Chinese. Spanish, French, Italian, German, reading comprehension only.
Language Log: http://yong321.freeshell.org/misc.html#lang
x 140

Re: Word analysis - dialog vs. frequency list

Postby yong321 » Sat Sep 17, 2016 2:16 am

maschingon:

Looks like I didn't misunderstand your intention. But I'm curious what your real goal is and how you can really achieve it. I'm interested in any type of work involving word frequency. I've done some work on French word frequency based on the data from lexique.org, and also created frequency lists of several languages using my own method.

> To yong321 and DangerDave: First of all, sick names, both of you.

Haha! Indeed I got a sick username. Almost 20 years ago I chose it as my yahoo.com username (yong123 was already taken by a Korean guy). This name contains my first name followed by sequential numbers. Both parts of it can be clearly said on the phone if a person calls and asks for my email. Back then, there was no instant messenger.

DangerDave is an even sicker name, don't you think?
0 x

maschingon
Yellow Belt
Posts: 64
Joined: Wed Aug 24, 2016 6:57 am
Location: Mérida, YUC, México
Languages: Spanish [C2], Portuguese [B2? C1 writing? no idea], Chinese [high intermediate level, no idea specifically]
Language Log: viewtopic.php?f=15&t=3906
x 33

Re: Word analysis - dialog vs. frequency list

Postby maschingon » Sat Sep 17, 2016 7:29 am

yong321 wrote:maschingon:

Looks like I didn't misunderstand your intention. But I'm curious what your real goal is and how you can really achieve it. I'm interested in any type of work involving word frequency. I've done some work on French word frequency based on the data from lexique.org, and also created frequency lists of several languages using my own method.


Interesting, I'll have to check lexique.org out tomorrow. My real goal is to figure out how effective a TV series could actually be if used as a language learning curriculum. I'd like to see what is covered in just the first episode, what's covered in the first 10 episodes, etc. I assume that a show with many, many episodes will cover basically everything that you would ever need to know conversationally, and will miss out on a lot of specific nouns, which I could make up for by going down the frequency range corresponding to the learner's current level. It'll be interesting to see what comes out of it.
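
One way to put numbers on "what is covered after episode 1, after episode 2, and so on" is a running coverage count. The sketch below assumes each episode has already been reduced to a list of lemmatized words and that the frequency list is ordered from most to least frequent; the function name cumulative_coverage and the sample data are made up for illustration.

```python
def cumulative_coverage(episodes, frequency_list):
    """Return (coverage, missing).

    coverage[i] is the share of frequency_list seen in episodes 0..i;
    missing lists the still-unseen words in frequency order.
    """
    target = set(frequency_list)
    if not target:
        return [], []
    seen = set()
    coverage = []
    for words in episodes:
        # only count words that actually appear in the frequency list
        seen |= target.intersection(words)
        coverage.append(len(seen) / len(target))
    missing = [word for word in frequency_list if word not in seen]
    return coverage, missing

# Made-up sample data: a 5-word frequency list and two tiny "episodes"
frequency_list = ["de", "que", "como", "vai", "casa"]
episodes = [
    ["olá", "como", "vai"],
    ["que", "casa", "bonita", "como"],
]
coverage, missing = cumulative_coverage(episodes, frequency_list)
print(coverage)  # share of the list covered after each episode
print(missing)   # frequency-list words no episode has used yet
```

Truncating frequency_list to the top N words before calling the function would give coverage for a particular learner level.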

I signed up for Udacity's intro to programming course and did my first 4 hours of training today. I can already tell this is going to be perfect for me and my goals (I hope...). I've always been a natural at math, so I'm hoping coding will come easy as well.

DangerDave is indeed a sick name, but you can't really compare DangerDave and Yong321. They taste completely different, it's like comparing oranges and apples.
1 x
Michael King (seudónimos: Miguel Rey / Miguel Rei / 金一迈)

Doitsujin
Green Belt
Posts: 404
Joined: Sat Jul 18, 2015 6:21 pm
Languages: German (N)
x 806

Re: Word analysis - dialog vs. frequency list

Postby Doitsujin » Sat Sep 17, 2016 9:53 am

DangerDave2010 wrote: I'd still need to write the tokenize() and the lemmatize() functions, as well as load the frequency dictionary.
AFAIK, NLTK and TextBlob already have ready-made functions for this. For some languages (es, de, fr, it, nl), however, you'll get better lemmatization results with the Pattern web mining library.

You might also find the dsl2mobi GitHub site helpful, which hosts English, French, Italian, German, Spanish, Portuguese, Polish and Russian inflection lists, which make reducing inflected forms to their canonical forms relatively easy. (The lists aren't perfect, though; some rarer forms are not listed.)
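
To show how an inflection list can stand in for a lemmatizer, here is a small lookup-table sketch. The "lemma: form, form, ..." line format is a hypothetical one chosen for this example and is not necessarily how the dsl2mobi lists are laid out; the function names are also made up.

```python
def build_lemma_lookup(lines):
    """Map every inflected form (and each lemma itself) to its lemma.

    Assumes hypothetical lines shaped like 'lemma: form1, form2, ...'.
    """
    lookup = {}
    for line in lines:
        lemma, _, forms = line.partition(":")
        lemma = lemma.strip().lower()
        if not lemma:
            continue
        lookup.setdefault(lemma, lemma)
        for form in forms.split(","):
            form = form.strip().lower()
            if form:
                # first lemma listed wins when a form is ambiguous
                lookup.setdefault(form, lemma)
    return lookup

def lemmatize_with_lookup(words, lookup):
    """Replace each word by its lemma; unknown words are kept unchanged."""
    return [lookup.get(word, word) for word in words]

# Made-up sample entries; 'foi' belongs to both 'ir' and 'ser' in Portuguese
sample_lines = ["ir: vou, vai, vamos, foi", "ser: sou, é, foi"]
lookup = build_lemma_lookup(sample_lines)
lemmas = lemmatize_with_lookup(["como", "vai", "foi", "é"], lookup)
print(lemmas)
```

A plain lookup cannot disambiguate forms shared by several lemmas (here "foi" is resolved to whichever lemma was listed first), and forms missing from the list pass through unchanged.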
2 x

MorkTheFiddle
Black Belt - 2nd Dan
Posts: 2132
Joined: Sat Jul 18, 2015 8:59 pm
Location: North Texas USA
Languages: English (N). Read (only) French and Spanish. Studying Ancient Greek. Studying a bit of Latin. Once studied Old Norse. Dabbled in Catalan, Provençal and Italian.
Language Log: https://forum.language-learners.org/vie ... 11#p133911
x 4869

Re: Word analysis - dialog vs. frequency list

Postby MorkTheFiddle » Wed Jul 12, 2017 6:44 pm

Bump. What ever happened to this word analysis experiment?
0 x
Many things which are false are transmitted from book to book, and gain credit in the world. -- attributed to Samuel Johnson

