The statistical distribution of language difficulty

s_allard
Blue Belt
Posts: 969
Joined: Sat Jul 25, 2015 3:01 pm
Location: Canada
Languages: French (N), English (N), Spanish (C2 Cert.), German (B2 Cert)
x 2300

Re: The statistical distribution of language difficulty

Postby s_allard » Sat Jul 31, 2021 1:16 am

Le Baron wrote:
ryanheise wrote:And if you're interested in what the words were in Shrek that were beyond 98% comprehension, they were:

measuring, gent, homey, hideous, ballad, freshness, caterer, dignified, bachelorette, sparkling, unorthodox, isle, decorator, meteor, firewood, wed, ail, jackass, rescuer, redhead, valiant, reek, huff, magnetism, knights, sonnet, gingerbread, stench, enchantment, saucy, sharpest, leaver, decapitate, dazzling, thine, preposterous, shilling, pitchfork, minty, brimstone, raincoat, damsel, ta, beset, twinge, colada, rickety, veal, steed, stalwart, uninvited, pheromone, deride, cruelly, highness, pocus, asthmatic, chatterbox, hocus, yonder, rotisserie, rescuing, bonehead, parfait, camping, resettlement, slobber, Knights, tartare, eking, tush, compensating, gumdrop, hmph, dolt, backstreet, toadstool, slobbery, housefly, superfly, outdrew, tubbing

It's interesting that "measuring" appeared in that list, even though the words were lemmatised. It turns out that in this instance, measuring was used as a noun, so it was counted as a distinct word: "I'll let you do the measuring when you see him tomorrow."

I'm wondering how things like 'housefly' or 'raincoat' make it outside the top level of comprehension. They seem fairly self-evident as compounds of simple words!



A very important observation here that goes to the heart of the theory and method at hand. First of all, I alluded in my previous post to a confusion in the use of the terms comprehension and word coverage. They are obviously not the same thing. All serious vocabulary studies use the term coverage when referring to the presence of words in a text or medium. We then have to determine what percentage coverage is required for comprehension, which is the subjective appreciation of the meaning or significance of the message. We typically hear of 98% word coverage for unassisted comprehension of printed or scripted materials. Things are obviously more complicated when it comes to spontaneous conversations or even movies and presumably podcasts, where the nature of the voices and other elements must be taken into account.

As can be imagined, I get very irritated when I see loose talk of 90% or 80% or 50% comprehension, but I usually just let it slide. However this is particularly important for the issue raised here. How can some pretty simple words like freshness, raincoat or housefly be beyond 98% comprehension ?

The answer, simply put, is that it has nothing to do with comprehension. It’s all about word frequencies. Freshness, raincoat, housefly, just like caterer and firewood among many others didn’t make the cut for 98% word coverage.
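To make the distinction concrete, here is a minimal sketch of how a "required vocabulary for 98% coverage" figure can be computed from word frequencies alone. It is not the method used for the charts in this thread; the function name `required_vocabulary`, the toy frequency list and the toy token list are all hypothetical, chosen only to show that a simple word like "housefly" can push the figure up purely because of its rank.

```python
from collections import Counter

def required_vocabulary(tokens, frequency_rank, target=0.98):
    """Smallest N such that the N most common words of the reference list
    cover at least `target` of the running words in `tokens`.
    Returns None if the target cannot be reached."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Order the text's distinct words by their rank in the reference list;
    # words missing from the list can never be covered.
    ranked = sorted(
        (frequency_rank[word], n) for word, n in counts.items() if word in frequency_rank
    )
    covered = 0
    for rank, n in ranked:
        covered += n
        if covered / total >= target:
            return rank
    return None

# Hypothetical toy data: rank 1 = most frequent word in the reference corpus.
frequency_rank = {"the": 1, "a": 2, "ogre": 3, "swamp": 4, "housefly": 5}
tokens = ["the"] * 6 + ["a"] * 2 + ["ogre", "housefly"]

vocab_80 = required_vocabulary(tokens, frequency_rank, target=0.80)
vocab_100 = required_vocabulary(tokens, frequency_rank, target=1.00)
print(vocab_80, vocab_100)
```

Nothing in this calculation asks whether a viewer would understand "housefly"; only its position in the frequency list matters.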

In fact, once the assumption that 98% coverage is necessary for unassisted comprehension is accepted, comprehension no longer enters into the picture. This leads to some interesting observations. We are told that according to chart1 in this thread, Shrek rates a 13,999, meaning – if I understand the chart correctly – that it requires more vocabulary than approximately 36000 of the 40000 podcasts in the corpus here. As for difficulty, it’s the same thing.

I find it intriguing that a movie meant for children requires more vocabulary and is more difficult than the vast majority of 40000 podcasts aimed at adults. At least that’s my interpretation of the charts – I might be wrong.

The big question of course is what do children and even adults understand when watching this movie and how do they do so ? Obviously the movie is more than just a list of words and this is why I always wonder what is the significance of all these vocabulary studies for us language hobbyists. I can see the utility in formal language classes and learning materials design but anybody who has tried to learn a language from a word-frequency list knows that it doesn’t work very well.

The only conclusion I come to is that a learner of English would do very well to watch Shrek many times and study every word. It may be more useful than watching thousands of podcasts.
2 x

ryanheise
Green Belt
Posts: 459
Joined: Tue Jun 04, 2019 3:13 pm
Location: Australia
Languages: English (N), Japanese (beginner)
x 1681

Re: The statistical distribution of language difficulty

Postby ryanheise » Sat Jul 31, 2021 3:19 am

Le Baron wrote:
ryanheise wrote:And if you're interested in what the words were in Shrek that were beyond 98% comprehension, they were:

measuring, gent, homey, hideous, ballad, freshness, caterer, dignified, bachelorette, sparkling, unorthodox, isle, decorator, meteor, firewood, wed, ail, jackass, rescuer, redhead, valiant, reek, huff, magnetism, knights, sonnet, gingerbread, stench, enchantment, saucy, sharpest, leaver, decapitate, dazzling, thine, preposterous, shilling, pitchfork, minty, brimstone, raincoat, damsel, ta, beset, twinge, colada, rickety, veal, steed, stalwart, uninvited, pheromone, deride, cruelly, highness, pocus, asthmatic, chatterbox, hocus, yonder, rotisserie, rescuing, bonehead, parfait, camping, resettlement, slobber, Knights, tartare, eking, tush, compensating, gumdrop, hmph, dolt, backstreet, toadstool, slobbery, housefly, superfly, outdrew, tubbing

It's interesting that "measuring" appeared in that list, even though the words were lemmatised. It turns out that in this instance, measuring was used as a noun, so it was counted as a distinct word: "I'll let you do the measuring when you see him tomorrow."

I'm wondering how things like 'housefly' or 'raincoat' make it outside the top level of comprehension. They seem fairly self-evident as compounds of simple words!


True, compound words are interesting because you can use your knowledge of the component words to guess the meaning of the compound word. The comprehension value of that could be some fractional amount rather than just 1 or 0. E.g. if I know "house" and "fly", then maybe I have 0.5 comprehension of "housefly" but I'm not 100% sure of its meaning. I don't know whether it's any fly that comes into your house, or whether it's actually a specific species of fly. I can imagine a child asking "Mum, what is a backstreet?" And maybe after hearing the explanation in terms of its component words, it will be clear, while until then the child will only be able to guess to some fractional degree and have partial comprehension of that word. Then if the learner knows the definition of the actual compound word itself, we can say it's a full 1 for comprehension.
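As a rough sketch of that idea (not something implemented in the current formula), fractional comprehension for compounds could look like the following. The function `word_comprehension`, the 0.5 value for a guessed compound, and the hand-written `compound_parts` table are illustrative assumptions.

```python
def word_comprehension(word, known_words, compound_parts, partial=0.5):
    """Score one word: 1.0 if the word itself is known, `partial` if it is
    a compound whose component words are all known, otherwise 0.0."""
    if word in known_words:
        return 1.0
    parts = compound_parts.get(word)
    if parts and all(part in known_words for part in parts):
        return partial
    return 0.0

# Hypothetical learner vocabulary and compound breakdowns.
known_words = {"house", "fly", "rain", "coat", "raincoat"}
compound_parts = {
    "housefly": ("house", "fly"),
    "raincoat": ("rain", "coat"),
    "backstreet": ("back", "street"),
}

scores = {
    word: word_comprehension(word, known_words, compound_parts)
    for word in ["housefly", "raincoat", "backstreet", "beset"]
}
print(scores)
```

Here "housefly" gets partial credit because both parts are known, "raincoat" gets full credit because the compound itself is known, and "backstreet" and "beset" get none.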

It is something I could potentially look at in the future, although it's still an open question how significant an effect this would have. My difficulty formula is not as binary as Paul Nation's in the first place, so if compound words make up a small proportion of total words, as they do here (note that the vast majority of words in that unknown 2% list are non-compound words that are genuinely rarer, e.g. "beset" truly is less likely to be known than something slightly higher up, like "soulful"), then the formula treats it accordingly and will only affect the score in a small way. But thanks for bringing this up, as there are some ideas I could try out in the future.
3 x

ryanheise
Green Belt
Posts: 459
Joined: Tue Jun 04, 2019 3:13 pm
Location: Australia
Languages: English (N), Japanese (beginner)
x 1681

Re: The statistical distribution of language difficulty

Postby ryanheise » Sat Jul 31, 2021 6:09 am

s_allard wrote:First of all, I alluded in my previous post to a confusion in the use of the terms comprehension and word coverage. They are obviously not the same thing.


I agree! But yes, you will need to take this terminology a bit loosely, since my goal is to actually diverge from the graph 1 methodology and develop something more useful in graph 2. The idea of the Paul Nation calculation (the one employed in graph 1) is to arrive at comprehension from 98% coverage. And the way that he does it is rather binary. On one side of that sharp 98% boundary you comprehend the text, while on the other side of the boundary you don't comprehend it. I do not think this model of comprehension is quite right. In reality, comprehension falls on a scale: adults will comprehend a greater percentage of Shrek than a young child. Paul Nation's methodology, however, is just not ideal for capturing that.

That is one motivation for instead trying to formulate the difficulty score in the second graph. The difficulty score does not use a sharp knife to cut comprehension into binary outcomes, but rather uses a gradual scale. Going a single word under 98% coverage will not have a complete 180 effect on the result; it will only have a very gradual effect on the result.
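To illustrate the difference between a sharp cut-off and a gradual scale, here is a small sketch. The logistic curve and its `midpoint` and `steepness` parameters are assumptions picked purely for illustration; this is not the actual difficulty formula behind graph 2.

```python
import math

def binary_comprehension(coverage, threshold=0.98):
    """Sharp cut-off: comprehension is 1 at or above the threshold, else 0."""
    return 1.0 if coverage >= threshold else 0.0

def graded_comprehension(coverage, midpoint=0.98, steepness=50.0):
    """Smooth alternative: a logistic curve centred on the same point.
    The curve shape and parameters are illustrative assumptions."""
    return 1.0 / (1.0 + math.exp(-steepness * (coverage - midpoint)))

for coverage in (0.97, 0.979, 0.98, 0.99):
    print(coverage, binary_comprehension(coverage), round(graded_comprehension(coverage), 3))
```

With the binary version, dropping from 98% to 97.9% coverage flips the result from 1 to 0, while the graded version only moves slightly.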

But terminology aside, let's focus on what the results mean:

We are told that according to chart1 in this thread, Shrek rates a 13,999, meaning – if I understand the chart correctly – that it requires more vocabulary than approximately 36000 of the 40000 podcasts in the corpus here. As for difficulty, it’s the same thing.


For difficulty, it's not quite the same thing, it's actually ranked a bit easier on that scale, although yes, there still is a lot of vocabulary in Shrek that is not among the most common words and therefore less likely for a learner to know.

I find it intriguing that a movie meant for children requires more vocabulary and is more difficult than the vast majority of 40000 podcasts aimed at adults. At least that’s my interpretation of the charts – I might be wrong.

The big question of course is what do children and even adults understand when watching this movie and how do they do so ? Obviously the movie is more than just a list of words ...


I agree that the visual element adds more comprehensibility to the movie. If we look at comprehension as a percentage, then I expect young children will have a lower comprehension rate, but will still maintain interest because they are following (and are entertained by) the visuals.

... and this is why I always wonder what is the significance of all these vocabulary studies for us language hobbyists. I can see the utility in formal language classes and learning materials design but anybody who has tried to learn a language from a word-frequency list knows that it doesn’t work very well.


I wouldn't say the message from this is to learn words from a frequency list. Instead, I see frequency data as allowing us to estimate the probability that a word is known, and from that, estimate the difficulty of a given document. At least what I am trying to do is simply rank all of the content by difficulty to make it easier for a learner to find something that's at their level. It doesn't really need to be perfect, either. Supposing you had a list of content sorted by "approximate" difficulty, you could just move the scroll bar around, click on a random item to listen to it, and judge for yourself whether you think it's at your level. If it's too hard, you could move the slider one way and try again, or if it's too easy for you, you could move the slider in the other direction and try again.
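A minimal sketch of that pipeline, under loudly stated assumptions: the model in `p_known` (a logistic curve over log frequency rank, centred on the learner's vocabulary size), the `spread` parameter, and the three toy episodes are all hypothetical and are not the formula or data used for the graphs in this thread.

```python
import math

def p_known(rank, learner_vocab_size, spread=0.5):
    """Assumed model: probability that a learner knows the word at a given
    frequency rank, falling smoothly around the learner's vocabulary size."""
    return 1.0 / (1.0 + math.exp((math.log(rank) - math.log(learner_vocab_size)) / spread))

def difficulty(token_ranks, learner_vocab_size):
    """Expected share of running words the learner does not know."""
    unknown = sum(1.0 - p_known(rank, learner_vocab_size) for rank in token_ranks)
    return unknown / len(token_ranks)

# Hypothetical documents, each given as the frequency ranks of its tokens.
documents = {
    "episode_c": [2, 50, 900, 4000, 9000, 15000],
    "episode_a": [1, 2, 5, 10, 40, 300],
    "episode_b": [1, 3, 20, 800, 6000, 12000],
}

# Sort from easiest to hardest for a learner with about 3000 known words.
ranked = sorted(documents, key=lambda name: difficulty(documents[name], 3000))
print(ranked)
```

The ordering only needs to be approximately right: the learner still samples an item near their position in the list and moves up or down from there.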
3 x

rdearman
Site Admin
Posts: 7231
Joined: Thu May 14, 2015 4:18 pm
Location: United Kingdom
Languages: English (N)
Language Log: viewtopic.php?f=15&t=1836
x 23122

Re: The statistical distribution of language difficulty

Postby rdearman » Sat Jul 31, 2021 9:27 am

About Shrek. There is a lot of adult humour in that movie which goes right over children's heads. Recently I read a French book and didn't understand one section because I am not French and didn't live there in the 80s. So even though the children and I understood more than 98% of the words, we didn't comprehend.

So getting hung up on word a, b or c isn't all that useful. Taking a general study such as this graph and trying to shoehorn particular cases into it will not work. Every learner is different. You cannot say that just because more than 98% of humans have legs, all 98% of them can walk.
4 x

ryanheise
Green Belt
Posts: 459
Joined: Tue Jun 04, 2019 3:13 pm
Location: Australia
Languages: English (N), Japanese (beginner)
x 1681

Re: The statistical distribution of language difficulty

Postby ryanheise » Sat Jul 31, 2021 9:46 am

I agree, we should not try to read too much into this. Of course it is necessary to get hung up on word a, b, c for the purpose of improving the algorithm because ultimately it is based on the words, but the goal is to approximately and automatically rank the difficulty of content to make it easier to find something that is at your level. Ultimately it is up to the learner, though, to take that list and make their own pick based on which content feels good for them.

I have some updated numbers on the Shrek transcript. I noticed that my preprocessing was off because I was pulling the script from a different source than I normally do and it required some different processing from podcasts.

Known words%: required vocabulary

* 98%: 11,064
* 95%: 5,306
* 90%: 2,042
* 85%: 823
* 80%: 357

And for the difficulty score, the new number is 28,707.

So slightly easier, but still on the difficult end (and looking at the vocabulary in Shrek compared to a typical conversational podcast, I am not too surprised by that.)
2 x

s_allard
Blue Belt
Posts: 969
Joined: Sat Jul 25, 2015 3:01 pm
Location: Canada
Languages: French (N), English (N), Spanish (C2 Cert.), German (B2 Cert)
x 2300

Re: The statistical distribution of language difficulty

Postby s_allard » Sat Jul 31, 2021 12:07 pm

ryanheise wrote:I agree, we should not try to read too much into this. Of course it is necessary to get hung up on word a, b, c for the purpose of improving the algorithm because ultimately it is based on the words, but the goal is to approximately and automatically rank the difficulty of content to make it easier to find something that is at your level. Ultimately it is up to the learner, though, to take that list and make their own pick based on which content feels good for them.

I have some updated numbers on the Shrek transcript. I noticed that my preprocessing was off because I was pulling the script from a different source than I normally do and it required some different processing from podcasts.

Known words%: required vocabulary

* 98%: 11,064
* 95%: 5,306
* 90%: 2,042
* 85%: 823
* 80%: 357

And for the difficulty score, the new number is 28,707.

So slightly easier, but still on the difficult end (and looking at the vocabulary in Shrek compared to a typical conversational podcast, I am not too surprised by that.)


I’m really intrigued by the fact that the children’s movie Shrek requires more vocabulary than around 35000 adult podcasts most of which probably last 20 – 30 minutes each whereas Shrek lasts only 95 minutes. Sure, podcasts can be very conversational but isn’t a movie all conversational ?

Paul Nation who studied the same movie starts by stating :

The popular children’s movie Shrek was chosen for analysis. The script,
excluding stage directions, is almost 10,000 tokens long, and uses a total
of almost 1,100 word-families.


https://www.lextutor.ca/cover/papers/nation_2006.pdf

So, in reality, Shrek does not contain that many different words. But how rare are those words ? Later on, the researchers conclude :

Let us now return to the question of how big a vocabulary you need in order to be familiar with most words in Shrek. Table 12 gives cumulative percentage coverage figures for the tokens in Shrek. Proper nouns account for 1.47% of the running words in Shrek. With a vocabulary of 4,000 word-families, and assuming that proper nouns are easily understood, 96.70% of the tokens would be familiar to children watching the movie. This means that there would be 1 unknown word in about every 30 running words. With a vocabulary of 7,000 words plus proper nouns, 98.08% of the tokens would be familiar to children watching the movie. This means there would be 1 unknown word in about every 50 running words.

It should be pointed out here that although Shrek only contains around 1100 word-families, for reasons of methodology explained in the article, a viewer would need a vocabulary of the 7000 most common word-families in English, plus proper nouns, in order to be familiar with about 98% of the running words in the movie.
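The "1 unknown word in about every 30 (or 50) running words" figures in the quoted passage follow directly from the coverage percentages. A quick check, where the helper name `words_per_unknown` is my own:

```python
def words_per_unknown(coverage_percent):
    """Average number of running words per unknown word at a given coverage."""
    return 100.0 / (100.0 - coverage_percent)

for coverage in (96.70, 98.08):
    interval = words_per_unknown(coverage)
    print(f"{coverage}% coverage: about 1 unknown word in every {interval:.0f} running words")
```

This gives roughly 30 and 52, in line with the rounded figures in the article.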

The figure of 11,064 words of required vocabulary reported here can be explained by differences of methodology that I will leave to the author to explain. But the question remains : what words of Shrek are so different from those of 35000 podcasts ?

The interesting tidbit that strikes me is that with 357 words one gets 80% coverage in Shrek. So few words for a lot of coverage and probably 0 comprehension. And to go from 95% to 98% coverage you have to more than double the number of words. Fascinating indeed.
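Using only the updated figures reported above for Shrek, the cost of each step up in coverage can be worked out as follows. The variable names are mine; the numbers are the ones posted in this thread.

```python
# Required vocabulary at each coverage level, as reported for Shrek.
required = {80: 357, 85: 823, 90: 2042, 95: 5306, 98: 11064}

levels = sorted(required)
steps = []
for lower, upper in zip(levels, levels[1:]):
    extra_words = required[upper] - required[lower]
    per_point = extra_words / (upper - lower)
    steps.append((lower, upper, extra_words, per_point))
    print(f"{lower}% -> {upper}%: {extra_words} more words ({per_point:.0f} per percentage point)")

ratio_95_to_98 = required[98] / required[95]
print(f"98% needs {ratio_95_to_98:.2f} times the vocabulary of 95%")
```

Each additional percentage point of coverage costs far more words than the one before it, which is why the last few points dominate the required vocabulary.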
1 x

Le Baron
Black Belt - 3rd Dan
Posts: 3505
Joined: Mon Jan 18, 2021 5:14 pm
Location: Koude kikkerland
Languages: English (N), fr, nl, de, eo, Sranantongo,
Maintaining: es, swahili.
Language Log: https://forum.language-learners.org/vie ... 15&t=18796
x 9384

Re: The statistical distribution of language difficulty

Postby Le Baron » Sat Jul 31, 2021 12:41 pm

ryanheise wrote:True, compound words are interesting because you can use your knowledge of the component words to guess the meaning of the compound word. The comprehension value of that could be some fractional amount rather than just 1 or 0. E.g. if I know "house" and "fly", then maybe I have 0.5 comprehension of "housefly" but I'm not 100% sure of its meaning. I don't know whether it's any fly that comes into your house, or whether it's actually a specific species of fly. I can imagine a child asking "Mum, what is a backstreet?" And maybe after hearing the explanation in terms of its component words, it will be clear...

Yes, I suppose I'm not allowing for vagueness in meaning and approaching it from my own comprehension of English. I can see how someone might falter over the compound, a 'housefly' could be anything up to and including a case of a flying house! :D
2 x

luke
Brown Belt
Posts: 1243
Joined: Fri Aug 07, 2015 9:09 pm
Languages: English (N). Spanish (intermediate), Esperanto (B1), French (intermediate but rusting)
Language Log: https://forum.language-learners.org/vie ... 15&t=16948
x 3631

Re: The statistical distribution of language difficulty

Postby luke » Sat Jul 31, 2021 1:30 pm

Le Baron wrote:
ryanheise wrote:True, compound words are interesting because you can use your knowledge of the component words to guess the meaning of the compound word.

Yes, I suppose I'm not allowing for vagueness in meaning and approaching it from my own comprehension of English. I can see how someone might falter over the compound, a 'housefly' could be anything up to and including a case of a flying house! :D

Yes, and same with 'horsefly'. A flying horse? We've all seen pictures. Even if we know it's not a flying horse or a flying house (Wizard of Oz movie), it can still distract our attention or pull us into imagination.

There's also the 'interest' factor. One of the cool things about looking at a lot of data like ryanheise is doing is that one can use 'difficulty' as a factor in their "should I spend time on this" equation.

We can gauge difficulty ourselves, but sometimes science or maybe even data science can help.

To my head, concrete, steel, ice, are all hard. But if one wants to drill into one of them, a different tool is helpful for each.

A hammer is not going to build a house, and a nail-gun is not going to build a house, but a good tool improves efficiency. If you need to move in, efficiency counts.
Last edited by luke on Sat Jul 31, 2021 1:47 pm, edited 2 times in total.
4 x

Le Baron
Black Belt - 3rd Dan
Posts: 3505
Joined: Mon Jan 18, 2021 5:14 pm
Location: Koude kikkerland
Languages: English (N), fr, nl, de, eo, Sranantongo,
Maintaining: es, swahili.
Language Log: https://forum.language-learners.org/vie ... 15&t=18796
x 9384

Re: The statistical distribution of language difficulty

Postby Le Baron » Sat Jul 31, 2021 1:43 pm

I've seen a peanut stand, heard a rubber band and a needle that winked its eye.
1 x

Dragon27
Blue Belt
Posts: 616
Joined: Tue Aug 25, 2015 6:40 am
Languages: Russian (N)
English - best foreign language
Polish, Spanish - passive advanced
Tatar, German, French, Greek - studying
x 1375

Re: The statistical distribution of language difficulty

Postby Dragon27 » Sat Jul 31, 2021 2:10 pm

Le Baron wrote:I can see how someone might falter over the compound, a 'housefly' could be anything up to and including a case of a flying house! :D

luke wrote:Yes, and same with 'horsefly'. A flying horse?

I don't know. From the way these compounds are formed I don't see how 'horsefly' can mean 'a flying horse' (at best, it could mean "the flight of a horse"). Besides, it certainly would be obvious that it shouldn't mean that from the context.
1 x

