The statistical distribution of language difficulty

Beli Tsar · Postby **Beli Tsar** » Fri Jul 30, 2021 11:26 am

s_allard wrote:
Beli Tsar wrote:
s_allard wrote:I have asked a number of times : Don’t the podcasts within the same histogram or bucket have to have nearly identical vocabularies ? The answer is no as long as 98% of their vocabulary comes from the 5000-word list. But that said, I tend to think that podcasts of similar vocabulary breadth, i.e. similar unique word sizes would tend to have very similar sets of word tokens. For example, two podcasts with each the same 5000 unique words from my vocabulary set would have nearly identical token vocabularies with small differences in the two sets of 100 words that are not in my vocabulary.

This is simply not true, and the examples the OP and rdearman have given do make this clear. There is no requirement for the podcasts to cover all the words on the list, so they can differ both in the words that are in your vocabulary and in those that are not.

It's just the same as the way graded readers can be about radically different subjects while still sticking to a very narrow set of words - often much narrower than 5000!

I think I said quite clearly in my own answer that podcasts in the same histogram generally do not need to have identical vocabularies. But what happens when multiple podcasts contain a very large proportion of the listener's vocabulary? So a number of podcasts each contain 5000 unique words and I can fully comprehend each podcast with 98% word coverage. Doesn't that mean that each podcast has the same common 5000 words from my list plus the little two percent of individual variation? What other words can there be?

This is perhaps hypothetically true, but is not a realistic, real-life occurrence. Does any podcast - with 5000 unique words - exist?

Unlike Ryanheise, I don't have a podcast corpus to test this on. But to do a quick and dirty (Bayesian-style?) confirmation I used what I had. I'm a regular public speaker/have a youtube channel etc, not so different from podcasting, so I analysed a few scripts for unique words. There were between 700 and 760 for talks that were roughly around the 20-28 minute mark, and remarkable consistency within that range, so that 760 seems to be a high outlier. These are non-technical talks for normal people, but still, they aren't simplified for language learners, or anything like that. This seems to confirm the idea that we don't use that many unique words in this kind of communication - something that fits with your own philosophy of language learning and vocabulary acquisition, as I understand it from your older posts?

Any podcast with 5000 unique words would have to be either many, many hours long or near-incomprehensible gibberish.

s_allard · Postby **s_allard** » Fri Jul 30, 2021 12:22 pm

ryanheise wrote:...

But despite these documents both going into the same bucket, that does not imply ANYTHING about the breadth of vocabulary actually used in each document, how many unique words it has, how many times a word may be repeated in the document, or anything like that. As you can plainly see, these two documents are completely different in those respects. I have said this before, and other forum members have also pointed it out, but since you keep talking about these things such as document vocabulary size, unique tokens, etc., I just want to say it again, these things are not factors in the Paul Nation calculation as used in graph 1, and it gets us nowhere to keep talking about them if you truly want to understand what is depicted in graph 1.

Paul Nation's calculation is very basic (which is why I criticise it). It only looks at what is the most advanced word you need to know in each document. In document 1, that is "hat", and in document 2, that is "cat". Once the calculation finds that word, it pretty much ignores every other word in the document and says "hat". Right, that's bucket 7 for you. "cat" - right, that's bucket 7 for you, too.

First of all, thanks for all that work. It really wasn't necessary.

OK I may not be the sharpest knife in the drawer but I can read (and have written a few published academic papers). Here I read in the first post « the following chart plots how many of those episodes you would theoretically be able to understand with a vocabulary size of 1000, 2000, 3000, and each subsequent +1000 increment ». Sounds good. I’m intrigued.

The chart looks impressive indeed but I notice something a bit unusual. Let’s say I have a vocabulary of 10000 known words. (I should point out that we find out ex post-facto and after some prodding that these are lemmas although it is plainly stated « 3. I count each distinct word form, not lemma. » But I’ll let that pass.)

I’m not sure what the others see but I see that with my 10000-word vocabulary I understand far fewer episodes than someone with a 4000-word vocabulary. With 25000 lemmas, I would hardly understand any episodes.

I seem to be the only person who finds this misleading and totally counterintuitive. Try as I may to follow the many examples subsequently provided and despite learning how to put words into various buckets I still don’t know how many episodes I would theoretically be able to understand with a vocabulary size of 10000.

I hate dragging the work of Paul Nation into all this because it risks confusing the issues but I can at least understand the following table taken from this publication :
https://www.lextutor.ca/tests/nation_beglar_size_2007.pdf

(Sorry, I haven't figured out yet how to format tables here)
Table 1. Vocabulary sizes needed to get 98% coverage (including proper nouns) of various kinds of texts

| Texts | 98% Coverage | Proper Nouns |
|---|---|---|
| Novels | 9,000 word families | 1-2% |
| Newspapers | 8,000 word families | 5-6% |
| Children's Movies | 6,000 word families | 1.5% |
| Spoken English | 7,000 word families | 1.3% |

It seems pretty clear to me. Obviously there is no mention of podcasts but I assume that with my 10000 word families, I’ll be OK.
Edit 1: trying to get the table to format properly

Beli Tsar · Postby **Beli Tsar** » Fri Jul 30, 2021 12:51 pm

s_allard wrote:I’m not sure what the others see but I see that with my 10000-word vocabulary I understand much fewer episodes than someone with a 4000 word vocabulary. With 25000 lemmas, I would hardly understand any episodes.

Just think of each bar as how much more you can understand than the person at the previous level, not as how many podcasts you can understand.

ryanheise · Postby **ryanheise** » Fri Jul 30, 2021 1:17 pm

s_allard wrote:I’m not sure what the others see but I see that with my 10000-word vocabulary I understand much fewer episodes than someone with a 4000 word vocabulary. With 25000 lemmas, I would hardly understand any episodes.

It sounds like you want a cumulative histogram instead of an ordinary histogram:

That is fine, and I have always encouraged people to post their own visualisations below, but the fact that there are different ways of visualising the same data does not change the fact that the data is the same. I have made the data available to you from the beginning so why not create the particular visualisation you seem to prefer and share it? Personally, I prefer the ordinary histogram because it makes the distribution of items between buckets clearer.

I would also remind you that you already asked this same question before and I already answered it:

ryanheise wrote:
s_allard wrote:Finally, it seems to me that vocabulary size has a cumulative effect which is not reflected in this first chart,

Nor the second. But that is completely intentional.

This is fairly inconsequential, though. All of the raw data is there in the spreadsheet, and it is entirely possible to create different visualisations of the same data, e.g. as a cumulative graph if that is what you would find interesting. However, I set out specifically with a goal of placing each document into a bucket, not placing the same document into multiple buckets (i.e. I do not want to place the beginner podcasts into the intermediate bucket just because an intermediate learner could understand them), because I was more interested in the distribution of content among these buckets. I encourage you to create further visualisations and share them below.

Onward:

s_allard wrote:I seem to be the only person who finds this misleading and totally counterintuitive. Try as I may to follow the many examples subsequently provided and despite learning how to put words into various buckets I still don’t know how many episodes I would theoretically be able to understand with a vocabulary size of 10000.

I guess it could be counterintuitive if you are new to histograms. I'm sorry for that, I assumed it was clear, but if you're not clear, hopefully the above picture helps explain that an ordinary histogram shows the distribution between buckets, while a cumulative histogram shows the cumulative values of all buckets up to a specific bucket.
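For anyone who wants to try the cumulative view themselves, here is a minimal sketch of how an ordinary histogram turns into a cumulative one. The bucket counts below are made up purely for illustration and are not taken from the spreadsheet; the variable names are also just placeholders.

```python
from itertools import accumulate

# Hypothetical data: number of episodes whose required vocabulary falls
# into each 1000-word bucket. These counts are invented for illustration.
bucket_sizes = [1000, 2000, 3000, 4000, 5000, 6000]
episodes_per_bucket = [2, 10, 25, 30, 20, 13]

# Ordinary histogram -> cumulative histogram: a running sum over the buckets.
cumulative = list(accumulate(episodes_per_bucket))

for size, count, total in zip(bucket_sizes, episodes_per_bucket, cumulative):
    print(f"{size:>6}: {count:>3} episodes in this bucket, {total:>3} covered in total")
```

Each bar of the ordinary histogram answers "how many episodes first become accessible at this vocabulary size", while the running total answers "how many episodes are accessible at this vocabulary size or below".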

s_allard wrote:I hate dragging the work of Paul Nation into all this because it risks confusing the issues but I can at least understand the following table taken from this publication :
https://www.lextutor.ca/tests/nation_beglar_size_2007.pdf

(Sorry, I haven't figured out yet how to format tables here)

Table 1. Vocabulary sizes needed to get 98% coverage (including proper nouns) of various kinds of texts

| Texts | 98% Coverage | Proper Nouns |
|---|---|---|
| Novels | 9,000 word families | 1-2% |
| Newspapers | 8,000 word families | 5-6% |
| Children’s Movies | 6,000 word families | 1.5% |
| Spoken English | 7,000 word families | 1.3% |

That is not the type of analysis I am doing here. Graph 1 is closest to his work on sorting graded readers. But in any case, if your only issue now is that you prefer a different way of charting or visualising the same data, I have always said, here is all of the data, you are welcome to create and share your own visualisations of it below.

ryanheise · Postby **ryanheise** » Fri Jul 30, 2021 1:54 pm

Beli Tsar wrote:
s_allard wrote:I think I said quite clearly in my own answer that podcasts in the same histogram generally do not need to have identical vocabularies. But what happens when multiple podcasts contain a very large proportion of the listener's vocabulary? So a number of podcasts each contain 5000 unique words and I can fully comprehend each podcast with 98% word coverage. Doesn't that mean that each podcast has the same common 5000 words from my list plus the little two percent of individual variation? What other words can there be?

This is perhaps hypothetically true, but is not a realistic, real-life occurrence. Does any podcast - with 5000 unique words - exist?

Unlike Ryanheise, I don't have a podcast corpus to test this on. But to do a quick and dirty (Bayesian-style?) confirmation I used what I had. I'm a regular public speaker/have a youtube channel etc, not so different from podcasting, so I analysed a few scripts for unique words. There were between 700 and 760 for talks that were roughly around the 20-28 minute mark, and remarkable consistency within that range, so that 760 seems to be a high outlier. These are non-technical talks for normal people, but still, they aren't simplified for language learners, or anything like that. This seems to confirm the idea that we don't use that many unique words in this kind of communication - something that fits with your own philosophy of language learning and vocabulary acquisition, as I understand it from your older posts?

Any podcast with 5000 unique words would have to be either many, many hours long or near-incomprehensible jibberish.

That is an interesting point. I decided to make a graph showing how many unique words each podcast has:

[Image: graph of the number of unique words in each podcast episode]

The greatest number of unique words for a podcast episode was 2,905 unique words, and it took an episode 3.5 hours in length to reach such a high count.
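As a rough illustration of what "unique words" means here, the sketch below counts distinct word forms in a transcript. It is not the code behind the graph: the regex tokeniser and the `unique_words` helper are assumptions made for the example, and no lemmatisation is applied.

```python
import re

def unique_words(text):
    """Return the set of distinct lowercase word forms in `text`.

    Assumption: a word is a run of letters and apostrophes; no lemmatisation.
    """
    return set(re.findall(r"[a-z']+", text.lower()))

# A tiny made-up transcript, for illustration only.
transcript = "The cat sat on the mat. The cat's hat sat on the cat."
types = unique_words(transcript)
print(len(types), sorted(types))
```

With a lemmatiser in front of this step the counts would be somewhat lower, since forms such as "cat" and "cat's" would collapse into one entry.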

s_allard · Postby **s_allard** » Fri Jul 30, 2021 2:28 pm

Beli Tsar wrote:
s_allard wrote:I’m not sure what the others see but I see that with my 10000-word vocabulary I understand much fewer episodes than someone with a 4000 word vocabulary. With 25000 lemmas, I would hardly understand any episodes.

Just think of each bar as how much more you can understand than the person at the previous level, not as how many podcasts you can understand.

Thank you. I knew that it was as simple as that. But the real issue here is shoddy labelling, not the underlying science, which is a separate question. I would have written something like « the following chart plots the differential number of those episodes you would theoretically be able to understand with a vocabulary size of 1000, 2000, 3000, and each subsequent +1000 increment ».

Then change the label on the y-axis of course. In the label and throughout the post I would make clear that we are talking about 98% word coverage not 98% comprehension. This is very important. I would also have used the terms lemmas, lexemes or word types to avoid any confusion over meaning of words.

Then the chart would make a lot of sense and be useful.

What I take away is that there is a sweet spot of around 5000 – 6000 word families for optimum vocabulary size that will cover the great majority of podcasts. That sounds fine to me and, interestingly, aligns pretty well with Nation’s figures once differences in methodology are taken into account.

I do also appreciate the fact that one can look at the same data and put the podcasts into buckets according to the size of vocabulary necessary for 98% coverage.

ryanheise · Postby **ryanheise** » Fri Jul 30, 2021 2:37 pm

I found the movie script for Shrek and ran my analysis on it using the Podcast corpus with the following results:

* It requires a vocabulary of 13,293 for 98% comprehension, with reference to graph 1.
* It has a difficulty score of 31,910 with reference to graph 2.

This comes with the caveat mentioned earlier (we are measuring a movie using a podcast corpus). Still, it is all spoken language, so it is still an interesting result.

I think that when a kid watches Shrek, they do not really aim for 98% comprehension, though. A lot of the dialogue will go over their heads, but that's fine because they can still enjoy what's happening visually.

So let's look at the required vocabulary size again for different comprehension rates:

* 98%: 13,293
* 95%: 5,741
* 90%: 2,187
* 85%: 890
* 80%: 359
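For readers who want to see how figures like these can be derived, here is a minimal sketch of the idea as described in this thread: rank every word by corpus frequency, then find the smallest rank N such that words ranked N or better account for the target share of the document's tokens. The `required_vocab_size` function, the toy `rank` table and the toy document are all hypothetical; this is not the actual analysis code.

```python
def required_vocab_size(doc_tokens, rank, coverage_pct=98):
    """Smallest N such that words of corpus rank <= N cover coverage_pct% of tokens.

    Assumption: words missing from `rank` count as rarer than any ranked word.
    """
    if not doc_tokens:
        raise ValueError("document has no tokens")
    unknown = len(rank) + 1
    ranks = sorted(rank.get(tok, unknown) for tok in doc_tokens)
    # Number of tokens that must be known (integer ceiling, avoids float error).
    needed = (coverage_pct * len(ranks) + 99) // 100
    return ranks[needed - 1]

# Toy frequency ranking (1 = most frequent) and a 10-token document.
rank = {"the": 1, "a": 2, "cat": 3, "sat": 4, "hat": 5, "on": 6, "mat": 7}
doc = ["the"] * 5 + ["cat"] * 3 + ["sat", "mat"]

for pct in (98, 90, 80, 50):
    print(f"{pct}%: {required_vocab_size(doc, rank, pct)}")
```

The steep drop from 98% to lower coverage levels falls out naturally: the last few percent of tokens are the rare words, and it only takes one of them to push the required rank far down the frequency list.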

And if you're interested in what the words were in Shrek that were beyond 98% comprehension, they were:

measuring, gent, homey, hideous, ballad, freshness, caterer, dignified, bachelorette, sparkling, unorthodox, isle, decorator, meteor, firewood, wed, ail, jackass, rescuer, redhead, valiant, reek, huff, magnetism, knights, sonnet, gingerbread, stench, enchantment, saucy, sharpest, leaver, decapitate, dazzling, thine, preposterous, shilling, pitchfork, minty, brimstone, raincoat, damsel, ta, beset, twinge, colada, rickety, veal, steed, stalwart, uninvited, pheromone, deride, cruelly, highness, pocus, asthmatic, chatterbox, hocus, yonder, rotisserie, rescuing, bonehead, parfait, camping, resettlement, slobber, Knights, tartare, eking, tush, compensating, gumdrop, hmph, dolt, backstreet, toadstool, slobbery, housefly, superfly, outdrew, tubbing

It's interesting that "measuring" appeared in that list, even though the words were lemmatised. It turns out that in this instance, measuring was used as a noun, so it was counted as a distinct word: "I'll let you do the measuring when you see him tomorrow."

s_allard · Postby **s_allard** » Fri Jul 30, 2021 2:51 pm

Beli Tsar wrote:...

Unlike Ryanheise, I don't have a podcast corpus to test this on. But to do a quick and dirty (Bayesian-style?) confirmation I used what I had. I'm a regular public speaker/have a youtube channel etc, not so different from podcasting, so I analysed a few scripts for unique words. There were between 700 and 760 for talks that were roughly around the 20-28 minute mark, and remarkable consistency within that range, so that 760 seems to be a high outlier. These are non-technical talks for normal people, but still, they aren't simplified for language learners, or anything like that. This seems to confirm the idea that we don't use that many unique words in this kind of communication - something that fits with your own philosophy of language learning and vocabulary acquisition, as I understand it from your older posts?

Any podcast with 5000 unique words would have to be either many, many hours long or near-incomprehensible jibberish.

Again thanks for the comment. Considering the various clarifications in the course of the thread I don't think this idea is worth pursuing. That said, I noticed of course the small number of unique words in your youtube vids. The numbers align nearly exactly with the figures of Ryanheise in a subsequent post.

This is not surprising to me and really shouldn’t be to anyone. You will only need around 700 words to understand a given podcast or youtube vid but the more podcasts you listen to the more different words you will hear and that’s why we need 5000 words to get good coverage for the majority of podcasts.
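A small sketch of that effect, with three made-up "episodes": each one uses only a handful of distinct words, but the combined vocabulary keeps growing as episodes on new topics are added. The `cumulative_vocabulary` helper and the sample sentences are invented for illustration.

```python
def cumulative_vocabulary(episodes):
    """Yield the size of the combined vocabulary after each episode is added."""
    seen = set()
    for words in episodes:
        seen |= set(words)
        yield len(seen)

# Three invented mini-episodes sharing a common core but differing in topic.
episodes = [
    "we talk about food and cooking".split(),
    "we talk about travel and trains".split(),
    "we talk about music and guitars".split(),
]

per_episode = [len(set(words)) for words in episodes]
combined = list(cumulative_vocabulary(episodes))
print("unique words per episode:", per_episode)
print("combined vocabulary so far:", combined)
```

Each episode needs only six words, yet the listener who wants to follow all three needs ten.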

As for how many words you need to speak, don’t get me started.

Le Baron · Postby **Le Baron** » Fri Jul 30, 2021 3:02 pm

ryanheise wrote:And if you're interested in what the words were in Shrek that were beyond 98% comprehension, they were:

measuring, gent, homey, hideous, ballad, freshness, caterer, dignified, bachelorette, sparkling, unorthodox, isle, decorator, meteor, firewood, wed, ail, jackass, rescuer, redhead, valiant, reek, huff, magnetism, knights, sonnet, gingerbread, stench, enchantment, saucy, sharpest, leaver, decapitate, dazzling, thine, preposterous, shilling, pitchfork, minty, brimstone, raincoat, damsel, ta, beset, twinge, colada, rickety, veal, steed, stalwart, uninvited, pheromone, deride, cruelly, highness, pocus, asthmatic, chatterbox, hocus, yonder, rotisserie, rescuing, bonehead, parfait, camping, resettlement, slobber, Knights, tartare, eking, tush, compensating, gumdrop, hmph, dolt, backstreet, toadstool, slobbery, housefly, superfly, outdrew, tubbing

It's interesting that "measuring" appeared in that list, even though the words were lemmatised. It turns out that in this instance, measuring was used as a noun, so it was counted as a distinct word: "I'll let you do the measuring when you see him tomorrow."

I'm wondering how things like 'housefly' or 'raincoat' make it outside the top level of comprehension. They seem fairly self-evident as compounds of simple words!

luke · Postby **luke** » Fri Jul 30, 2021 4:25 pm

ryanheise wrote:That is an interesting point. I decided to make a graph showing how many unique words each podcast has:

[Image: graph of the number of unique words in each podcast episode]

The greatest number of unique words for a podcast episode was 2,905 unique words, and it took an episode 3.5 hours in length to reach such a high count.

That is a fascinating result!

If all of our data is good, then podcasts - at least those in the ryanheise sample set - may be much easier as a source of comprehensible input.

With the caveat that books have visual text and movies have images, both of which can increase comprehension.

I seem to recall from a Professor Arguelles talk that 5000 words was generally sufficient for day-to-day conversation.

Perhaps the thing with podcasts, compared with "general conversation", is that the topics are often more tightly defined and the overall time is less (even at 3.5 hours).

Just to make that clear: General Conversation: You will talk to a lot of different people. Each idiolect will be part of your input. You have less control over what they may say. And finally, the 5000 words is not for a short period of time, but regular day-after-day-after-day exposure. Plus, in a conversation, you can ask for clarification or the speaker might see you're not understanding and adjust their speech accordingly.

Beli Tsar wrote:But to do a quick and dirty (Bayesian-style?) confirmation I used what I had. I'm a regular public speaker/have a youtube channel etc, not so different from podcasting, so I analysed a few scripts for unique words. There were between 700 and 760 for talks that were roughly around the 20-28 minute mark, and remarkable consistency within that range, so that 760 seems to be a high outlier. These are non-technical talks for normal people, but still, they aren't simplified for language learners, or anything like that. This seems to confirm the idea that we don't use that many unique words in this kind of communication - something that fits with your own philosophy of language learning and vocabulary acquisition, as I understand it from your older posts?

That is so fascinating.

I'm curious about several things: How many talks in a "few"?

What's your channel? Were the words lemmatized to some degree? How many total words end up in a 20-28 minute talk? Is it easy to combine all the talks and have a conclusion like, "My 30 youtubes, which is about 15 hours of content, have 1300 unique words"?

Is our tentative hypothesis that YouTube word frequency is similar to a podcast (based on y'all's results)? That leads to the speculation that perhaps YouTube and podcasts are excellent for narrowing the word frequency compass.
