The statistical distribution of language difficulty

General discussion about learning languages
User avatar
ryanheise
Green Belt
Posts: 459
Joined: Tue Jun 04, 2019 3:13 pm
Location: Australia
Languages: English (N), Japanese (beginner)
x 1681
Contact:

Re: The statistical distribution of language difficulty

Postby ryanheise » Thu Jul 29, 2021 3:42 am

s_allard wrote:
ryanheise wrote:

From an analysis of 40,000 episodes of English-language podcasts, the following chart plots how many of those episodes you would theoretically be able to understand with a vocabulary size of 1000, 2000, 3000, and each subsequent +1000 increment:

This is a rough measurement helped by some simplifying assumptions:
...
1. Comprehension is defined as a minimum of 98% known words.
2. Words are learnt in the order of frequent to infrequent.
3. I count each distinct word form, not lemma.

...


Thanks for the interesting answers to my comments. I don’t agree with all of them, but they are food for thought. Here I want to revisit the statement that I fundamentally disagree with: « how many of those episodes you would theoretically be able to understand with a vocabulary size of 1000, 2000, 3000, and each subsequent +1000 increment ».


You seem to have misunderstood.

[Image: histogram of the number of podcast episodes comprehensible in each 1000-word vocabulary bucket]

"Known words" (x-axis) is your vocabulary size, while the y-axis is how many podcast episodes exist where 98% of the tokens in it are within your vocabulary.

So given the simplifying assumption that the learner is learning words from my set vocabulary list (which in my case is a word frequency list generated from the entire corpus, but in Paul Nation's case is a set vocabulary list for a pre-defined grade of his graded readers), what the graph means is that:

* Knowing the top 1000 most frequent words in the corpus, you will know 98% of the tokens in 15 documents.
* Knowing the top 2000 most frequent words, you will know 98% of the tokens in an additional 105 documents.
* Knowing the top 3000 most frequent words, you will have 98% coverage of an additional 2653 documents.

So on the histogram, the first 3 buckets contain 15, 105 and 2653. That is all.

Just to put it another way, the x-axis describes your vocabulary size (labelled "known words"), while the y-axis describes how many documents you would be able to comprehend with that vocabulary size, using a Paul Nation-style method of estimating comprehensibility.

Let’s take that magic number of 5000 words, or more specifically 5000 unlemmatized tokens. We are not talking about unique words or word families. We’re looking at the number of units that Microsoft Word will display at the bottom of the screen. So the more you write or talk, the larger your word-token count grows, not necessarily the number of different words you know.


The size of the document is irrelevant in Paul Nation's method, which I am using here. You don't simply count how many words are in the document. You start with the learner's set of known words (where the learner is classified into these artificial buckets), and THEN you look at every word in a given document and check whether it is one of the words that this artificial learner knows. Once you reach the end of the document, you look at what percentage of the words in the document were already known by the learner. If that percentage is above a certain comprehensibility threshold, in this case 98%, then we say the document is comprehensible given that vocabulary.

That is what the first graph shows: a basic Paul Nation-style calculation, just using different buckets. The same style of analysis works for different word-counting methods (word families, lemmas, etc.). In that case, you normalise each word before entering it into the bucket's vocabulary list, and you also normalise each document word before looking it up in that list.
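In rough Python terms, the core of that calculation looks something like this (a simplified sketch of the idea, not my actual pipeline; the normalise step is where you would plug in lemmatisation or a word-family lookup if you wanted a different counting method):

Code: Select all

def normalise(word):
    # Just lowercasing here; swap in a lemmatiser or a word-family
    # lookup to change the word-counting method.
    return word.lower()

def is_comprehensible(document_tokens, known_words, threshold=0.98):
    # Nation-style check: is at least `threshold` of the document's
    # running words within the learner's vocabulary?
    tokens = [normalise(t) for t in document_tokens]
    known = sum(1 for t in tokens if t in known_words)
    return len(tokens) > 0 and known / len(tokens) >= threshold

def vocabulary(freq_list, size):
    # The artificial "learner" is defined purely by a cut-off on the
    # frequency list, e.g. the top 3000 most frequent words.
    return {normalise(w) for w in freq_list[:size]}

Graph 1 then just counts, for each 1000-word increment of the frequency list, how many episodes newly pass that check.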

How many unique words are there in that 5000-token count? To simplify things, I’ll use a typical type-token ratio of 40%. This gives us a vocabulary of 2000 word families or unique words.

We are told that with this vocabulary size, and according to the first chart, a speaker could theoretically understand around 9000 podcasts of 5000 word tokens each. In fact we are explicitly told that at this vocabulary sweet spot, the speaker could understand the majority of all the podcasts with 98% word coverage.


Based on the above comments, I will assume this point is cleared up.

There is something wrong here. I won’t rehash the work of Paul Nation, whose computational linguistics credentials may be a bit long in the tooth, but there is a basic methodological issue here that Nation thoroughly explored. To state that with 5000 tokens you can understand all 9000 podcasts of 5000 tokens is to assume that all these podcasts contain identical tokens.


It may be easier if you assume that I am familiar with Paul Nation's work ;-)

But in any case, hopefully the above answers help to clarify the methodology and what the graph depicts. "Known words" describes the learner's vocabulary size, divided into arbitrary buckets according to assumptions.
4 x

s_allard
Blue Belt
Posts: 969
Joined: Sat Jul 25, 2015 3:01 pm
Location: Canada
Languages: French (N), English (N), Spanish (C2 Cert.), German (B2 Cert)
x 2305

Re: The statistical distribution of language difficulty

Postby s_allard » Thu Jul 29, 2021 5:49 am

ryanheise wrote:...
You seem to have misunderstood.

[Image: histogram of the number of podcast episodes comprehensible in each 1000-word vocabulary bucket]

"Known words" (x-axis) is your vocabulary size, while the y-axis is how many podcast episodes exist where 98% of the tokens in it are within your vocabulary.

So given the simplifying assumption that the learner is learning words from my set vocabulary list (which in my case is a word frequency list generated from the entire corpus, but in Paul Nation's case is a set vocabulary list for a pre-defined grade of his graded readers), what the graph means is that:

* Knowing the top 1000 most frequent words in the corpus, you will know 98% of the tokens in 15 documents.
* Knowing the top 2000 most frequent words, you will know 98% of the tokens in an additional 105 documents.
* Knowing the top 3000 most frequent words, you will have 98% coverage of an additional 2653 documents.

So on the histogram, the first 3 buckets contain 15, 105 and 2653. That is all.

Just to put it another way, the x-axis describes your vocabulary size (labelled "known words"), while the y-axis describes how many documents you would be able to comprehend with that vocabulary size, using a Paul Nation-style method of estimating comprehensibility.
....

I’m not trying to be obdurate but let me see if I get this right. Looking at the x-axis of the chart I see the number of my known words, i.e. my vocabulary size measured in unlemmatized tokens taken from the list of the most frequent words of the corpus. And then looking up I see the number of documents where 98% of the word tokens are in my vocabulary. I can therefore understand all those documents. Thus with 5000 word tokens or around 2000 word families, 98% of the words of the majority of the podcasts are in my vocabulary. What have I misunderstood?

The problem I have is how to reconcile these figures with commonly seen figures in vocabulary research such as that of Paul Nation among others. For example, in a well-known paper How Large a Vocabulary Is Needed For Reading and Listening? the very first sentence of the conclusion is:

If we take 98% as the ideal coverage, a 8,000–9,000 word-family vocabulary is needed for dealing with written text, and 6,000–7,000 families for dealing with spoken text.

Using similar terminology, I’m trying to determine what word-family vocabulary size is needed for dealing with podcasts.

I’m not sure that I follow all the methodological subtleties, but I find it hard to explain the huge difference between the two sets of figures. In his study, Paul Nation looked at only two movies, Shrek and Toy Story, whereas here we are looking at 40,000 podcasts. Why do podcasts have such a tiny vocabulary?
1 x

User avatar
ryanheise
Green Belt
Posts: 459
Joined: Tue Jun 04, 2019 3:13 pm
Location: Australia
Languages: English (N), Japanese (beginner)
x 1681
Contact:

Re: The statistical distribution of language difficulty

Postby ryanheise » Thu Jul 29, 2021 7:26 am

s_allard wrote:I’m not trying to be obdurate but let me see if I get this right. Looking at the x-axis of the chart I see the number of my known words, i.e. my vocabulary size measured in unlemmatized tokens taken from the list of the most frequent words of the corpus. And then looking up I see the number of documents where 98% of the word tokens are in my vocabulary. I can therefore understand all those documents. Thus with 5000 word tokens or around 2000 word families, 98% of the words of the majority of the podcasts are in my vocabulary. What have I misunderstood?


Your understanding as expressed above is correct. The misunderstanding earlier was when you suspected that the length of the document had any significance to the results.

The problem I have is how to reconcile these figures with commonly seen figures in vocabulary research such as that of Paul Nation among others. For example, in a well-known paper How Large a Vocabulary Is Needed For Reading and Listening? the very first sentence of the conclusion is:

If we take 98% as the ideal coverage, a 8,000–9,000 word-family vocabulary is needed for dealing with written text, and 6,000–7,000 families for dealing with spoken text.

Using similar terminology, I’m trying to determine what word-family vocabulary size is needed for dealing with podcasts.


I wouldn't attempt to draw comparisons between these two results, though. To do such a comparison, you'd have to use exactly the same method in both cases. Paul Nation's "word families" are not going to be exactly the same as lemmas produced by an automated lemmatiser, and the handling of proper nouns would also have to be done in exactly the same way, etc. From memory, I think they actually identified the word families by hand, and that is probably why they had to restrict their analysis to a limited set of documents. Let's just say it is not my intention to reproduce their research methods exactly for podcasts; I'm far more interested in developing the measure of difficulty, which necessarily factors in things that Paul Nation's approach does not.
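To give a flavour of the mismatch, here is what an automated lemmatiser does with a contrived sentence (spaCy used here purely as an illustration, not necessarily what my pipeline uses):

Code: Select all

import spacy

# A small illustration of why automated lemmas and hand-built word
# families won't line up exactly (spaCy is just an example tool).
nlp = spacy.load("en_core_web_sm")

doc = nlp("The children were running; Shrek ran faster than the toys.")
for token in doc:
    if token.is_alpha:
        print(token.text, "->", token.lemma_, token.pos_)

# "children" -> "child", "were" -> "be", "running" and "ran" -> "run",
# but "Shrek" comes out as a proper noun, which a hand-built
# word-family list might exclude or count differently.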

That said, I do not mind re-doing my analysis from graph 1 using lemmas if you feel that would help you make the comparison you wish to make.

However, based on the above numbers, I have a sneaking suspicion that my data set used for graph 1 may have already been lemmatised (it was created a fair while ago, so there's a chance I did that and forgot).

And since the difficulty-score data is the most recent and probably uses a newer lemmatiser, and I still have the lemmatiser's intermediate output, I can extract the lemmas from that and recreate graph 1 based on this updated data. I expect this data to be more precise, too. Again, I wouldn't expect a perfect comparison, but let's see; I'll share the result once it's done.

In his study, Paul Nation looked at only two movies, Shrek and Toy Story, whereas here we are looking at 40,000 podcasts. Why do podcasts have such a tiny vocabulary?


If anyone can find the movie scripts for Shrek and Toy Story, it may be better for me to simply run those two movies through my own analysis; then we would have a more valid comparison to tell whether podcasts actually do have a smaller vocabulary than movies. It may be true, but I wouldn't be sure of that until running a comparable analysis. If it is true, we can then start to speculate on why that may be.

(The proposed analysis wouldn't necessarily be perfect either, since my corpus would contain 40,000 podcasts and 2 movies, which is a bit skewed.)
5 x

User avatar
luke
Brown Belt
Posts: 1243
Joined: Fri Aug 07, 2015 9:09 pm
Languages: English (N). Spanish (intermediate), Esperanto (B1), French (intermediate but rusting)
Language Log: https://forum.language-learners.org/vie ... 15&t=16948
x 3631

Re: The statistical distribution of language difficulty

Postby luke » Thu Jul 29, 2021 10:39 am

ryanheise wrote:my corpus [...] 40,000 podcasts

This just came together in my head. Your corpus IS the 40,000 podcasts and the frequency list is directly from THOSE podcasts.

So, the word "podcast" is perhaps frequent in your corpus. (from intros, out-tros, etc).

If someone wanted to listen to English podcasts AND they had your frequency list and the difficulty ranking for those podcasts, then they could more easily select "comprehensible input" based on your work.

Well done.

You said all that in your post, but I got lost after seeing "1000.00" in the graphs and the recovery has been slow. :lol:

Most of my muddled thinking was around "where did the frequency list come from?" That originated from my experience with a Spanish frequency dictionary where the word "French" was word 646. I get it: french fries, french kiss, french poodle, etc. But it was still surprising. Likely "french" has a different ranking in your 40,000-podcast corpus.

And of course all you guys brag about the size of your corpus. ;)

The weakness of frequency dictionaries, as I hear senior forum members say, is the corpus itself.

I still think they're helpful.

So, now a tangent. Have the device, kindle, whatever, generate the frequency list for whatever corpus the user specifies. This may be a relatively easy feature if the default "corpus" is always "this book" and 500 or so "connector words" that are in every book are filtered out of the result set by default. Thus, for language learners, a "create a frequency list and translation for the top N words in this book" feature.
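Rough sketch of the idea (names and the connector-word mechanism made up for illustration, not any real e-reader feature):

Code: Select all

import re
from collections import Counter

def book_frequency_list(book_text, connector_words, top_n=100):
    # Frequency list for a single-book "corpus", with the most common
    # function words filtered out. Sketch only.
    tokens = re.findall(r"[a-z']+", book_text.lower())
    counts = Counter(t for t in tokens if t not in connector_words)
    return counts.most_common(top_n)

# connector_words would be the ~500 words that appear in every book,
# e.g. taken from the top of a general frequency list.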

Fondly remembering my first semester Physics professor explaining "significant digits" and that he would mark wrong any answer that had more significant digits than was reasonable based on the input. That's the most memorable thing I got out of that class, which I enjoyed very much. It has been more impactful than knowing F=MA or Ohm's law, but I'm not a physicist. :)
2 x

s_allard
Blue Belt
Posts: 969
Joined: Sat Jul 25, 2015 3:01 pm
Location: Canada
Languages: French (N), English (N), Spanish (C2 Cert.), German (B2 Cert)
x 2305

Re: The statistical distribution of language difficulty

Postby s_allard » Thu Jul 29, 2021 12:43 pm

ryanheise wrote:
s_allard wrote:I’m not trying to be obdurate but let me see if I get this right. Looking at the x-axis of the chart I see the number of my known words, i.e. my vocabulary size measured in unlemmatized tokens taken from the list of the most frequent words of the corpus. And then looking up I see the number of documents where 98% of the word tokens are in my vocabulary. I can therefore understand all those documents. Thus with 5000 word tokens or around 2000 word families, 98% of the words of the majority of the podcasts are in my vocabulary. What have I misunderstood?


Your understanding as expressed above is correct. The misunderstanding earlier was when you suspected that the length of the document had any significance to the results.

...


Thanks for taking the time to answer my persistent questions and for the clarifications. I’ll put aside further comparisons with Nation’s work because, as was rightly pointed out, there are some methodological issues that only muddy the waters. But I do want to come back to what I see as a major flaw in the analysis at hand.

You have confirmed my reading of the chart: with a vocabulary of just 5000 unlemmatized words I will have 98% coverage and hence full comprehension of over 20,000 podcasts. Doesn’t this mean that all these podcasts must have the same vocabulary, i.e. identical to my 5000 known words?

But aren’t those 20,000 podcasts individually unique, with differences in content that are reflected in the vocabulary? I recognize that, since the words are not lemmatized, a large number of the most frequent words will be identical, but surely not 98% of the respective words of over 20,000 podcasts.

What I do believe we are seeing here in the chart is that many podcasts have the same number of unlemmatized words, probably because they are of similar length, typically around 30 minutes. The chart may be correct, but to conclude that podcasts of similar word count have identical words is fundamentally incorrect.

For some classes of mine, I had transcribed three children’s stories in French from YouTube videos, all narrated by the same voice: Ali Baba et les 40 voleurs (Ali Baba and the 40 Thieves), Le lièvre et la tortue (The Hare and the Tortoise) and Le petit chaperon rouge (Little Red Riding Hood). I used the first 8 minutes of the recordings so the word counts are roughly the same.

I won’t bother doing any analysis other than to say that while the count of unlemmatized words is nearly identical (let’s say X), each story obviously has some unique vocabulary. To understand all three stories requires more than just X words. In fact, the more stories I want to read, the more vocabulary I need, which is why reading a wide range of works is so important.

The point of all this is that 5000 unlemmatized words are certainly not enough to give 98% coverage of over 20,000 podcasts.
1 x

User avatar
luke
Brown Belt
Posts: 1243
Joined: Fri Aug 07, 2015 9:09 pm
Languages: English (N). Spanish (intermediate), Esperanto (B1), French (intermediate but rusting)
Language Log: https://forum.language-learners.org/vie ... 15&t=16948
x 3631

Re: The statistical distribution of language difficulty

Postby luke » Thu Jul 29, 2021 1:51 pm

s_allard wrote:You have confirmed my reading of the chart: with a vocabulary of just 5000 unlemmatized words I will have 98% coverage

If that is indeed what was said, I understand your skepticism.
s_allard wrote: and hence full comprehension of over 20,000 podcasts.

I don't know that he went to your "hence" conclusion, though. If "full comprehension of over 20,000 podcasts" in a 40,000-podcast corpus was indeed implied, I'd think it would be a data-quality issue or prematurity in the analysis.

But I definitely applaud the original poster. Putting together this sort of analysis isn't simple or easy. I'd be bothered if I created and shared it and people criticized "my newborn baby". Pardon me for being one of the critics.
1 x

User avatar
ryanheise
Green Belt
Posts: 459
Joined: Tue Jun 04, 2019 3:13 pm
Location: Australia
Languages: English (N), Japanese (beginner)
x 1681
Contact:

Re: The statistical distribution of language difficulty

Postby ryanheise » Thu Jul 29, 2021 2:23 pm

luke wrote:
ryanheise wrote:my corpus [...] 40,000 podcasts

This just came together in my head. Your corpus IS the 40,000 podcasts and the frequency list is directly from THOSE podcasts.

So, the word "podcast" is perhaps frequent in your corpus. (from intros, out-tros, etc).


That's correct :) It's ranked the 140th most frequent. Here are the top 150 words (this many can at least fit in this comment without taking up too much space):

be, and, you, the, I, to, that, a, of, it, we, have, do, in, not, they, so, like, know, for, this, go, on, but, what, just, with, get, there, he, think, about, or, can, my, if, all, as, people, really, thing, at, say, because, yeah, out, want, right, when, how, will, then, would, time, up, from, um, make, she, some, see, more, work, now, kind, come, well, lot, who, way, talk, where, look, take, very, need, mean, other, those, into, something, no, start, year, these, uh, good, back, could, by, which, actually, here, feel, also, little, even, day, much, use, give, try, life, first, let, love, through, different, help, okay, find, tell, oh, great, over, one, put, call, happen, why, business, always, show, maybe, down, bit, any, than, many, every, sure, still, around, guy, most, part, again, off, before, podcast, should, point, question, only, new, change, today, same, stuff, learn

If someone wanted to listen to English podcasts AND they had your frequency list and the difficulty ranking for those podcasts, then they could more easily select "comprehensible input" based on your work.

Well done.


Thanks! Yes, that's exactly what I'm hoping to achieve.

So, now a tangent. Have the device, kindle, whatever, generate the frequency list for whatever corpus the user specifies. This may be a relatively easy feature if the default "corpus" is always "this book" and 500 or so "connector words" that are in every book are filtered out of the result set by default. Thus, for language learners, a "create a frequency list and translation for the top N words in this book" feature.


I actually generate a mini corpus for each podcast episode before combining them to form the global corpus, so I do have that data. But it may or may not be practical to build this as some sort of app. At the moment, it's all just files on my hard drive. We're talking about a huge amount of data, and also a huge amount of processing. That's not something that can fit on a kindle (or whatever), so it would need to be hosted, and the hosting costs would be thousands of dollars per month at a minimum. Instead, what I have in mind to start with is to just upload the results of all my analysis, which is the sorted list of podcasts. But I'll investigate the hosting costs and see what can be done.
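Conceptually, the mini-corpus step is nothing more than this (a sketch of the idea, not my actual code):

Code: Select all

from collections import Counter

def episode_counts(transcript_tokens):
    # The "mini corpus" for one episode: its own word-frequency counts.
    return Counter(transcript_tokens)

def global_counts(all_episode_counts):
    # The global corpus is just the sum of the per-episode counters;
    # the global frequency list is its most_common() ordering.
    total = Counter()
    for counts in all_episode_counts:
        total += counts
    return total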

s_allard wrote:But I do want to come back to what I see as a major flaw in the analysis at hand.

You have confirmed my reading of the chart: with a vocabulary of just 5000 unlemmatized words I will have 98% coverage and hence full comprehension of over 20,000 podcasts.


I've checked and it is actually lemmatised.

Doesn’t this mean that all these podcasts must have the same vocabulary, i.e. identical to my 5000 known words?


No, because each histogram bucket is actually a span: the first bucket covers vocabulary sizes from 0 to 1000, the second covers 1001 to 2000, and so on. If you make the buckets smaller, you will get fewer matching podcasts in each bucket.
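Expressed as code, the bucketing is roughly this (a sketch, assuming the episode tokens are normalised the same way as the frequency list):

Code: Select all

import math

def minimum_vocabulary_size(episode_tokens, rank):
    # Smallest frequency-list prefix covering at least 98% of the tokens.
    # `rank` maps each word to its (1-based) position in the frequency list.
    ranks = sorted(rank.get(t, float("inf")) for t in episode_tokens)
    return ranks[math.ceil(0.98 * len(ranks)) - 1]  # inf = never reaches 98%

def bucket(vocab_size, width=1000):
    # Sizes 1-1000 fall in bucket 1, 1001-2000 in bucket 2, and so on.
    return math.ceil(vocab_size / width)

The histogram is then just a count of episodes per bucket.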

What I do believe we are seeing here in the chart is that many podcasts have the same number of unlemmatized words, probably because they are of similar length, typically around 30 minutes.


Aside from very short episodes, where there will be some noise in the data, there will be no significant difference in the results based on episode length. This is because we're dealing with ratios. At a comprehension rate of 98%, we tolerate 2 words in every 100 that are beyond our vocabulary. No matter how long the podcast episode is, as long as this rate or ratio is within your limits, you will find it equally tolerable. That is, a 1000-word podcast with 20 unknown words will be just as comprehensible to you as a 2000-word podcast with 40 unknown words. The length does not affect the result; your comprehension rate is still 98% in both cases.

If you're more of a visual person, then consider the following patterns of easy "o" and difficult "x" words:

1. o o o o o o o o x o
2. o o o o o x o o o o o o o x o o o o o o o o o o o o o o x o o x o o o o o o o o

At a target comprehension rate of 90%, both will be equally comprehensible to you because for each x you hear, there will be 9 o's that you hear, and that means you will understand 90% of the words and be happy.
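Putting numbers on it (just spelling out the same arithmetic):

Code: Select all

def comprehension_rate(total_words, unknown_words):
    return (total_words - unknown_words) / total_words

print(comprehension_rate(1000, 20))  # 0.98
print(comprehension_rate(2000, 40))  # 0.98
print(comprehension_rate(10, 1))     # 0.9  (pattern 1 above)
print(comprehension_rate(40, 4))     # 0.9  (pattern 2 above)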
3 x

Beli Tsar
Green Belt
Posts: 384
Joined: Mon Oct 22, 2018 3:59 pm
Languages: English (N), Ancient Greek (intermediate reading), Latin (Beginner) Farsi (Beginner), Biblical Hebrew (Beginner)
Language Log: https://forum.language-learners.org/vie ... =15&t=9548
x 1294

Re: The statistical distribution of language difficulty

Postby Beli Tsar » Thu Jul 29, 2021 3:21 pm

The major scepticism you are receiving seems to centre on the apparently small vocabulary your analysis says is required. Obviously, you've dealt with the major portion of that methodologically, especially by pointing out that it is lemmatized. Does the frequency list below demonstrate the other reason?
ryanheise wrote:be, and, you, the, I, to, that, a, of, it, we, have, do, in, not, they, so, like, know, for, this, go, on, but, what, just, with, get, there, he, think, about, or, can, my, if, all, as, people, really, thing, at, say, because, yeah, out, want, right, when, how, will, then, would, time, up, from, um, make, she, some, see, more, work, now, kind, come, well, lot, who, way, talk, where, look, take, very, need, mean, other, those, into, something, no, start, year, these, uh, good, back, could, by, which, actually, here, feel, also, little, even, day, much, use, give, try, life, first, let, love, through, different, help, okay, find, tell, oh, great, over, one, put, call, happen, why, business, always, show, maybe, down, bit, any, than, many, every, sure, still, around, guy, most, part, again, off, before, podcast, should, point, question, only, new, change, today, same, stuff, learn

This is, very obviously, a frequency list that matches daily conversation: 'yeah... um... uh... good... actually... oh... great' are filler words typical of conversation. Half the attraction of podcasts is that they are very down-to-earth and conversational in tone, even the ones that aren't actually conversations, and there are plenty of those. They are much less scripted than most recorded dialogue.

And we already know that normal daily conversation uses a smaller range of vocabulary than books or films.

So is it any surprise that they require less vocabulary than the carefully crafted scripts of Toy Story and Shrek? Sure, those are kids' films, but as has often been noted on the forum, kids' content isn't necessarily simpler in the ways we think. And those two films both contain a range of specialist vocabulary, at least as much as a podcast might, whether it is words for toys or for medieval fantasy.

If your analysis is right, have you already demonstrated one valuable thing: that podcasts, because they are conversational, are actually a great place for language learners to start listening, much better than films? Perhaps we knew that already, and of course we'd need to run the numbers on an equivalent bucket of films to be certain, but might that be a plausible interpretation at least?
8 x

s_allard
Blue Belt
Posts: 969
Joined: Sat Jul 25, 2015 3:01 pm
Location: Canada
Languages: French (N), English (N), Spanish (C2 Cert.), German (B2 Cert)
x 2305

Re: The statistical distribution of language difficulty

Postby s_allard » Thu Jul 29, 2021 4:05 pm

ryanheise wrote:...

s_allard wrote:But I do want to come back to what I see as a major flaw in the analysis at hand.

You have confirmed my reading of the chart: with a vocabulary of just 5000 unlemmatized words I will have 98% coverage and hence full comprehension of over 20,000 podcasts.


I've checked and it is actually lemmatised.

Doesn’t this mean that all these podcasts must have the same vocabulary, i.e. identical to my 5000 known words?


No, because each histogram bucket is actually a span: the first bucket covers vocabulary sizes from 0 to 1000, the second covers 1001 to 2000, and so on. If you make the buckets smaller, you will get fewer matching podcasts in each bucket.
....


I hate to be seen as nagging here but I really have a problem understanding just exactly how many documents I can understand with a vocabulary size of 5000 word types or families.

By the way, the fact that the lists are lemmatized changes much of the discussion.

When I look at the chart it’s difficult to see the exact number of podcasts that correspond to a 5000-word vocabulary size. It would seem to be about 8000 podcasts. As someone else has mentioned, I, the language learner, could choose from these 8000 documents and be sure that they are at my level, i.e. my vocabulary size will give 98% coverage of any randomly chosen document from the corresponding histogram bucket.

Now that we are using lemmatized lists, I find these values more realistic, but the fundamental problem remains. Don’t the 8000 documents each have to contain nearly identical words for my individual list to give me 98% coverage in all 8000 documents? How could it be otherwise?

If the 8000 podcasts have identical vocabulary, then this would be a great learning tool. I could learn all the words of just one podcast and then be sure to know 98% of the words of all the remaining 7999 podcasts, in addition to all the other podcasts of smaller size.

This would be the very opposite of the more traditional approach where we build vocabulary by reading or listening widely until we achieve that 98% threshold where we can understand a new document easily.

But if those podcasts have individual vocabulary differences, as I suspect, despite having identical vocabulary sizes, then it’s a whole different matter: we are looking at the aggregate vocabulary required to achieve 98% coverage of all 8000 podcasts.

Finally, I do want to emphasize that my persistent questioning does not in any way take away from my admiration for the excellent work that has gone into all this. My concern is really with the concrete applications and the pedagogical implications.
0 x

User avatar
rdearman
Site Admin
Posts: 7231
Joined: Thu May 14, 2015 4:18 pm
Location: United Kingdom
Languages: English (N)
Language Log: viewtopic.php?f=15&t=1836
x 23127
Contact:

Re: The statistical distribution of language difficulty

Postby rdearman » Thu Jul 29, 2021 4:26 pm

s_allard wrote:I hate to be seen as nagging here but I really have a problem understanding just exactly how many documents I can understand with a vocabulary size of 5000 word types or families.

I think you are confused.
s_allard wrote:Don’t the 8000 documents each have to contain nearly identical words for my individual list to give me 98% coverage in all 8000 documents?

Let's take some fictional examples.

Podcast A uses only 50% of your known vocabulary: you know lots of vocabulary, but this podcast only used half of it. However, there was a percentage you didn't know, say 2%, made up of words like analogue, assistive technology, attachment, back-end, backward compatible and bandwidth.

Podcast B is much harder and uses nearly every word in your vocabulary for its 98% coverage, and the 2% you didn't know were things like domesticate, ecology, ecosystem, environment, enzymes and proteins, i.e. words you didn't know, but not the same words you didn't know in Podcast A.

Podcast C is for children and only used 10% of your known vocabulary, and there weren't any words you didn't know.

So your vocabulary covered at least 98% of Podcasts A, B and C. Some use more of your known words, some use less. So you have 8000 podcasts where your block of vocabulary covers 98% of the words, with up to 2% that it does not cover. Your block of vocabulary didn't change, but the amount required from that block for each podcast is different.
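If it helps, here is the same fictional example spelled out in code (made-up words and numbers, obviously):

Code: Select all

# Your fixed block of roughly 5000 known words (made-up tokens).
known = {f"w{i}" for i in range(5000)}

# Podcast A: draws on half of your vocabulary, plus 2% tech words you don't know.
podcast_a = [f"w{i % 2500}" for i in range(4900)] + ["analogue", "bandwidth"] * 50

# Podcast B: ranges over nearly all of your vocabulary, plus 2% biology words.
podcast_b = [f"w{i % 5000}" for i in range(4900)] + ["ecology", "enzymes"] * 50

# Podcast C: a children's show using only 10% of your vocabulary, nothing unknown.
podcast_c = [f"w{i % 500}" for i in range(1000)]

def coverage(tokens):
    return sum(t in known for t in tokens) / len(tokens)

for name, tokens in [("A", podcast_a), ("B", podcast_b), ("C", podcast_c)]:
    print(name, coverage(tokens))
# Prints A 0.98, B 0.98, C 1.0: same block of known words, different podcasts,
# different unknown words in each, all at or above the 98% threshold.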
6 x

