A language learners’ forum

Posted: **Sun Jul 25, 2021 2:28 pm**

If we examine every book, every movie, or every utterance in some given language, and divide all of that language up into buckets of different difficulty levels, how would the language be distributed across those difficulty buckets? Is most language intermediate? Is most language advanced? How much of naturally occurring language is beginner level?

From an analysis of 40,000 episodes of English speaking podcasts, the following chart plots how many of those episodes you would theoretically be able to understand with a vocabulary size of 1000, 2000, 3000, and each subsequent +1000 increment:

This is a rough measurement helped by some simplifying assumptions:

1. The meaning of comprehension is baked in at 98% minimum known words.
2. Words are learnt in the order of frequent to infrequent.
3. I count each distinct word form, not lemma.
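The three assumptions above can be sketched in code. This is only an illustration under those assumptions, not the script behind the chart: the function names (`build_rank`, `vocab_needed`, `bucket`) and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def build_rank(corpus_docs):
    """Map each word form to its frequency rank in the corpus (1 = most frequent)."""
    counts = Counter(word for doc in corpus_docs for word in doc)
    # Ties are broken alphabetically so the ranking is deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 1 for i, word in enumerate(ordered)}

def vocab_needed(doc, rank, coverage_pct=98):
    """Smallest N such that knowing the N most frequent word forms covers
    at least coverage_pct percent of the tokens in a non-empty doc.
    Assumes every word in doc also appears in the corpus."""
    ranks = sorted(rank[word] for word in doc)
    # Integer ceiling of coverage_pct% of the token count (avoids float rounding).
    must_know = (coverage_pct * len(ranks) + 99) // 100
    return ranks[must_know - 1]

def bucket(vocab_size, width=1000):
    """Round a vocabulary size up to the next multiple of width."""
    return math.ceil(vocab_size / width) * width

# Toy corpus of three tokenised "episodes" (made-up data).
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]
rank = build_rank(corpus)
for doc in corpus:
    size = vocab_needed(doc, rank)
    print(doc, "needs", size, "word forms, bucket", bucket(size))
```

Counting how many episodes land in each bucket would then give a histogram of the kind described above.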

The full spreadsheet is here if anyone would like to play with the data or create different visualisations of it. Or if you'd like me to rerun the analysis with different assumptions, let me know below.

Some takeaways:

1. There is a relative shortage of natural content that is simple enough for beginners to understand.
2. Learning vocabulary is more rewarding for the first 5,000 words, after which a natural plateau is reached.
3. The long tail after the 5,000 word mark constitutes the majority of content.

The second chart below shows a similar result, but using a computed notion of language difficulty (which includes vocabulary size as just one factor among others):

There is little difference to notice visually between the two graphs, except that if you were to look at the list of actual podcast episodes in sorted order, the two lists would be in different orders, because the episode that requires the smallest vocabulary is not necessarily going to be the easiest podcast (e.g. if it has difficult grammar). So if we want to know which podcast would be most suitable for me to listen to next at my current level, that is where the difficulty scores in the second graph might be more interesting.

This analysis follows on from a previous post, "SRS vs natural repetition", which was based on the same data set. It's been a while since I've had a look at this data set (more due to health issues than laziness), but I suddenly felt like resurrecting it and ultimately getting it to a point where I can repeat the analysis for other languages, and publish the lists so that anyone can use them to find suitable podcasts to listen to.

Posted: **Mon Jul 26, 2021 2:52 am**

Just want to say I just love all your experiments and analyses!

They're always so interesting and lead to many interesting discussions!

Posted: **Mon Jul 26, 2021 10:58 am**

ryanheise wrote: The second chart below shows a similar result, but using a computed notion of language difficulty (which includes vocabulary size as just one factor among others):

Did you use Flesch-Kincaid, Dale-Chall or something more recent?

Posted: **Mon Jul 26, 2021 1:47 pm**

Cainntear wrote: Did you use Flesch-Kincaid, Dale-Chall or something more recent?

I'm using my own formula which I may have described elsewhere on this forum, but as I'm using a combination of standard techniques that all computational linguists would know, I doubt it's anything radically novel.

Basically, I assume that a word is easier if it occurs more frequently in my corpus. I also consider how often a word occurs within a document, and how many other documents that word occurs in. I then assign a difficulty score to each word in the document, and then combine all of that into a single score. There are various approaches you could take: arithmetic mean, geometric mean, etc. But I look at the distribution of difficult words throughout the document. If they are more evenly distributed, that makes the overall document more difficult because the difficult words have infected more sentences. The assumption being that the overall comprehensibility of a document is not correlated to how many words you understand, but how many sentences you understand.
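To make the point about distribution concrete, here is a toy sketch. It is not the actual formula (which is not spelled out in this thread). It assumes each word already has a numeric difficulty score and compares a plain arithmetic mean with the fraction of sentences that contain at least one difficult word. The names `mean_word_difficulty` and `hard_sentence_fraction`, the threshold, and the scores are all hypothetical.

```python
from statistics import mean

def mean_word_difficulty(doc):
    """Arithmetic mean of all word scores; ignores where the hard words fall."""
    return mean(score for sentence in doc for score in sentence)

def hard_sentence_fraction(doc, threshold):
    """Fraction of sentences containing at least one word scored above threshold."""
    hard = sum(1 for sentence in doc if any(s > threshold for s in sentence))
    return hard / len(doc)

# Each document is a list of sentences; each sentence is a list of
# per-word difficulty scores (made-up numbers, higher = harder).
clustered = [[1, 1, 1], [1, 1, 1], [9, 9, 1]]   # both hard words in one sentence
spread    = [[9, 1, 1], [1, 9, 1], [1, 1, 1]]   # hard words in separate sentences

print(mean_word_difficulty(clustered), mean_word_difficulty(spread))
print(hard_sentence_fraction(clustered, 5), hard_sentence_fraction(spread, 5))
```

Both documents have the same mean word difficulty, but in the second one the difficult words touch twice as many sentences, which is the effect described above.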

To do this, I built a corpus that provides the relevant frequency data for each word after it has been lemmatised.

On the grammar side, I transform the document into grammatical symbols and repeat the same sort of analysis as above.

I do not currently factor in sentence length, word length, or anything like that. At least in my experience as an adult language learner, I find I am able to cope with long sentences in a foreign language, and it is primarily my familiarity with the words and the grammar in the sentence that dictates whether I will comprehend it. I can understand that children would struggle with sentence length, particularly when it results from recursive grammatical structures, and I think that is a matter of brain development. No doubt sentence length is a factor, but just a less significant factor for me. I guess my formula was tailored to myself. (Perhaps I could do something to measure the grammatical recursion rather than sentence length, though.)

There are various choices and assumptions I have made in the above algorithm, assumptions that would eventually need to be tested. When it gets to that point, I may use machine learning techniques to solve for the optimal set of assumptions.

Finally, I am using some techniques that I borrowed from the way search engines like Google work to score and rank content, and there is a happy overlap between ranking for relevancy/importance and ranking for difficulty that I hope to exploit in making this database searchable. So for example, it would be useful to be able to find content that is both comprehensible (not difficult) and interesting (relevant).

Posted: **Mon Jul 26, 2021 3:32 pm**

ryanheise wrote: From an analysis of 40,000 episodes of English speaking podcasts, the following chart plots how many of those episodes you would theoretically be able to understand with a vocabulary size of 1000, 2000, 3000, and each subsequent +1000 increment:

Some takeaways:

3. The long tail after the 5,000 word mark constitutes the majority of content.

You've done some very cool and interesting things with the data!

For me, the charts would be a little easier to understand if the decimal fraction wasn't displayed. E.G. 1000 rather than 1000.00.

I'm very curious how you got takeaway #3. Eyeballing the chart makes it seem like before 5000 is as big (or bigger) than all the rest. "majority" only has to be 50.01% and it's easy to have the eye fool the mind. Do you have a percentage number for takeaway 3?

Also, curious if the podcasts were all converted to text by a Speech to Text app. I assume so. Also, that technology is very good these days. But can you comment?

Posted: **Mon Jul 26, 2021 4:29 pm**

luke wrote:
ryanheise wrote: Some takeaways:

3. The long tail after the 5,000 word mark constitutes the majority of content.

You've done some very cool and interesting things with the data!

For me, the charts would be a little easier to understand if the decimal fraction wasn't displayed. E.G. 1000 rather than 1000.00.

I couldn't figure out how to make Google Sheets format the x-axis that way, but you are welcome to try yourself.

The spreadsheet is linked above.

I'm very curious how you got takeaway #3. Eyeballing the chart makes it seem like before 5000 is as big (or bigger) than all the rest. "majority" only has to be 50.01% and it's easy to have the eye fool the mind. Do you have a percentage number for takeaway 3?

Yes, it is a bit of a mind warp. In reality, the tail constitutes 52% of all content. That is, if you stack each of the buckets before 5000 on top of each other, and stack each of the buckets after 5000 on top of each other, those two towers would be 48% and 52% respectively. Feel free to make a copy of the doc and use a COUNTIF to verify.
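For anyone who prefers a script to COUNTIF, the same check can be sketched as below. The bucket counts are made-up placeholders, not the spreadsheet's values, and treating the 5000 bucket as part of the "before" tower is an assumption; `tail_share` is a hypothetical helper name.

```python
def tail_share(bucket_counts, cutoff=5000):
    """Share of all episodes that fall in buckets above the cutoff."""
    total = sum(bucket_counts.values())
    tail = sum(n for size, n in bucket_counts.items() if size > cutoff)
    return tail / total

# Made-up bucket counts (vocabulary size -> number of episodes).
# Replace these with the real column from the spreadsheet.
example_counts = {
    1000: 1, 2000: 5, 3000: 12, 4000: 15, 5000: 15,
    6000: 14, 7000: 11, 8000: 9, 9000: 7, 10000: 6, 11000: 5,
}
print(f"tail share: {tail_share(example_counts):.0%}")
```

With many small buckets in the tail, the tail can add up to more than half of the total even though each individual bar looks short.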

I would also add that I actually truncated the tail when uploading the data because otherwise the tail would have been much longer, to the point where the interesting part of the graph would have been compressed into oblivion.

Also, curious if the podcasts were all converted to text by a Speech to Text app. I assume so. Also, that technology is very good these days. But can you comment?

When I was analysing Japanese and Korean podcasts before, I used Google's speech-to-text service and it does cost money (luckily I had free access at that time). However, fortunately many English podcasts come with ready-made transcripts, which is how I was able to do the above analysis on such a large scale. I will eventually want to get back to Japanese since that's what I'm ultimately interested in. I would actually be interested in starting an effort to assist creators of foreign language podcasts to create transcripts like what we have in the English speaking podcast world. Maybe through crowdsourcing, or maybe through sponsoring the creator through Patreon so that they can afford to pay a transcription service. I suspect that most podcast creators would not have the interest to make transcripts on their own, but on the other hand, I'm sure the language learning community would be motivated to help out since we are the ones who would really benefit from that.

Just FYI, there are already many German podcasts and Spanish podcasts with transcripts so once I refine my analysis, I can repeat it for those two languages immediately. For Japanese (due to personal interest in that language), I'm willing to fork out a little bit of money to pay for automated transcription, and maybe I can email all of those Japanese podcast creators and send them their transcripts in case they want to link to them.

Posted: **Wed Jul 28, 2021 1:34 pm**

ryanheise wrote: If we examine every book, every movie, or every utterance in some given language, and divide all of that language up into buckets of different difficulty levels, how would the language be distributed across those difficulty buckets? Is most language intermediate? Is most language advanced? How much of naturally occurring language is beginner level?

From an analysis of 40,000 episodes of English speaking podcasts, the following chart plots how many of those episodes you would theoretically be able to understand with a vocabulary size of 1000, 2000, 3000, and each subsequent +1000 increment:

This is a rough measurement helped by some simplifying assumptions:
...
1. The meaning of comprehension is baked in at 98% minimum known words.
2. Words are learnt in the order of frequent to infrequent.
3. I count each distinct word form, not lemma.

...
Some takeaways:

1. There is a relative shortage of natural content that is simple enough for beginners to understand.
2. Learning vocabulary is more rewarding for the first 5,000 words, after which a natural plateau is reached.
3. The long tail after the 5,000 word mark constitutes the majority of content.

....

Kudos for some very nice work but I have a couple of questions of a theoretical and methodological nature. First of all is the idea that vocabulary token size is an accurate indicator of difficulty, especially in a non-lemmatized corpus. In such a corpus the word token count is totally dependent on the length of the sample. This does not tell us how many different word types or word families are present.

This question of word types is particularly important when evaluating the difficulty of vocabulary. In most vocabulary studies the assumption is that the distribution of word types in a sample is similar to the frequency distribution of word types in the language. For example a 1000-word (type) document would contain a lot of the 1000 most frequent words in the language and is therefore much easier to understand than a 5000-word (type) document that uses more infrequent words.

If we are talking about just raw token count where the document length is unknown, this relationship – which in itself is rather approximate – breaks down nearly completely. As can be easily imagined, a 1000-word (token) technical document could be harder to understand than a 5000-word (token) document of easy content.

This raises of course the issue of sample genre. 40000 podcasts represent a very wide universe. We can imagine the vast array of subjects that all have some specific vocabulary that must be understood.

So what we see in the first chart is the distribution of documents by word token count and not much more. I find it hard to interpret in terms of vocabulary levels required to understand a certain number of podcasts, all the more so because we are talking about raw tokens and not word types or families. Just what does it mean to have a vocabulary of 5000 word tokens?

Finally, it seems to me that vocabulary size has a cumulative effect which is not reflected in this first chart.

All of these issues have been analyzed in great depth by Paul Nation and others in their work on optimal vocabulary size for various language genres including movies and academic writing. I suggest the OP should have a look starting with https://www.wgtn.ac.nz/lals/resources/paul-nations-resources.

Posted: **Wed Jul 28, 2021 4:59 pm**

s_allard wrote: Kudos for some very nice work but I have a couple of questions of a theoretical and methodological nature.

Thanks for the questions! I appreciate the discussion.

First of all is the idea that vocabulary token size is an accurate indicator of difficulty, especially in a non-lemmatized corpus.

Don't worry, I'm not doing that. I'm using a lemmatised corpus when estimating difficulty. That is introduced by the second graph.

If you're looking at the first graph, which is not lemmatised (but at the same time does not estimate difficulty), as stated, that was just for the purpose of doing a very simple analysis of how many documents contain 98% known words (tokens) for each given bucket of (token) vocabulary size. I mentioned in the original post that I do not think that is the same thing as difficulty. But the point is that the distribution curve will actually be roughly the same no matter what word counting method you use, it will just be on a different scale or coordinate system so to speak. This is inherent in the nature of the information being processed rather than the processing method. That is, there will still be a shortage of beginner content, and there will still be a long tail of advanced content. Or putting it another way, there is a hard bound on how easy content can get, but there is no bound on how difficult content can get. Every year people churn out new Ph.D. dissertations which are continually inventing new and rare technical terms.

While this is apparently the situation for natural content, I suspect that it's the opposite situation for teaching material: here, there is an over-abundance of books for beginners (maybe because it has a larger target market). I am however interested in natural content, and although the situation is not great for beginner learners, I hope to eventually rank a massive number of podcast episodes so that those needles in the haystack will be sorted and collected all in one place where they will be a bit easier to find.

In such a corpus the word token count is totally dependent on the length of the sample. This does not tell us how many different word types or word families are present.

This question of word types is particularly important when evaluating the difficulty of vocabulary. In most vocabulary studies the assumption is that the distribution of word types in a sample is similar to the frequency distribution of word types in the language. For example a 1000-word (type) document would contain a lot of the 1000 most frequent words in the language and is therefore much easier to understand than a 5000-word (type) document that uses more infrequent words.

If we are talking about just raw token count where the document length is unknown, this relationship – which in itself is rather approximate – breaks down nearly completely. As can be easily imagined, a 1000-word (token) technical document could be harder to understand than a 5000-word (token) document of easy content.

Don't worry, I am not talking about "just raw token count where the document length is unknown". To use even the two basic methods I gave in my previous post (describing the difficulty calculation), both the geometric mean and arithmetic mean would for example treat the 5000-word "easy" document as having roughly the same difficulty as the first 1000 words alone of that same document, assuming the whole document is roughly the same level of easy all the way through. That is, the mean of 1 3 2 1 3 2 1 3 2 is the same as the mean of 1 3 2. I'm doing something a bit more advanced than that, but I can assure you what I'm doing scales to the document size. Naturally I would only want to consider algorithms that are stable across different scales.
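The claim about means is easy to check. A minimal sketch, using only Python's standard `statistics` module and the toy scores from the sentence above:

```python
import math
from statistics import mean, geometric_mean

short_doc = [1, 3, 2]        # per-word difficulty scores of a short document
long_doc = short_doc * 3     # same pattern repeated: 1 3 2 1 3 2 1 3 2

# Both kinds of mean are unchanged when the document just gets longer
# at the same level of difficulty.
print(mean(short_doc), mean(long_doc))
print(geometric_mean(short_doc), geometric_mean(long_doc))
print(math.isclose(geometric_mean(short_doc), geometric_mean(long_doc)))
```

A raw sum of the scores, by contrast, would triple for the longer document, which is the kind of length dependence being ruled out here.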

This raises of course the issue of sample genre. 40000 podcasts represent a very wide universe. We can imagine the vast array of subjects that all have some specific vocabulary that must be understood.

That is from memory one of the themes in my other linked topic "SRS vs natural repetition". What I posit there is that although a specialised word may be rare in the entire corpus, a language learner may be able to artificially skew the frequency graph by filtering down to documents that cover a certain topic. That is actually one of the useful types of analysis that I hope will come out of this project.

Just what does it mean to have a vocabulary of 5000 word tokens?

I don't think many people have an intuitive feel for what it's like to have a vocabulary of 11,000 "word families" let alone 5,000 word tokens. But in one sense, you don't need to know what it absolutely means, only what it relatively means. In practice, we just need a scale that we can move along. Word tokens are just a magnified scale of word families, so the curve will be roughly the same, just stretched to a different x-axis scale. (That is, each word family corresponds on average to some multiple of word tokens.)

Finally, it seems to me that vocabulary size has a cumulative effect which is not reflected in this first chart.

Nor the second. But that is completely intentional.

This is fairly inconsequential, though. All of the raw data is there in the spreadsheet, and it is entirely possible to create different visualisations of the same data, e.g. as a cumulative graph if that is what you would find interesting. However, I set out specifically with the goal of placing each document into exactly one bucket, not placing the same document into multiple buckets (i.e. I do not want to place the beginner podcasts into the intermediate bucket just because an intermediate learner could understand them), because I was more interested in the distribution of content among these buckets. I encourage you to create further visualisations and share them below.
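For anyone who does want the cumulative view, it can be derived from the per-bucket counts with a running total. A small sketch, with made-up counts standing in for the spreadsheet column and `cumulative` as a hypothetical helper name:

```python
from itertools import accumulate

def cumulative(bucket_counts):
    """Turn per-bucket episode counts into running totals, ordered by vocabulary size."""
    sizes = sorted(bucket_counts)
    totals = accumulate(bucket_counts[size] for size in sizes)
    return dict(zip(sizes, totals))

# Made-up counts: vocabulary size -> episodes that fall exactly in that bucket.
per_bucket = {2000: 5, 1000: 1, 3000: 12, 4000: 15}
print(cumulative(per_bucket))
```

Each running total is the number of episodes a learner at that vocabulary size could follow, counting all the easier buckets as well.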

All of these issues have been analyzed in great depth by Paul Nation and others in their work on optimal vocabulary size for various language genres including movies and academic writing. I suggest the OP should have a look starting with https://www.wgtn.ac.nz/lals/resources/paul-nations-resources.

Although I respect Paul Nation as a linguist, it's just worth keeping in mind that computational linguistics is not his area of expertise, and his work in this area uses rather primitive computational techniques that are not necessarily accurate or cutting edge. It may also be interesting just to point out that my first graph actually uses techniques from one of Paul Nation's papers. Specifically, the method of computing 98% comprehensibility based on knowing 98% of the words in the document. Although inaccurate, I've used Paul Nation's model here only for simplicity of the calculations, because all I was really interested in was demonstrating the shape of the curve, not the specific result for each specific document. And as you can see above, the simple calculation of Paul Nation suffices for that purpose. However, I have explained elsewhere on this forum (as well as above) that comprehensibility in practice is a lot more than what is measured in Paul Nation's simple model. Comprehensibility is less a matter of how many words in the document you understand, and more a matter of how many sentences in the document you understand, and that includes not only understanding the key nouns of the sentence but also the grammar.

See this comment for one example.

Posted: **Thu Jul 29, 2021 2:02 am**

ryanheise wrote:…

From an analysis of 40,000 episodes of English speaking podcasts, the following chart plots how many of those episodes you would theoretically be able to understand with a vocabulary size of 1000, 2000, 3000, and each subsequent +1000 increment:

This is a rough measurement helped by some simplifying assumptions:
...
1. The meaning of comprehension is baked in at 98% minimum known words.
2. Words are learnt in the order of frequent to infrequent.
3. I count each distinct word form, not lemma.

...

Thanks for the interesting answers to my comments. I don’t agree with all the answers but they are food for thought. Here I want to revisit the statement that I disagree with fundamentally: … « how many of those episodes you would theoretically be able to understand with a vocabulary size of 1000, 2000, 3000, and each subsequent +1000 increment: »

Let’s take that magic number of 5000 words or more specifically 5000 unlemmatized tokens. We are not talking about unique words or word families. We’re looking at the number of units that Microsoft Word will display at the bottom of the screen. So the more you write or talk, the larger your vocabulary word token count, not necessarily the number of different words you know.

How many unique words are there in that 5000 token count? To simplify things, I’ll use a typical figure of 40% for the type-token ratio. This gives us a vocabulary of 2000 word families or unique words.
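The arithmetic here, and the ratio itself, can be sketched as follows. The 40% figure is the one assumed in this post; the helper name `type_token_ratio` and the sample sentence are made up for illustration.

```python
def type_token_ratio(tokens):
    """Number of distinct word forms (types) divided by the number of running words (tokens)."""
    return len(set(tokens)) / len(tokens)

# The estimate in the post: 5000 tokens at an assumed 40% ratio.
tokens_in_document = 5000
assumed_ratio = 0.40
print(round(tokens_in_document * assumed_ratio))   # 2000 types

# Measuring the ratio on an actual (toy) text instead of assuming it.
sample = "the cat sat on the mat and the dog sat too".split()
print(len(sample), len(set(sample)), type_token_ratio(sample))
```

One caveat: the ratio tends to fall as a text gets longer, because common words keep repeating, so a single fixed percentage is only a rough guide.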

We are told that with this vocabulary size and according to the first chart a speaker could theoretically understand around 9000 podcasts of 5000 word tokens each. In fact we are explicitly told that at this vocabulary sweet spot, the speaker could understand the majority of all the podcasts with 98% word coverage.

There is something wrong here. I won’t rehash the work of Paul Nation whose computational linguistics credentials may be a bit long in the tooth but there is a basic methodological issue here that Nation thoroughly explored. To state that with 5000 tokens you can understand all 9000 podcasts of 5000 tokens is to assume that all these podcasts contain the identical tokens.

In reality, once we get beyond the core repetitive vocabulary of let’s say around 1000 word families, the vocabularies of each podcast will start to diverge according to the contents and the writing style of the authors. So maybe you can read a Harry Potter with 2000 word types. But throw in a John Grisham USA legal thriller and you’ll have to add some vocabulary specific to that author. I’m sorry that I’m not familiar with any Aussie writers but I’ll add a Canadian author, Margaret Atwood, and now the vocabulary to read all three authors is more like 4000 word types if not more. If you want to read a wide range of modern literature, you’ll need a vocabulary easily in the 20000 word types range if not more.

So what vocabulary size is needed to get 98% word coverage of all 9000 podcasts? I don’t know. But I do believe that the relatively puny vocabulary of 5000 tokens or 2000 types will not come even close.

I do however think the chart is accurate in that it represents the distribution of word token sizes of podcasts. In fact, I came up with some similar figures by looking at the various lengths of podcasts and speaking rates. Assuming a speaking rate of 140 words a minute, a 5000-word podcast lasts approximately 30 minutes (allowing for some pauses). A 1000-word podcast is around 6 minutes and a two-hour podcast is approximately 18000 word tokens.

I stand by my statement that the chart may show interesting results but it does not show the number « of those episodes you would theoretically be able to understand with a vocabulary size of 1000, 2000, 3000, and each subsequent +1000 increment: »

Posted: **Thu Jul 29, 2021 2:57 am**

The vocabulary needed to reach 98% coverage of any given podcast will of course be much smaller than the vocabulary needed to reach 98% coverage of all of the podcasts together. Even if the average vocabulary needed for each podcast is not large, there may not be much overlap.
Or maybe I missed the point :lol:
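The overlap point can be illustrated with a toy sketch. It is simplified to 100% coverage rather than 98%, and the three tokenised "podcasts" and the helper names are made up.

```python
def vocab_for_each(docs):
    """Number of distinct word forms needed to cover each document on its own."""
    return [len(set(doc)) for doc in docs]

def vocab_for_all(docs):
    """Number of distinct word forms needed to cover every document at once."""
    return len(set().union(*docs))

# Three tiny made-up "podcasts" that share only the core word "the".
docs = [
    ["the", "wizard", "casts", "the", "spell"],
    ["the", "lawyer", "files", "the", "brief"],
    ["the", "handmaid", "tells", "the", "tale"],
]
print(vocab_for_each(docs))   # per-document requirement
print(vocab_for_all(docs))    # requirement for the whole set
```

Each document needs only 4 word forms by itself, but covering all three needs 10, because little beyond the shared core overlaps.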

The statistical distribution of language difficulty
