The statistical distribution of language difficulty
Posted: Sun Jul 25, 2021 2:28 pm
If we examine every book, every movie, or every utterance in some given language, and divide all of that language up into buckets of different difficulty levels, how would the language be distributed across those difficulty buckets? Is most language intermediate? Is most language advanced? How much of naturally occurring language is beginner level?
From an analysis of 40,000 episodes of English-language podcasts, the following chart plots how many of those episodes you would theoretically be able to understand with a vocabulary size of 1,000, 2,000, 3,000, and so on in increments of 1,000 words:
This is a rough measurement helped by some simplifying assumptions:
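1. "Understanding" an episode is defined as knowing at least 98% of its words.
2. Words are learnt in order from most frequent to least frequent.
3. Each distinct word form is counted separately rather than being grouped under a lemma (so "run" and "running" count as two words).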
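To make these assumptions concrete, here is a minimal Python sketch of how the vocabulary size required for one episode could be computed. It is not the code behind the chart, only an illustration under the three assumptions above. The function names, the regex tokenizer, and the convention that word ranks come from the same corpus as the episodes are all my own assumptions for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption 3: every distinct word form counts, so no lemmatisation.
    # The regex is a crude stand-in for a real tokenizer.
    return re.findall(r"[a-z']+", text.lower())


def build_rank(corpus_tokens):
    # Assumption 2: words are learnt from most to least frequent,
    # so rank 1 is the most frequent word form in the corpus.
    counts = Counter(corpus_tokens)
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}


def required_vocab_size(episode_tokens, rank, coverage=0.98):
    # Assumption 1: an episode is "understood" once the learner knows
    # at least `coverage` of its running words.
    # Every token is expected to appear in `rank` (same corpus).
    if not episode_tokens:
        return 0
    ranks = sorted(rank[token] for token in episode_tokens)
    needed = math.ceil(coverage * len(ranks))
    # The rank of the last token we must know is the smallest
    # vocabulary size that reaches the coverage target.
    return ranks[needed - 1]


def episodes_understood(required_sizes, step=1000, max_size=10000):
    # For each vocabulary size, count the episodes whose requirement
    # is at or below that size (the cumulative curve in the chart).
    return {
        size: sum(1 for r in required_sizes if r <= size)
        for size in range(step, max_size + 1, step)
    }
```

To use it, tokenize every episode, build the rank table from all tokens combined, call required_vocab_size once per episode, and pass the resulting list to episodes_understood. Changing the coverage argument is how a different comprehension threshold could be tried.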
The full spreadsheet is here if anyone would like to play with the data or create different visualisations of it. Or if you'd like me to rerun the analysis with different assumptions, let me know below.
Some takeaways:
1. There is a relative shortage of natural content that is simple enough for beginners to understand.
2. Learning vocabulary is more rewarding for the first 5,000 words, after which a natural plateau is reached.
3. The long tail after the 5,000-word mark constitutes the majority of content.
The second chart below shows a similar result, but using a computed notion of language difficulty (which includes vocabulary size as just one factor among others):
Visually there is little difference between the two graphs. The difference shows up if you sort the actual podcast episodes by each measure: the two lists come out in different orders, because the episode that requires the smallest vocabulary is not necessarily the easiest one (e.g. if it has difficult grammar). So if you want to know which podcast would be most suitable to listen to next at your current level, the difficulty scores behind the second graph are likely to be the more useful of the two.
This analysis follows on from a previous post, "SRS vs natural repetition", which was based on the same data set. It's been a while since I last looked at this data set (due more to health issues than laziness), but I suddenly felt like resurrecting it and ultimately getting it to a point where I can repeat the analysis for other languages and publish the lists, so that anyone can use them to find suitable podcasts to listen to.