s_allard wrote:
ryanheise wrote:…
From an analysis of 40,000 episodes of English-language podcasts, the following chart plots how many of those episodes you would theoretically be able to understand with a vocabulary size of 1000, 2000, 3000, and each subsequent +1000 increment:
This is a rough measurement based on some simplifying assumptions:
...
1. Comprehension is defined as a minimum of 98% known words.
2. Words are learnt in order from most frequent to least frequent.
3. I count each distinct word form, not lemma.
...
Thanks for the interesting answers to my comments. I don't agree with all the answers, but they are food for thought. Here I want to revisit the statement that I disagree with fundamentally: "…how many of those episodes you would theoretically be able to understand with a vocabulary size of 1000, 2000, 3000, and each subsequent +1000 increment"
You seem to have misunderstood.
"Known words" (x-axis) is your vocabulary size, while the y-axis is how many podcast episodes exist where 98% of the tokens in it are within your vocabulary.
So, given the simplifying assumption that the learner learns words in the order of my set vocabulary list (in my case, a word frequency list generated from the entire corpus; in Paul Nation's case, a set vocabulary list for a pre-defined grade of his graded readers), the graph means that:
* Knowing the top 1000 most frequent words in the corpus, you have 98% coverage of 15 documents.
* Knowing the top 2000 most frequent words, you have 98% coverage of an additional 105 documents.
* Knowing the top 3000 most frequent words, you have 98% coverage of an additional 2653 documents.
So on the histogram, the first 3 buckets contain 15, 105 and 2653. That is all.
Just to put it another way, the x-axis describes your vocabulary size (labelled "known words"), while the y-axis describes how many documents you would be able to comprehend with that vocabulary size, using a Paul Nation-style method of estimating comprehensibility.
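For concreteness, here is a minimal Python sketch of that bucketing calculation. To be clear, this is not the code behind the chart: the whitespace tokeniser and the names `tokenize` and `coverage_histogram` are illustrative stand-ins.

```python
import math
from collections import Counter

THRESHOLD = 0.98  # assumption 1: comprehension = at least 98% known tokens
BUCKET = 1000     # vocabulary-size increment between buckets

def tokenize(text):
    # Stand-in tokeniser: distinct word forms, no lemmatisation (assumption 3).
    return text.lower().split()

def coverage_histogram(documents):
    # Corpus-wide frequency list, most frequent first (assumption 2:
    # words are learnt in order from frequent to infrequent).
    freq = Counter(t for doc in documents for t in tokenize(doc))
    rank = {w: i + 1 for i, (w, _) in enumerate(freq.most_common())}

    buckets = Counter()
    for doc in documents:
        ranks = sorted(rank[t] for t in tokenize(doc))
        # Smallest vocabulary size covering >= 98% of this document's
        # tokens: the rank at the 98th percentile of its token ranks.
        needed = ranks[math.ceil(THRESHOLD * len(ranks)) - 1]
        # Count the document in the bucket that vocabulary size falls into.
        buckets[math.ceil(needed / BUCKET) * BUCKET] += 1
    return buckets  # e.g. {1000: 15, 2000: 105, 3000: 2653, ...}
```

Run over the 40,000 transcripts with the actual tokeniser, this would produce exactly the per-bucket counts plotted on the y-axis.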
Let's take that magic number of 5000 words, or more specifically 5000 unlemmatized tokens. We are not talking about unique words or word families. We're looking at the number of units that Microsoft Word displays at the bottom of the screen. So the more you write or talk, the larger your token count grows, but not necessarily the number of different words you know.
The size of the document is irrelevant in Paul Nation's method, which I am using here. You don't simply count how many words are in the document. You start with the learner's set of known words (where we classify the learner into these artificial buckets), and THEN you look at every word in a given document and check whether it is one of the words that this artificial learner knows. Once you get to the end of the document, you look at what percentage of the words in the document were already known by the learner. If that percentage is above a certain comprehensibility threshold, in this case 98%, then we say that document is comprehensible given that vocabulary. That is what the first graph shows: a basic Paul Nation-style calculation, just using different buckets.

The same style of analysis is applicable to different word counting methods (e.g. word families, lemmas, etc.). In that case, you normalise each word before entering it into the bucket's vocabulary list, and you also normalise each word in the document before looking it up in the vocabulary list.
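In code, that per-document check amounts to this (a sketch; `is_comprehensible`, `known_words`, and `normalise` are illustrative names, not from the original analysis):

```python
def is_comprehensible(document_tokens, known_words, threshold=0.98,
                      normalise=lambda w: w):
    # known_words is the learner's vocabulary for a given bucket, with the
    # same normalisation already applied to each entry.
    known = sum(1 for tok in document_tokens if normalise(tok) in known_words)
    # Comprehensible iff the proportion of already-known tokens meets
    # the threshold.
    return known / len(document_tokens) >= threshold
```

With distinct word forms, `normalise` is the identity; for lemmas or word families, you would pass the same lemmatiser or family lookup used to build `known_words`.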
How many unique words are there in that 5000-token count? To simplify things, I'll use a typical type-token ratio of 40%. This gives us a vocabulary of 2000 word families or unique words.
We are told that with this vocabulary size, and according to the first chart, a speaker could theoretically understand around 9000 podcasts of 5000 word tokens each. In fact, we are explicitly told that at this vocabulary sweet spot, the speaker could understand the majority of all the podcasts with 98% word coverage.
Based on the above comments, I will assume this point is cleared up.
There is something wrong here. I won't rehash the work of Paul Nation, whose computational linguistics credentials may be a bit long in the tooth, but there is a basic methodological issue here that Nation thoroughly explored. To state that with 5000 tokens you can understand all 9000 podcasts of 5000 tokens each is to assume that all these podcasts contain identical tokens.
It may be easier if you assume that I am familiar with Paul Nation's work.
But in any case, hopefully the above answers help to clarify the methodology and what the graph depicts. "Known words" describes the learner's vocabulary size, divided into arbitrary buckets according to the assumptions above.