s_allard wrote: I hate to be seen as nagging here, but I really have a problem understanding just exactly how many documents I can understand with a vocabulary size of 5000 word types or families.
By the way, the fact that the lists are lemmatized changes much of the discussion.
When I look at the chart, it's difficult to see the exact number of podcasts that correspond to a 5000-word vocabulary size. It would seem to be about 8000 podcasts. As someone else has mentioned, I, the language learner, could choose from these 8000 documents and be sure that they are at my level, i.e. my vocabulary size will give 98% coverage of any randomly chosen document from the corresponding histogram.
Now that we are using lemmatized lists, I find these values more realistic, but the fundamental problem remains. Don't the 8000 documents each have to contain nearly identical words for my individual list to give me 98% coverage in all 8000 documents? How could it be otherwise?
Quite easily, actually. Let me illustrate with another example. Suppose we have a fictional language containing only 26 words: "a", "b", and so on up to "z", and that just by chance "a" happens to be the most frequent word, "b" the next most frequent, and so on down to "z", the least frequent of them all.
Now let's compare two fictional documents:
Document 1: a b c h x a b c d y
Document 2: w e f e h e f e f z
In fact, if you yourself know the 8 most frequent words, i.e. "a" through "h", and your target is 80% comprehension, then both of the above documents are comprehensible at 80%, and yet they use almost completely different words.
Document 1 uses these distinct words: a, b, c, d, h, x, y
Document 2 uses these distinct words: e, f, h, w, z
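Just to make that concrete, here is a tiny Python sketch (my own illustration, not anything from Paul Nation's tools) that lists the distinct words of each toy document and shows that the only word they share is "h":

# Two toy documents in the fictional 26-word language.
doc1 = "a b c h x a b c d y".split()
doc2 = "w e f e h e f e f z".split()

# Distinct words used by each document.
print(sorted(set(doc1)))        # ['a', 'b', 'c', 'd', 'h', 'x', 'y']
print(sorted(set(doc2)))        # ['e', 'f', 'h', 'w', 'z']

# Overlap between the two documents.
print(set(doc1) & set(doc2))    # {'h'} -- the only word in common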
When it comes to Paul Nation's calculation, it doesn't actually matter how many distinct words are used in a document. The only thing that matters is whether 80% (or whatever comprehension target you have) of the words in the document are in your vocabulary. So to Paul Nation's calculation, the two documents actually look like this:
Document 1: 1 1 1 1 0 1 1 1 1 0
Document 2: 0 1 1 1 1 1 1 1 1 0
where 1 means the word is in your vocabulary and 0 means it is not. Seen this way, both documents come out as 80% comprehensible.
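If it helps, here is a minimal Python sketch of that reduction, assuming the learner's vocabulary is the 8 most frequent words "a" through "h" and the target is 80%:

# Learner's vocabulary: the 8 most frequent words of the toy language.
known = set("abcdefgh")

def coverage(tokens, vocab):
    # Reduce each token to 1 (known) or 0 (unknown), then take the proportion of 1s.
    flags = [1 if tok in vocab else 0 for tok in tokens]
    return sum(flags) / len(flags)

doc1 = "a b c h x a b c d y".split()
doc2 = "w e f e h e f e f z".split()

print(coverage(doc1, known))    # 0.8
print(coverage(doc2, known))    # 0.8

Both documents hit the 80% target even though the actual words behind the 1s are mostly different.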
s_allard wrote: But if those podcasts have individual vocabulary differences - as I suspect - despite having identical vocabulary sizes, then it's a whole different matter.
Your mistake again is to think about the vocabulary size of the document when what is actually being considered is the vocabulary size of the learner. The vocabulary size of the document is not a factor in Paul Nation's calculation.
For example, consider this very short sentence from a computer science textbook:
"Java subclasses imply subtypes."
Not comprehensible to most people, I would say. And yet the document vocabulary size is only 4.
Now consider the following document:
"I like playing with my friends and I like watching movies."
This is more comprehensible than the previous sentence, despite having a larger document vocabulary size of 9 words.
What matters in the comprehensibility estimation is not the vocabulary size of the document, but rather the vocabulary size of the learner. If the learner knows the top 5,000 most frequently used words in a language, they will be able to understand a "variety" of different documents that each use different subsets of that 5,000. A document will be classified into a bucket based on the least frequent word you need to know in order to understand the document at the target comprehension level. In my previous example, "h" was the least frequent word in the language that you needed to know in order to understand both documents at 80% comprehension, even though both documents used different subsets of the words "a" up to "h".
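Here is a rough sketch of that bucketing idea as I understand it (my own simplification, not Paul Nation's actual program): rank the language's words by frequency, and put a document into the bucket given by the smallest "top N words" vocabulary that reaches the target coverage.

import string

# The fictional language, ranked by frequency: "a" most frequent, "z" least.
lexicon = list(string.ascii_lowercase)

def bucket(tokens, ranked_lexicon, target=0.80):
    # Find the smallest top-N vocabulary that covers the target share of tokens.
    for n in range(1, len(ranked_lexicon) + 1):
        vocab = set(ranked_lexicon[:n])
        covered = sum(tok in vocab for tok in tokens) / len(tokens)
        if covered >= target:
            return n, ranked_lexicon[n - 1]   # vocabulary size, least frequent word needed
    return None

doc1 = "a b c h x a b c d y".split()
doc2 = "w e f e h e f e f z".split()

print(bucket(doc1, lexicon))    # (8, 'h')
print(bucket(doc2, lexicon))    # (8, 'h')

Both toy documents end up in the same bucket (vocabulary size 8, least frequent word needed "h"), even though they draw on different subsets of those 8 words. That is exactly why thousands of podcasts can sit in the same bucket for a 5000-word vocabulary without containing nearly identical words.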