s_allard wrote:Beli Tsar wrote:s_allard wrote:I have asked a number of times: don't the podcasts within the same histogram or bucket have to have nearly identical vocabularies? The answer is no, as long as 98% of their vocabulary comes from the 5000-word list. But that said, I tend to think that podcasts of similar vocabulary breadth, i.e. similar numbers of unique words, would tend to have very similar sets of word tokens. For example, two podcasts that each contain the same 5000 unique words from my vocabulary set would have nearly identical token vocabularies, with small differences in the two sets of 100 words that are not in my vocabulary.
This is simply not true, and the examples the OP and rdearman have given do make this clear. There is no requirement for the podcasts to cover all the words on the list, so they can differ both in the words that are in your vocabulary and in those that are not.
It's just the same as the way graded readers can be about radically different subjects while still sticking to a very narrow set of words - often much narrower than 5000!
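To make that concrete, here is a rough toy sketch in Python. Everything in it is made up for illustration: the word list is just w0 to w4999, and the two "podcasts" are synthetic token lists, not real transcripts. It only shows that two texts can each reach 98% coverage from the same 5000-word list while sharing few of their unique words.

```python
def coverage(tokens, known):
    """Fraction of running words (tokens) that appear in the known list."""
    return sum(1 for t in tokens if t in known) / len(tokens)

def jaccard(a, b):
    """Overlap between two sets of unique words: shared / combined."""
    return len(a & b) / len(a | b)

# Hypothetical 5000-word listener vocabulary: w0 ... w4999
known = {f"w{i}" for i in range(5000)}

# Two invented "podcasts": each has 490 known words plus 10 unknown ones,
# and every word is used twice (1000 tokens, 500 unique words each).
pod_a = ([f"w{i}" for i in range(0, 490)] + [f"a{i}" for i in range(10)]) * 2
pod_b = ([f"w{i}" for i in range(400, 890)] + [f"b{i}" for i in range(10)]) * 2

cov_a = coverage(pod_a, known)   # 980 of 1000 tokens are known
cov_b = coverage(pod_b, known)   # same by construction
shared = set(pod_a) & set(pod_b)
overlap = jaccard(set(pod_a), set(pod_b))

print(f"coverage A: {cov_a:.2f}, coverage B: {cov_b:.2f}")
print(f"shared unique words: {len(shared)}, Jaccard overlap: {overlap:.3f}")
```

Both toy podcasts sit at 98% coverage, yet they share only 90 of their 500 unique words each. Real podcasts would of course share far more function words than this artificial example does, so the true overlap would be higher, but coverage alone does not force the vocabularies to match.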
I think I said quite clearly in my own answer that podcasts in the same histogram generally do not need to have identical vocabularies. But what happens when multiple podcasts contain a very large proportion of the listener's vocabulary? So a number of podcasts each contain 5000 unique words and I can fully comprehend each podcast with 98% word coverage. Doesn't that mean that each podcast has the same common 5000 words from my list plus the little two percent of individual variation? What other words can there be?
This is perhaps hypothetically true, but it is not a realistic, real-life occurrence. Does any podcast with 5000 unique words actually exist?
Unlike Ryanheise, I don't have a podcast corpus to test this on. But to do a quick and dirty (Bayesian-style?) confirmation, I used what I had. I'm a regular public speaker and have a YouTube channel, which is not so different from podcasting, so I analysed a few scripts for unique words. There were between 700 and 760 unique words in talks that ran roughly 20-28 minutes, with remarkable consistency within that range, and 760 seems to be a high outlier. These are non-technical talks for normal people, but still, they aren't simplified for language learners or anything like that. This seems to confirm the idea that we don't use that many unique words in this kind of communication, something that fits with your own philosophy of language learning and vocabulary acquisition, as I understand it from your older posts?
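For anyone who wants to repeat this kind of count on their own scripts, a minimal sketch might look like the following. It is an assumption about method, not a description of the tool used for the figures above: it lowercases the text and counts distinct word forms, so "word" and "words" count separately (no lemmatisation), and the sample sentence is invented.

```python
import re

def unique_words(text):
    """Return the set of distinct lowercase word forms in a transcript.

    Words are runs of letters, optionally with one internal apostrophe
    (so "don't" stays together). Inflected forms are NOT merged.
    """
    return set(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))

# Invented sample text; replace with the contents of a real script.
sample = "We don't use that many words. We use the same words again and again."

types = unique_words(sample)
tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", sample.lower())

print(len(tokens), "running words,", len(types), "unique words")
```

Counting lemmas or word families instead of raw forms would give lower numbers, so the two kinds of count should not be compared directly.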
Any podcast with 5000 unique words would have to be either many, many hours long or near-incomprehensible gibberish.