SpanishInput wrote:For example, in my list gathered from Netflix subs the word "güey" is in the top 5,000, even though my mom has no idea what it means.
That's why corpus size matters, and so do its sources (literature, newspapers, TV/radio/cinema, speech). Curated frequency lists take that into account.
SpanishInput wrote:This is why Imron from the "Chinese the hard way" blog (and also administrator of Chinese Forums) says that after you reach 1,200 words (HSK 4), it's waaaay more efficient to focus on words that are frequent in the book you're reading right now.
I don't know the situation with Chinese and other character-based languages. But by the time I start reading unadapted books in Western languages, I already know the core 5k words, and at that stage there aren't many frequently occurring words left to learn. Those that remain don't need deliberate focus, since their repetition makes them catch your attention anyway. The problem is usually with hapaxes (and dis, tris, and tetrakis legomena), i.e. words that occur only once (or two, three, or four times) in a text. For instance, take a look at
the statistical distribution in the Bible which, I think, reflects the statistical distribution of words in an average book. 2k words appear only once! If you put hapax, dis, tris, and tetrakis legomena together, they give you a whopping 3,500 words in total (out of 5,300). You won't be able to store them in your memory by extensive reading of the Bible alone. There simply aren't enough repetitions. The question then is: which of these 3,500 would you rather concentrate on first? As Iversen rightly said, only a small fraction of these hapaxes will overlap with the words in your next book. How to choose? In this case, intuition is not a reliable guide. Yes, you see bananas in your supermarket every day, but how often do you discuss them in real life? How often do you read about them in newspapers or books? People don't discuss bananas in everyday life and don't write novels about them. Yet this is what you usually get in textbooks: long lists of fruits, vegetables, clothes, furniture, professions and so on. As a result,
as some studies suggest, students are ill-equipped for reading actual texts because of the size and sampling of the textbook vocabulary:
The results of the frequency analysis of the vocabulary used in three current textbooks for beginners of German are somewhat disheartening. In all three books the percentage of vocabulary less frequent than the frequency rank 4,000 is high (29-44%). These percentages may be partly due to issues of practicality in creating a textbook, like classroom management vocabulary, students' interest, chapter topics, story line of the book, etc. But this is certainly only partially the case. Not all low-frequency words used in these books are connected to either classroom management or students' interests. To be sure, it can be debated how many low-frequency words should be included in first-year German textbooks. One should also keep in mind that, psycholinguistically, these words might contribute to an overload of students' capacities and lead to frustration.
Most importantly, learners should be familiar with high-frequency words. As far as sufficient text coverage and further vocabulary learning are concerned, the most-frequent 1,000 words are of such importance in language learning that teaching these words now appears to be absolutely essential. It is striking that only 64% and 61% of the most-frequent 1,000 words are included in Deutsch heute and Neue Horizonte, respectively. Even more noteworthy is the fact that Kontakte teaches only 53% of the most high-frequency words.
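If you want to run the same kind of check on your own textbook, here is a minimal sketch in Python. It is not the study's actual method, just my reading of the two figures quoted above: the share of the top 1,000 words that a textbook teaches, and the share of textbook words that fall beyond rank 4,000. The function names and the toy word lists are made up for illustration, and I assume you already have a frequency list sorted from most to least frequent and a textbook glossary reduced to the same kind of base forms.

```python
def top_n_coverage(frequency_list, textbook_vocab, n=1000):
    """Share of the n most frequent words that the textbook also teaches."""
    top = [w.lower() for w in frequency_list[:n]]
    if not top:
        return 0.0
    vocab = {w.lower() for w in textbook_vocab}
    hits = sum(1 for w in top if w in vocab)
    return hits / len(top)


def share_beyond_rank(frequency_list, textbook_vocab, rank=4000):
    """Share of textbook words ranked below `rank` or missing from the list."""
    ranks = {}
    for position, word in enumerate(frequency_list, start=1):
        # Keep the first (best) rank if a word is listed more than once.
        ranks.setdefault(word.lower(), position)
    vocab = {w.lower() for w in textbook_vocab}
    if not vocab:
        return 0.0
    # Words absent from the frequency list are treated as low-frequency.
    rare = sum(1 for w in vocab if ranks.get(w, rank + 1) > rank)
    return rare / len(vocab)


# Toy data, purely illustrative (not taken from any real list or textbook).
toy_frequency_list = ["der", "die", "und", "sein", "haben"]
toy_textbook = ["Der", "und", "Banane"]

print(top_n_coverage(toy_frequency_list, toy_textbook, n=4))
print(share_beyond_rank(toy_frequency_list, toy_textbook, rank=4))
```

The matching is plain lowercase string comparison, so the result is only as good as the lemmatization of both lists.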
Don't get me wrong, frequency dictionaries have their methodological limitations, and I'm perfectly aware of them. But when used appropriately and for the right purpose (i.e. reading), they are fantastic tools. I find them indispensable for developing reading skills, since they significantly accelerate the process.
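For anyone who wants to see how many hapaxes and other rare words their current book contains, the counting described above can be sketched in a few lines of Python. This is a rough sketch under stated assumptions: the tokenizer is a crude regex that only keeps runs of letters, inflected forms are counted as separate words (a curated frequency list would lemmatize them), and the function names and sample sentence are mine, chosen for illustration.

```python
import re
from collections import Counter


def frequency_spectrum(text):
    """Return per-word counts and the number of distinct words per count."""
    # Crude tokenization: lowercase, keep runs of letters only.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    word_counts = Counter(tokens)
    # spectrum[1] = number of hapaxes, spectrum[2] = dis legomena, and so on.
    spectrum = Counter(word_counts.values())
    return word_counts, spectrum


def rare_share(word_counts, max_count=4):
    """Share of distinct words that occur at most `max_count` times."""
    if not word_counts:
        return 0.0
    rare = sum(1 for count in word_counts.values() if count <= max_count)
    return rare / len(word_counts)


sample = "the cat saw the dog and the dog saw a bird"
counts, spectrum = frequency_spectrum(sample)
print(spectrum[1], "hapaxes out of", len(counts), "distinct words")
print(rare_share(counts, max_count=1))
```

Run on a whole book, `rare_share(counts, max_count=4)` gives the proportion of the vocabulary that appears four times or fewer, which is the group you are unlikely to pick up from that book alone.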