Hi Luke
This is not in lieu of a response. I'm just saving some handy and relevant earlier research here instead of leaving it strewn across random threads.
How many words are needed to do the things a language user needs to do?
Although the language makes use of a large number of words, not all of these words are equally useful. One measure of usefulness is word frequency, that is, how often the word occurs in normal use of the language. From the point of view of frequency, the word "the" is a very useful word in English. It occurs so frequently that about 7% of the words on a page of written English and the same proportion of the words in a conversation are repetitions of the word "the". Look back over this paragraph and you will find an occurrence of "the" in almost every line.
The good news for second language learners and second language teachers is that a small number of the words of English occur very frequently and if a learner knows these words, that learner will know a very large proportion of the running words in a written or spoken text. Most of these words are content words and knowing enough of them allows a good degree of comprehension of a text. Here are some figures showing what proportion of a text is covered by certain numbers of high frequency words.
Table 1: Vocabulary size and text coverage in the Brown corpus
Vocabulary size    Text coverage
1000               72.0%
2000               79.7%
3000               84.0%
4000               86.8%
5000               88.7%
6000               89.9%
15,851             97.8%
The figures in Table 1 refer to written texts and are from Francis and Kucera (1982), which draws on a very diverse corpus of over 1,000,000 running words made up of 500 texts, each around 2000 running words long. As we shall see, the more diverse the texts in a corpus, the greater the number of different words, and the high frequency words cover slightly less of the text, so these figures are a conservative estimate. The figures in the last line of the table are from Kucera (1982). The COBUILD Dictionary claims that 15,000 words cover 95% of the running words of their corpus. The figures in Table 1 are for lemmas and not word families. Word families would give fractionally higher coverage. Table 1 assumes that high frequency words are known before lower frequency words and shows that knowing about 2,000 word families gives near to 80% coverage of written text.
The same number of words gives greater coverage of informal spoken text - around 96% (Schonell, Meddleton and Shaw, 1956).
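For anyone who wants to try this on their own texts, here is a rough sketch of how a coverage figure like those in Table 1 can be computed: rank the words by frequency, then see what share of the running words the top N account for. It rests on some assumptions that the original studies do not share: the tokenizer is a naive regex, and it counts raw word forms rather than lemmas or word families, so its numbers will not match the published ones. The function names are mine, not from the paper.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (naive regex tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def coverage_by_vocab_size(tokens, sizes):
    """Map each vocabulary size to the percent of running words covered
    by that many of the most frequent word forms."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Frequencies only, sorted from most to least common.
    freqs = [n for _, n in counts.most_common()]
    return {size: 100.0 * sum(freqs[:size]) / total for size in sizes}


# Toy text, invented for illustration: 13 running words, "the" occurs 4 times.
sample = "the cat sat on the mat and the dog sat on the log"
result = coverage_by_vocab_size(tokenize(sample), [1, 2, 3])
for size, pct in result.items():
    print(f"top {size} word forms cover {pct:.1f}% of the text")
```

On a real corpus you would pass sizes such as 1000, 2000 and 3000 and would probably want to lemmatize first, which this sketch does not attempt.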
How much vocabulary and how should it be learned?
We are now ready to answer the question "How much vocabulary does a second language learner need?" Clearly the learner needs to know the 3,000 or so high frequency words of the language. These are an immediate high priority and there is little sense in focusing on other vocabulary until these are well learned. Nation (1990) argues that after these high frequency words are learned, the next focus for the teacher is on helping the learners develop strategies to comprehend and learn the low frequency words of the language. Because of the very poor coverage that low frequency words give, it is not worth spending class time on actually teaching these words. It is more efficient to spend class time on the strategies of (1) guessing from context, (2) using word parts and mnemonic techniques to remember words, and (3) using vocabulary cards to remember foreign language - first language word pairs. Detailed description of these strategies can be found in Nation (1990). Notice that although the teacher's focus is on helping learners gain control of important strategies, a major function of these strategies is to help the learners to continue to learn new words and increase their vocabulary size.
A way to manage the learning of huge amounts of vocabulary is through indirect or incidental learning. An example of this is learning new words (or deepening the knowledge of already known words) in context through extensive listening and reading. Learning from context is so important that some studies suggest that first language learners learn most of their vocabulary in this way (Sternberg, 1987). Extensive reading is a good way to enhance word knowledge and get a lot of exposure to the most frequent and useful words. At the earlier and intermediate levels of language learning, simplified reading books can be of great benefit. Other sources of incidental learning include problem solving group work activities (Joe, Nation and Newton, 1996) and formal classroom activities where vocabulary is not the main focus.
The problem for beginning learners and readers is getting to the threshold where they can start to learn from context. Simply put, if one does not know enough of the words on a page and have comprehension of what is being read, one cannot easily learn from context. Liu Na and Nation (1985) have shown that we need a vocabulary of about 3000 words which provides coverage of at least 95% of a text before we can efficiently learn from context with unsimplified text. This is a large amount of startup vocabulary a learner needs, and this just to comprehend general texts. So how can we get learners to learn large amounts of vocabulary in a short space of time?
The suggestion that learners should directly learn vocabulary from cards, to a large degree out of context, may be seen by some teachers as a step back to outdated methods of learning and not in agreement with a communicative approach to language learning. This may be so, but the research evidence supporting the use of such an approach as one part of a vocabulary learning program is strong.
To these research based arguments might be added the argument that most serious learners make use of such an approach. They can be helped to do it more effectively. There are other advantages for using word cards. They can give a sense of progress, and a sense of achievement, particularly if numerical targets are set and met. They are readily portable and can be used in idle moments in or out of class either for learning new words or revising old ones. They are specifically made to suit particular learners and their needs and are thus self motivating.
It should not be assumed that learning from word lists or word cards means that the words are learned forever, nor does it mean that all knowledge of a word has been learned. Learning from lists or word cards is only an initial stage of learning a particular word (see Schmitt and Schmitt, 1995 for further information). It is however a learning tool for use at any level of vocabulary proficiency. There will always be a need to have extra exposure to the words through reading, listening and speaking as well as extra formal study of the words, their collocates, associations, different meanings, grammar and so on. This shows a complementary relationship between contextualized learning of new words and the decontextualized learning from word cards.
What vocabulary does a language learner need?
The previous sections of this paper have suggested that second language learners need first to concentrate on the high frequency words of the language. In this section we look at some useful vocabulary lists based on frequency and review the research on the adequacy of the General Service List (West, 1953). Most counts also consider range, that is the occurrence of a word across several subsections of a corpus...
The second 1000 words behave in this way because they are lower frequency words than the first 1000 words and have a narrower range of occurrence. That is, their occurrence is more closely related to the topic or subject area of a text than that of the wide-ranging, more general-purpose words in the first 1000. But given a range of topics and genres, and enough texts, the second 1000 words are more generally useful than other lists of words.
After the 2000 high frequency words of the GSL, what vocabulary does a second language learner need? The answer to this question depends on what the language learner intends to use English for. If the learner has no special academic purpose then the learner should work on the strategies for dealing with low frequency words. If however the learner intends to go on to academic study in upper high school or at university, then there is a clear need for general academic vocabulary. This can be found in the 836 word list called the University Word List (UWL) (Xue and Nation, 1984; Nation, 1990).
The UWL consists of words that are not in the first 2000 words of the GSL but which are frequent and of wide range in academic texts. Wide range means that the words occur not just in one or two disciplines like economics or mathematics, but occur across a wide range of disciplines. Here are some items from it.
accompany   formulate     index        major      objective
biology     genuine       indicate     maintain   occur
comply      hemisphere    individual   maximum    passive
deficient   homogeneous   job          modify     persist
edit        identify      labour       negative   quote
feasible    ignore        locate       notion     random
(Nation, 1990)
The value of the UWL can be seen when we look at the coverage of academic text that it provides.
Note the low coverage the UWL has of fiction. Newspapers and magazines which are more formal make use of more of the UWL.
Very formal academic text makes the greatest use of the UWL. The UWL is thus a word list for learners with specific purposes, namely academic reading. The purpose behind the setting up of the UWL is to create a list of high frequency words for learners with academic purposes, so that these words can be taught and directly studied in the same way as the words from the GSL can.
Word frequency lists
The major theme of this paper has been that we need to have clear sensible goals for vocabulary learning. Frequency information provides a rational basis for making sure that learners get the best return for their vocabulary learning effort. Vocabulary frequency lists which take account of range have an important role to play in curriculum design and in setting learning goals.
This does not necessarily mean that learners must be provided with large vocabulary lists as the major source of their vocabulary learning. It does mean however that course designers should have lists to refer to when they consider the vocabulary component of a language course...
The following list suggests several of the factors that would need to be considered in the development of a resource list of high frequency words.
1 Representativeness The corpora that the list is based on should adequately represent the wide range of uses of language.
2 Frequency and range Most frequency studies have given recognition to the importance of range of occurrence. A word should not become part of a general service list because it occurs frequently. It should occur frequently across a wide range of texts. This does not mean that its frequency has to be roughly the same across the different texts, but means that it should occur in some form or other in most of the different texts or groupings of texts.
3 Word families
4 Idioms and set expressions Some items larger than a word behave like high frequency words. That is, they occur frequently as a unit (Good morning, Never mind), and their meaning is not clear from the meaning of the parts (at once, set out). If the frequency of such items is high enough to get them into a general service list in direct competition with single words, then perhaps they should be there. Certainly the arguments for idioms are strong, whereas set expressions could be included under one of their constituent words.
5 Range of information To be of full use in course design, a list of high frequency words would need to include the following information for each word - the forms and parts of speech included in a word family, frequency, the underlying meaning of the word, variations of meaning and collocations and the relative frequency of these meanings and uses, and restrictions on the use of the word with regard to politeness, geographical distribution etc.
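The "frequency and range" factor in the list above is easy to make concrete. Here is a minimal sketch, assuming that range is measured as the number of texts a word appears in at least once, and that a word qualifies only if it clears both a frequency threshold and a range threshold. The thresholds, function names and toy texts are my own inventions for illustration. Real counts such as the GSL used more elaborate criteria.

```python
from collections import Counter


def frequency_and_range(texts):
    """Return (frequency, text_range) counters for a list of tokenized texts.

    frequency[w]  = total occurrences of w across all texts
    text_range[w] = number of texts in which w occurs at least once
    """
    frequency = Counter()
    text_range = Counter()
    for tokens in texts:
        frequency.update(tokens)
        text_range.update(set(tokens))  # each text counts a word once
    return frequency, text_range


def select_general_words(texts, min_freq, min_range):
    """Words that are both frequent overall and spread across enough texts."""
    frequency, text_range = frequency_and_range(texts)
    return sorted(
        w for w in frequency
        if frequency[w] >= min_freq and text_range[w] >= min_range
    )


# Toy "corpus": "bond" is frequent (3 uses) but confined to two finance texts.
texts = [
    "the price of the bond fell".split(),
    "the bond market and the bond yield".split(),
    "the child read the book".split(),
]
print(select_general_words(texts, min_freq=3, min_range=1))  # ['bond', 'the']
print(select_general_words(texts, min_freq=3, min_range=3))  # ['the']
```

Requiring range as well as frequency is what keeps topic-bound words like "bond" in this toy data out of a general list, even though their raw frequency is high.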
http://www.robwaring.org/papers/CUP/cup.html
Vocabulary Range and Text Coverage: Insights from the Forthcoming Routledge Frequency Dictionary of Spanish
"In the following table -- which represents the main conclusions of this study -- we see the percent coverage of all tokens in three different registers (oral, fiction, and non-fiction) at three different levels of lexemes -- top 1000 words, top 2000 and top 3000.
Table 3. Percent coverage of tokens by groups of types/lemma
"
"As the data indicate, a limited vocabulary of 1000 words would allow language learners to recognize between 75-80% of all lexemes in written Spanish, and about 88% of all lexemes in spoken Spanish (which is due to the higher repetition of basic words in the spoken register). Subsequent extensions of the base vocabulary have increasingly marginal importance. By doubling the vocabulary list to 2000 words, we account for only about 5-8% more words in a given text, and the third thousand words in the list increases this only about 2-4% more. There clearly is a law of “diminishing returns” in terms of vocabulary learning."
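The same diminishing-returns pattern shows up in the English figures from Table 1 earlier in this post. A quick bit of arithmetic, using only the Brown corpus rows at 1000-word steps (the 15,851 row is left out because its step size is different), gives the percentage points each extra thousand words adds. The function name is mine.

```python
# Cumulative coverage figures copied from Table 1 (Brown corpus, lemmas).
table1 = [
    (1000, 72.0),
    (2000, 79.7),
    (3000, 84.0),
    (4000, 86.8),
    (5000, 88.7),
    (6000, 89.9),
]


def marginal_gains(rows):
    """Return (vocab_size, gain) pairs, where gain is the number of
    percentage points of coverage added by reaching that vocabulary size."""
    gains = []
    previous = 0.0
    for size, cumulative in rows:
        gains.append((size, round(cumulative - previous, 1)))
        previous = cumulative
    return gains


for size, gain in marginal_gains(table1):
    print(f"up to {size} words: +{gain} percentage points")
# Gains per step: 72.0, 7.7, 4.3, 2.8, 1.9, 1.2
```

So the second thousand words add about 7.7 points and the sixth thousand only about 1.2, which is in the same ballpark as the Spanish figures quoted above.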
"The data from Spanish and English are roughly comparable, but there is an important difference in the way in which the data was obtained. In Nation (2000), the words are grouped by what he calls “word families”, so that [courage, discouragement, encourage] would all be grouped under the headword [COURAGE], and [paint, painted, painter, painting] would all be grouped under the headword [PAINT]. In our study, however, we used the traditional lemma approach, in which pintar, pintura, pintor, and pintoresco would all be assigned to different lemma, and [pintamos, pinto, and pintarás] would all be assigned to the lemma [PINTAR]. Because we separate the nominal, verbal, and adjectival uses, we might expect that the same number of headwords would lead to less text coverage than in English. The fact that this does not happen, however, is probably due to the fact that English has a larger lexical stock than Spanish, due to the influence of native Anglo-Saxon and imported Franco-Norman and Latinate words (e.g. real, royal, regal). The fact that the same amount of lexemes in German leads to lower textual coverage is somewhat more difficult to explain. It may be due to the still-incomplete state of the German tagger (Jones, p.c.). Or again, it may be due to a generally larger lexical stock in German than in Spanish, though this is much more debatable.
10. Conclusion
Hopefully the preceding discussion provides some useful insight into the issue of vocabulary range and text coverage, and the way in which the extracted data can be used to create a more useful frequency dictionary of Spanish.
From the point of view of a language learner, the important point is that text coverage clearly obeys the law of diminishing returns. With about 4000 words, a language learner would be able to recognize more than 90% of the words in a typical native speaker conversation. If s/he learns two thousand more words, however, this will increase coverage by only about 3-4%. We have also seen that the degree of coverage is a function of register and part of speech, and have provided detailed data to support this view. We have also considered the role of vocabulary range, and how factors such as register affect this as well."
http://www.lingref.com/cpp/hls/7/paper1091.pdf
Selecting Television Programs for Language Learning: Investigating Television Programs from the Same Genre
"In a corpus-driven study looking at the number of words needed to understand the vocabulary in television programs, Webb and Rodgers (2009a) found that a vocabulary size of 3000 word families plus proper nouns and marginal words provided 95.45% coverage of a corpus made up of 88 television programs from a variety of genres."
"Webb and Rodgers' (2009a) findings also shed light on differences between television genres. Children’s programs were found to have the smallest vocabulary load; the most frequent 2000 word families, plus proper nouns and marginal words accounted for 95% coverage. The most frequent 3000 word families plus proper nouns and marginal words accounted for 95% of American drama, older programs, situation comedies and British programs. The genres with the greatest proportions of low frequency words were news stories and science fiction programs. Results also indicated that coverage is likely to vary between episodes of programs leading Webb and Rodgers to suggest that randomly viewing programs may limit comprehension. Instead they proposed watching programs from within the same subgenre that have similar topics and storyline."
https://www.researchgate.net/publicatio ... Same_Genre
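The point about coverage varying between episodes suggests a simple check a learner could run on transcripts or subtitles: given the set of words you already know, what percent of the running words in each episode is covered, and does it reach the 95% level mentioned in the excerpts above? This is only a sketch under loud assumptions: it matches raw word forms rather than word families, it does nothing special for proper nouns or marginal words (which Webb and Rodgers count separately), and the known-word list and "transcripts" are toy data I made up.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (naive regex tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def known_word_coverage(text, known_words):
    """Percent of running words in `text` that appear in `known_words`."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in known_words)
    return 100.0 * known / len(tokens)


# Hypothetical learner vocabulary and invented one-line "transcripts".
known_words = {"the", "ship", "is", "on", "a", "we", "are", "going", "home", "to"}
episodes = {
    "episode_1": "We are going home on the ship",
    "episode_2": "The ship is going to a nebula",
}

THRESHOLD = 95.0  # coverage level cited in the excerpts above
for name, transcript in episodes.items():
    pct = known_word_coverage(transcript, known_words)
    verdict = "ok" if pct >= THRESHOLD else "below threshold"
    print(f"{name}: {pct:.1f}% known ({verdict})")
```

Running the same check across several episodes of one program and comparing the percentages is a cheap way to see how much the vocabulary load shifts from episode to episode.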