Team Me: Foxing Around

User avatar
reineke
Black Belt - 3rd Dan
Posts: 3570
Joined: Wed Jan 06, 2016 7:34 pm
Languages: Fox (C4)
Language Log: https://forum.language-learners.org/vie ... =15&t=6979
x 6554

Re: Team Me: Foxing Around

Postby reineke » Fri Jan 14, 2022 10:46 pm

YOUR RESULTS

conservative

You answered YES to 59% of the real words.

You answered YES to 0% of the invented words.

This gives you a corrected score of 59% - 0% = 59%.

aggressive

You answered YES to 60% of the real words.

You answered YES to 10% of the invented words.

This gives you a corrected score of 60% - 10% = 50%.

This is close to my result in Dutch. The thing is, I understand Spanish a whole lot better than Dutch, which is mostly opaque to me. I saw too many international cognates and too few easy Romance cognates. It also felt like the test drew too heavily on narrow, low-frequency technical sources.

On the positive side, my result was considerably higher in Italian. I rely more on context for a Spanish word to ring a bell in my head, and if I see a word in isolation, or in an awkward word form, I may simply blank out.
3 x

User avatar
reineke
Black Belt - 3rd Dan
Posts: 3570
Joined: Wed Jan 06, 2016 7:34 pm
Languages: Fox (C4)
Language Log: https://forum.language-learners.org/vie ... =15&t=6979
x 6554

Re: Team Me: Foxing Around

Postby reineke » Sat Jan 15, 2022 4:36 pm

"This site contains what is probably the most accurate word frequency data for English. The data is based on the one billion word Corpus of Contemporary American English (COCA) -- the only corpus of English that is large, up-to-date, and balanced between many genres.

When you purchase the data, you have access to four different datasets, and you can use whichever ones are the most useful for you. Short samples are given below for each of these datasets, and you can also see much more complete samples (every tenth entry), as well as (new in Jan 2021) free copies of the top 5,000 entries for each list.

1. The most basic dataset shows the frequency of each of the top 60,000 words (lemmas) in each of the eight main genres in the corpus. Unlike word frequency data that is based only on web pages, the COCA data lets you see the frequency across genres, to know whether a word is more informal (e.g. blogs or TV and movie subtitles) or more formal (e.g. academic).

2. A second dataset shows the frequency not only in the eight main genres, but also in nearly 100 "sub-genres" (Magazine-Sports, Newspaper-Finance, Academic-Medical, Web-Reviews, Blogs-Personal, TV-Comedies, etc.).

3. A third dataset shows the frequency of the word forms of the top 60,000 lemmas.

4. A final dataset shows the top 219,000 words in the billion-word corpus -- each word that occurs at least 20 times and in 5 different texts. In this list, the words are not lemmatized (i.e. each form of a word is listed separately from other forms) and the words are not tagged for part of speech. For each word, it shows in which genres it is most common.

Our data is based on two different corpora: the 14 billion word iWeb corpus and the Corpus of Contemporary American English (COCA). COCA is the only corpus of English that is large (one billion words), up-to-date (the latest texts are from late 2019), and based on a wide range of genres (e.g. blogs and other web pages, TV/movie subtitles, (more formal) spoken, fiction, newspapers, magazines, academic writing). Most of the following refers to the COCA word lists.

Why worry about what corpus is used? After all, there are many English word lists and frequency lists out on the Web (see in particular the British National Corpus and the American National Corpus). Some are good, and others are very poor in quality. Not all frequency lists are created equal.

One should be very, very suspicious of word lists that are taken from messy web data, outdated texts, or corpora that are too small to effectively model what is happening in the real world. Or worse, word lists that don't give you any idea what they are based on. As the saying goes: "garbage in (bad texts), garbage out (frequency lists)".

Here are some questions you might ask yourself as you consider downloading or purchasing a word list:

Depth and accuracy. Why do so many wordlists on the web contain just the top 1000-3000 words of English? Why not the top 20,000 or 60,000? It's because even a bad corpus (the collection of texts that the word lists are based on) can produce a moderately accurate list for the very most frequent words. But because the corpus is neither deep nor balanced enough, you start getting messy data for medium and lower frequency words. Ask to see samples of the top 20,000 or 60,000 words (e.g. every 7th or 10th word). If they don't have it, then you should be very, very suspicious of that word list.

Genres. Does the corpus contain texts from a wide variety of genres -- spoken, fiction, popular magazines, newspapers, and academic journals? Frequency lists that are based on just one of these may only contain 40-50% of the words from a more balanced corpus. The COCA data is based on the Corpus of Contemporary American English, which is almost perfectly balanced across genres.

Size. COCA contains about one billion words of text, and each of the top 20,000 words occurs ~1000 times or more. In a small 10-20 million word corpus, some of these words would occur just 7-8 times. At that point, the lower frequency words might make it into the list "by chance", whereas others are left out. No such problem with COCA. (And iWeb is 25 times as large as COCA).

How recent is it? Language change happens. If the word list is based on 15-20-year-old texts (or, much worse, 100-year-old public domain novels), then it will be missing many of the words of the modern language. COCA is based on texts from 1990-2019 (28 million words each year, plus blogs and other web pages from 2012-13) and iWeb was collected in 2017 -- in other words, virtually right up to the current time.

Are they just word forms? Do you really want to see the individual frequency of shoe and shoes, or realize, realizes, realized, and realizing? Do you want to have the combined frequency of watch as a verb (they watch TV) and watch as a noun (his watch broke)? If the lists are simply taken from pages that are "scraped" from the web, they will just provide long lists of words, without grouping them meaningfully (e.g. shoe/shoes), or separating them when necessary (e.g. watch as a noun and as a verb). Both the COCA and iWeb word lists show the lemma (e.g. decide = decide, decides, decided, deciding) and group by part of speech (e.g. watch as a noun and as a verb).

In addition to frequency lists for English, we also have what we believe are the most accurate frequency lists for Spanish, containing the top 20,000 lemmas / words in the language. The Spanish data is based on the 20 million words from the 1900s in the 100 million word Corpus del Español, which is the only corpus of Spanish that is 1) large, 2) balanced across genres (spoken, fiction, newspaper, academic), and 3) accurately tagged for part of speech and lemma (which is necessary to create a frequency dictionary).

In addition to frequency lists for English, we also have what we believe are the most accurate frequency lists for Portuguese, containing the top 20,000 lemmas / words in the language. The Portuguese data is based on the 20 million words from the 1900s in the 45 million word Corpus do Português, which is the only corpus of Portuguese that is 1) large, 2) balanced across genres (spoken, fiction, newspaper, academic), and 3) accurately tagged for part of speech and lemma."


https://www.wordfrequency.info/intro.asp
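
A quick illustration of the lemmatized vs. unlemmatized distinction those datasets draw (my own Python sketch, not the site's code; the tiny lemma table and sentence are invented, and real lists also use POS tagging, which this ignores):

[code]
from collections import Counter

# Stub form->lemma table; real lists derive this with POS tagging.
LEMMA = {"shoes": "shoe", "realizes": "realize", "realized": "realize",
         "realizing": "realize", "watches": "watch", "watching": "watch"}

tokens = "he realized his shoes were wet and realizes it again watching his watch".split()

form_freq = Counter(tokens)                            # unlemmatized, like the 219,000-word list
lemma_freq = Counter(LEMMA.get(t, t) for t in tokens)  # lemmatized, like the 60,000-lemma lists

print(form_freq["realized"], form_freq["realizes"])    # counted separately: 1 1
print(lemma_freq["realize"])                           # grouped under one lemma: 2
print(lemma_freq["watch"])                             # watch + watching: 2 (noun/verb not separated here)
[/code]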

List of lexical databases (non-exhaustive)

Focused on frequency and lexical characteristics

Celex (Dutch, English, German; frequencies, phonology, orthography, morphology)

Web interface: http://celex.mpi.nl

Manuals: http://www.ldc.upenn.edu/Catalog/docs/LDC96L14/

Source files: http://crr.ugent.be/*****


BNC (British English; frequencies)

Sources: http://www.kilgarriff.co.uk/bnc-readme.html


Lexique (French; frequencies, phonology, morphology, reaction times)

Web interface: http://www.lexique.org

Sources: http://www.lexique.org/telLexique.php


Subtlex-US (American English; subtitle frequencies)

http://subtlexus.lexique.org/


Subtlex-NL (Dutch; subtitle frequencies, parts of speech)

http://crr.ugent.be/subtlex-nl/


Subtlex-CH (Chinese; Subtitle frequencies, parts of speech)

http://expsy.ugent.be/subtlex-ch/


Subtlex-GR (Greek; Subtitle frequencies)

http://www.bcbl.eu/databases/subtlexgr


Subtlex-DE (German; Subtitle frequencies)

http://crr.ugent.be/subtlex-de/


Subtlex-ESP (Spanish; Subtitle frequencies)

http://crr.ugent.be/archives/679


Subtlex-PL (Polish; Subtitle frequencies, parts of speech)

http://crr.ugent.be/programs-data/subti ... subtlex-pl


SUBTLEX-UK (British English; Subtitle frequencies, parts of speech)

http://crr.ugent.be/archives/1423


SUBTLEX-PT-BR (Brazilian Portuguese; Subtitle frequencies)

http://crr.ugent.be/programs-data/subti ... btlex-pt-b


SUBTLEX-PT (Portuguese; Subtitle frequencies)

http://p-pal.di.uminho.pt/about/databases

Google Books Ngrams (n-gram frequencies)

http://ngrams.googlelabs.com/

DLexDB (German, lexical statistics)

http://www.dlexdb.de/

Malay Lexicon project (Malay; lexical statistics)

http://brm.psychonomic-journals.org/con ... pplemental


Wordnet (American English; Semantics)

http://wordnetweb.princeton.edu/perl/webwn

MRC Psycholinguistic Database (English; Frequency, Imageability, Phonology, Age of Acquisition, ...)

http://www.psy.uwa.edu.au/mrcdatabase/uwa_mrc.htm

Irvine Phonotactic Online Dictionary (English; Phonology)

http://www.iphod.com/


Focused on chronometric variables

English Lexicon project (English; Frequencies, Lexical Decision and Naming Reaction Times)

http://elexicon.wustl.edu/default.asp


Dutch Lexicon project (Dutch; Frequencies, POS and Lexical Decision Reaction Times)

http://crr.ugent.be/dlp

http://crr.ugent.be/idlp/App.html


Chinese Lexicon project (Chinese; character statistics; Lexical Decision Reaction Times)

http://link.springer.com/article/10.375 ... 013-0355-9

Semantic Priming Project (speeded naming and lexical decision data for 1,661 target words, preceded by semantically related and unrelated primes)

http://www.montana.edu/wwwpy/Hutchison/attmem_spp.htm

Form Priming Project: lexical decision data on 420 target words, each of which was presented in 27 different priming conditions

http://www.adelmanlab.org/fpp/


Focused on subjective ratings

Age-of-acquisition (AoA) norms for over 50 thousand English words

http://crr.ugent.be/archives/806

Affective ratings for nearly 14 thousand English words

http://crr.ugent.be/archives/1003

Concreteness ratings for 40 thousand English lemmas

http://crr.ugent.be/archives/1330

http://crr.ugent.be/emlar2015/list%20of ... bases.html
2 x

User avatar
reineke
Black Belt - 3rd Dan
Posts: 3570
Joined: Wed Jan 06, 2016 7:34 pm
Languages: Fox (C4)
Language Log: https://forum.language-learners.org/vie ... =15&t=6979
x 6554

Re: Team Me: Foxing Around

Postby reineke » Sat Jan 15, 2022 6:28 pm

How much input do you need to learn the most frequent 9,000 words?
Paul Nation

Abstract

This study looks at how much input is needed to gain enough repetition of the 1st 9,000 words of English for learning to occur. It uses corpora of various sizes and composition to see how many tokens of input would be needed to gain at least twelve repetitions and to meet most of the words at eight of the nine 1000 word family levels. Corpus sizes of just under 200,000 tokens and 3 million tokens provide an average of at least 12 repetitions at the 2nd 1,000 word level and the 9th 1,000 word level respectively. In terms of novels, this equates to two to twenty-five novels (at 120,000 tokens per novel). Allowing for learning rates of around 1,000 word families a year, these are manageable amounts of input. Freely available Mid-frequency Readers have been created to provide the suitable kind of input needed.
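
The abstract's token-to-novel figures check out with simple arithmetic; a quick sketch (mine, not the paper's):

[code]
TOKENS_PER_NOVEL = 120_000  # the paper's assumed novel length

for level, tokens_needed in [("2nd 1,000", 200_000), ("9th 1,000", 3_000_000)]:
    novels = tokens_needed / TOKENS_PER_NOVEL
    print(f"{level} level: {tokens_needed:,} tokens = {novels:.1f} novels")

# 2nd 1,000 level: 200,000 tokens = 1.7 novels  (the "two novels")
# 9th 1,000 level: 3,000,000 tokens = 25.0 novels
[/code]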

Although it was long assumed that native speakers increase their vocabulary size largely through the strategy of guessing from context rather than from directly being taught vocabulary, it was only relatively recently (Nagy, Herman, & Anderson, 1985) that there was strong experimental evidence that guessing from context was effective and resulted in vocabulary learning.

However, since the early work of West (1955), and later as a result of Krashen's (1985) influential input hypothesis, there has been a strong and growing movement to encourage the use of extensive reading programs for foreign language development (Day & Bamford, 1998; Day & Bamford, 2004; Waring, 2001). However, with one notable exception (Cobb, 2007) there has been no corpus-based study of the feasibility of learning large amounts of foreign language vocabulary through reading. Although reference is made to first language (L1) learning as evidence for the role of reading, there has also been no L1 study which has taken a corpus-based approach to looking at opportunities for vocabulary learning through reading or listening. Is it possible to learn enough vocabulary just through reading? There has been a debate, with Cobb (2007, 2008) on one side and McQuillan and Krashen (2008) on the other, over whether it is possible to learn enough vocabulary solely through reading input.

Cobb argued that, given the difficulty of the material and the time available, learners could not get through enough reading to meet the words at each level, let alone have enough repetitions to learn them. McQuillan and Krashen argued that it is possible, but the major point of disagreement for them involved the difficulty of the reading material. McQuillan and Krashen assumed that learners would be able to read a wide range of texts with relative ease and speed. Cobb argued that the difficulty of the texts, with their heavy load of unfamiliar vocabulary, would make reading very slow and laboured. There were thus two aspects to their disagreement: (a) the heavy vocabulary load of unsimplified text, and (b) the quantity of input needed to repeatedly meet target words. The first part of the present study temporarily puts aside the vocabulary load issue and looks solely at the quantity of input needed. So, at first, this article largely ignores the enormous vocabulary load placed on learners when they read and listen to unsimplified texts. It should be noted, however, that the vocabulary load issue is a very important one that needs to be properly addressed...

The focus of the present study is on the 1st 9,000 words, and because research has shown that the 1st 9,000 word families plus proper nouns provide coverage of over 98% of the running words in a wide range of texts (Nation, 2006), a vocabulary size of 9,000 words or more is a sensible long-term goal for unassisted reading of unsimplified texts. Schmitt and Schmitt (2012) also suggested applying the term mid-frequency vocabulary to the 6,000 word families making up the 4th 1,000 to 9th 1,000 words, because these, along with the 3,000 high frequency words of English and proper nouns, provide 98% coverage of most texts. An essential condition for learning is repetition, and so learners not only need to gradually meet the most frequent 9,000 word families, but they have to meet them often enough to have a chance of learning them.

Repetition and vocabulary learning

There is clearly a relationship between repetition and vocabulary learning (Elley, 1989; Laufer & Rozovski-Roitblat, 2011; Pellicer-Sanchez & Schmitt, 2010; Stahl & Fairbanks, 1986). The amount of repetition of words typically correlates with the chance of them being learned at around .45 (Saragi, Nation, & Meister, 1978; Vidal, 2011) and is the major factor affecting vocabulary learning from reading (Vidal, 2011). Even though repetition is a very important factor, it is still only one of many factors, and as a result there is no particular minimum number of repetitions that ensures learning. For reading, Vidal (2011) found the greatest increase in learning between two and three repetitions. Webb (2007a, 2007b) found at least 10 repetitions were needed to develop something approaching rich knowledge, but Webb used 10 different tests for each word measuring orthography, association, grammatical functions, syntax, and meaning and form, both receptively and productively, thus requiring a fairly high standard of knowledge. Waring and Takaki (2003) found that at least eight repetitions of a word in a graded reader were needed to have a 50% chance of remembering the word three months later. Recognition after three months is a tough measure, and the scores on the immediate posttest were higher. In this study, the moderately safe goal of 12 repetitions is taken as the minimum.

Because one aim of this study is to resolve the McQuillan and Krashen versus Cobb debate, the main focus is on reading. However, because input for incidental learning can be of many kinds, the study also looks at what kind of input provides the best opportunities for meeting the most frequent 9,000 word families. What kind of reading material provides the best opportunities? Is reading material better than spoken input? Is a mixture of input preferable? This study attempts to answer the following research questions:

1. How much input do learners need in order to meet the most frequent 9,000 word families of English enough times to have a chance of learning them?
2. Can learners cope with the amount of input?
3. What kinds of input provide the greatest chance of meeting most of the most frequent 9,000 word families?

Method
The present study uses word family lists created from the British National Corpus and the Corpus of Contemporary American English (COCA) to represent learners’ vocabulary sizes.
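
To make the method concrete, here is a minimal sketch (my own, assuming a plain tokenized corpus; the paper does not describe its actual tooling) of the core question: how many running words must be read before a word family has been met at least 12 times?

[code]
def tokens_until_n_repetitions(token_stream, word_family, n=12):
    """Return how many running-word tokens are consumed before the members
    of `word_family` have together been met at least `n` times, or None if
    the corpus runs out first."""
    family = {w.lower() for w in word_family}
    met = 0
    for i, tok in enumerate(token_stream, start=1):
        if tok.lower() in family:
            met += 1
            if met >= n:
                return i
    return None

# Hypothetical usage with any plain-text corpus file:
# with open("novels.txt") as f:
#     stream = (w for line in f for w in line.split())
#     print(tokens_until_n_repetitions(stream, {"decide", "decides", "decided", "deciding"}))
[/code]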

In order to read without unknown vocabulary becoming too much of a burden, no more than 2% of the running words should be beyond the learners’ knowledge (Hu & Nation, 2000; Schmitt, Jiang, & Grabe, 2011). This means that on average there would be just under 50 words of context around each unfamiliar word, which would allow guessing from context. Native speakers of English appear to increase their vocabulary at the rate of around 1000 word families per year (Biemiller & Boote, 2006; Goulden, Nation, & Read, 1990), with a typical educated native speaker vocabulary size being around 20,000 words. If we expect second language learners to increase their vocabulary at around the same yearly rate, then they will need to increase the amount they read each year, starting for the 2nd 1000 word level at under 200,000 tokens and rising to 3,000,000 tokens a year for the 9th 1000 level. This may be asking too much, as there is no published research to support this figure for learners of English as a foreign language. However, it is an optimistic goal to aim for.
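
The 2% figure translates directly into spacing: one unknown word per 1/0.02 = 50 running words, i.e. just under 50 known words of context around each unknown word (my check, not the paper's):

[code]
max_unknown_rate = 0.02
print(1 / max_unknown_rate)      # 50.0 running words per unknown word
print(1 / max_unknown_rate - 1)  # ~49 known context words around each one
[/code]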

If learners read a total of 3 million tokens, then they would meet the 1st 9,000 words often enough to have a chance of learning them. Spoken sources are of course possible but these provide less intensive input. It takes around two hours to watch a typical 10,000 token movie (a rate of around 83 words per minute, or just over half of a reading rate of 150 words per minute). Nonetheless, an hour to an hour and forty minutes five times a week is possible.
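
Spelling out the rate arithmetic above (my own check):

[code]
movie_tokens, movie_minutes = 10_000, 120
viewing_wpm = movie_tokens / movie_minutes   # ~83 words per minute
reading_wpm = 150                            # the reading rate assumed above

yearly_tokens = 3_000_000                    # 9th 1,000-word-level target
print(yearly_tokens / viewing_wpm / 60)      # ~600 hours/year of viewing
print(yearly_tokens / reading_wpm / 60)      # ~333 hours/year of reading
[/code]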

...Unsimplified text clearly provides poor conditions for reading and incidental vocabulary learning for learners whose vocabulary sizes are less than 9,000 word families.

Supporting reading. Graded readers provide suitable reading material up to vocabulary sizes of 3,000 word families.

However, although 98% is a useful minimum coverage figure, when looking at the simplification of text with one of the goals being vocabulary learning, we need to give priority to the actual number of unknown words because we do not want the guessing and look-up load to be too heavy.

Problems with the calculations.

There are some serious problems with the crude calculations used in this study. First, they assume that the input is comprehensible so that learners can learn from it. We know, however, that learners need a vocabulary size of around 7,000 or 8,000 words before unsimplified written input is likely to be comprehensible without outside support such as a dictionary (Nation, 2006). Similarly, learners need a vocabulary size of around 6,000 words before movies become comprehensible input, although Webb and Rodgers (2009) argue that 3,000 words may be sufficient. However, the more vocabulary known, the better comprehension is likely to be (Schmitt, Jiang, & Grabe, 2011).

Second, if texts were written at the right level, the repetitions would increase slightly, because the words beyond the level would be replaced by known words or target words. The increase in repetitions however would only be small, because the words being replaced would be low frequency words of which many would be one-timers.

Third, there is also the problem of actual repetitions (not averages) of the target words. Average repetitions for each 1000 word level have been used, but there is a wide range of repetitions at each level.

Fourth, there is the problem of the spacing of the repetitions. Many words gather enough repetitions by occurring across a range of novels, not just in one novel. This means that the spacing between repetitions may be quite large, particularly as learners move through the later mid-frequency word levels. If the spacing is too large, memory of the previous meeting may disappear before the word is met again.

A positive view. There is however a positive side to the calculations. First, reading at a later level will provide plenty of repetitions for vocabulary met at the earlier levels, so that the reading at a particular level, say the 5th 1000 word level, will not only have the effect of helping the students learn the target words at that level but will also strengthen knowledge of words at the 4th 1000, 3rd 1000 and other levels. What is missed early on can be picked up later.

Laufer (2003) questions whether learners of English as a foreign language in fact learn much vocabulary through reading. Certainly, the reluctance of many teachers to incorporate extensive reading programs in their language courses supports Laufer's scepticism, and thus the opportunities to learn vocabulary through extensive reading are in many places very underutilised. This present piece of research, however, shows that with the right material such learning is feasible. It is also important to realise that learning vocabulary through extensive reading is just one of a range of opportunities for vocabulary learning, although it can be one of the most effective and enjoyable ones.

Research question 3: What kinds of input allow the most words to be met? So far, we have only looked at a corpus of novels, but there are many kinds of input other than novels. Would other kinds of spoken or written input provide better opportunities for meeting the most frequent 9,000 word families? To answer this question, let us look at a range of different corpora, each exactly 2,000,000 running words long.

A diverse corpus is one that is made up of texts from different genres and topic areas. A major distinction between corpora is the spoken/written distinction (Biber & Conrad, 2009). A homogeneous corpus consists of texts which are similar because they are all spoken or all written, or they make up a particular kind of writing, such as novels or academic texts.

If learners want to meet as many different words as possible, should they stick to similar texts or should they get input from a wide variety of different kinds of texts? Table 7 compares the number of word families from the 1st 9,000 word families of English occurring in fourteen different corpora, each 2 million tokens long. There are problems in such a comparison, because there are several factors that affect the richness of vocabulary in a corpus. The spoken/written distinction is clearly an important one (Biber & Conrad, 2009; Shin, 2007). The diversity of topics covered is another very influential factor, with a diverse corpus having a much richer vocabulary than a more homogeneous corpus (Sutarsyah, Nation, & Kennedy, 1994). The degree of formality of the text is likely to be important too, with less formal text (such as letters or friendly conversations) having a less rich vocabulary. In comparisons of various corpora, it is extremely difficult to control for this diversity of variables. It is also necessary to bear in mind that most novels contain a mixture of narrative and dialogue, and so they should not be seen as truly representing written text in the same way that academic text does.

As Table 7 shows, a mixed written and spoken corpus provides better opportunities to meet most of the 1st 9,000 word families of English. Six of the top eight two-million-word corpora in Table 7 contain a mixture of spoken and written texts, if we consider novels to include some spoken text. The top five all contain a journals sub-corpus. It may be the diversity of topics in the journals corpus that results in the richness of vocabulary. Except for the journals-only corpus, all the other homogeneous corpora are low in the table. Academic text does not provide the highest inclusion of the top 9,000 word families, probably because many technical words are beyond the 9th 1,000 level. Spoken corpora alone provide the lowest inclusion.

Using data not shown in Table 7, a BNC spoken-only corpus, an ANC spoken-only corpus, and an ANC-plus-BNC spoken mix all provided similarly low inclusion. The best advice to learners for vocabulary inclusion might be to read lots of magazines, newspapers and novels, and watch plenty of movies. It is important that this largely positive study of the opportunities for learning through input is not taken as an argument against deliberate vocabulary learning.

https://files.eric.ed.gov/fulltext/EJ1044345.pdf

How much ER?
The short answer
A book a week at their level.

The long answer
It’s very complex.

http://robwaring.org/how-much-er/
3 x

User avatar
reineke
Black Belt - 3rd Dan
Posts: 3570
Joined: Wed Jan 06, 2016 7:34 pm
Languages: Fox (C4)
Language Log: https://forum.language-learners.org/vie ... =15&t=6979
x 6554

Re: Team Me: Foxing Around

Postby reineke » Sat Jan 15, 2022 11:57 pm

Lawyer&Mom wrote:That was fun!

Percentile per level:
2k: 97
3k: 83
5k: 97
10k: 57

Turns out the benefits in French of a wide English vocabulary are unevenly distributed? It seems fair that I only know 57% of the 10th thousand words, but curious that my knowledge of the 3rd thousand words was so much lower than the 5th thousand words. How did you do?


2k: 97
3k: 100
5k: 100
10k: 97

I took it twice. The first time I screwed up a couple of answers somewhere and I skipped two answers instead of skipping just one. I noticed that I was confusing the boxes - I would put 1 in the first box from the top instead of matching it to the answer. That's just me being distracted, but I would still invert the columns. I would have liked to see more questions, although I can see how volunteer test takers could easily give up. I didn't know several of the distractor words. I looked them up and I'd say that most of them can be described as useful. In one case I knew all the distractor words and could have guessed correctly, but I skipped that question.
"Only" 57%. That's plenty. Nearly 30 percent of English words (in an 80,000-word dictionary) are of French origin. You still need to understand the target pairs and connect the meanings on both sides.
3 x

User avatar
reineke
Black Belt - 3rd Dan
Posts: 3570
Joined: Wed Jan 06, 2016 7:34 pm
Languages: Fox (C4)
Language Log: https://forum.language-learners.org/vie ... =15&t=6979
x 6554

Re: Team Me: Foxing Around

Postby reineke » Sun Jan 16, 2022 3:45 am

Can L1 children's literature be used in the English language classroom? High frequency words in writing for children

A challenge in reading research, and particularly extensive reading research, is how to manage the transition from the top of graded reading schemes to authentic texts which may be separated from each other by up to 5,000 word families. While texts written for native-speaker children have been recommended at times, recent research has shown that the lexical load of these texts was of similar difficulty to that of texts written for adults. In this paper we investigate whether it is possible to identify a specialist high frequency list in writing for children, and the impact of any such list on readability for language learners with a 2,000-word family vocabulary size. We found a list of 245 word families provided almost 3.4% coverage for such learners, thus making the use of L1 children’s literature possible in the English language, and especially the English as a foreign language (EFL), classroom.

Lexical Coverage and Comprehension

The importance of such lists for language learning and teaching is that they provide information about lexical coverage (i.e., the percentage of known words in a text) and, indirectly, comprehension. Nation (2006) estimates the vocabulary needed for adequate (98%) coverage as 8,000 to 9,000 word families for comprehension of written text and 6,000 to 7,000 word families for spoken text (although recent research has found 95% coverage to be sufficient for comprehension of spoken text (van Zeeland & Schmitt, 2013), which would reduce the number of word families needed). It is worth noting that these estimates are considerably higher than had earlier been thought. Previously, 95% text coverage had been estimated as sufficient for independent reading comprehension (Laufer, 1989), and Hirsh and Nation (1992) estimated that a vocabulary size of 5,000 word families was necessary for reading short novels. It should also be noted, first, that current estimates of vocabulary size may yet be revised (Schmitt, Cobb, Horst, & Schmitt, 2017) and, second, that judgements about vocabulary size and coverage may depend on what is regarded as adequate comprehension. For instance, Laufer and Ravenhorst-Kalovski (2010) suggest 95% coverage is sufficient for minimally acceptable comprehension (with 98% as optimal), and that "the 95% coverage can be achieved by 5,000 word families with proper nouns".
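
Lexical coverage is simple to compute once a known-word list is fixed; a minimal sketch (mine, and note that the studies cited count word families, not raw word forms as here):

[code]
def lexical_coverage(text_tokens, known_words):
    """Percentage of running words in a text that the learner knows."""
    known = sum(1 for t in text_tokens if t.lower() in known_words)
    return 100 * known / len(text_tokens)

# Hypothetical usage:
# tokens = open("story.txt", encoding="utf-8").read().split()
# print(f"{lexical_coverage(tokens, known_word_set):.1f}%")  # 95-98% is the comprehension zone discussed here
[/code]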

Approaches to Vocabulary Learning

If vocabulary is a key to comprehension – it has been described as the single most important predictor of success in reading (Laufer & Sim, 1985) – then a key question in language learning and teaching must be how to develop a vocabulary of sufficient size to allow successful, independent reading.

Indirect, or incidental, learning tends to focus on reading, and extensive reading in particular. However, the vocabulary gains through extensive reading have been shown to be fragile (Waring & Takaki, 2003), and there has been debate as to the extent to which extensive reading alone can meet learners' vocabulary learning needs.

There has also been advocacy of narrow reading, following one topic over several texts, as a means of developing needed vocabulary (Gardner, 2008; Schmitt & Carter, 2000), which has also been translated into specific pedagogical activities (Watson, 2004).

The Importance of Repeated Exposure

In any vocabulary learning activity, repetition of the target item is essential. A single encounter with a new word is unlikely to lead to learning its form-meaning connection (Webb, 2007). Research investigating the effect of different levels of repetition has found that a minimum of ten encounters is needed for such learning to occur (Webb, 2007).

The Transition from Graded to Authentic Reading

There is an issue, however, with developing a vocabulary of sufficient size (i.e., 8,000 or 9,000 word families for reading comprehension) that cannot easily be addressed by extensive reading of graded readers, which typically have an upper range in the vicinity of 3,000 headwords, or by direct teaching, given the constraints of time. The issue, in other words, is how to learn the many thousands of word families that remain unknown once a learner can read an upper-level graded reader successfully and independently. In terms of vocabulary development through extensive reading, one suggestion has been that learners can read authentic children's literature, that is, texts written for young native speakers (Day & Bamford, 1998; Gardner, 2008; Mikulecky, 2009; Takase, 2009), and it is the case that such materials have been used successfully, including in the classic Fiji book flood (Elley & Mangubhai, 1981, 1983) and in languages other than English (Tabata-Sandom & Macalister, 2009). In a recent corpus-based study, however, Webb and Macalister (2013) concluded that the lexical load of texts written for native-speaker children was of similar difficulty to that of texts written for adults, and that neither was as well suited to extensive reading for language learners as graded readers. Webb and Macalister assumed a vocabulary of the 2,000 most frequent words in their study.

A more optimistic view of the transition from graded to authentic reading is offered by Uden, Schmitt, and Schmitt (2014). The optimism is partly based on an estimated smaller word family gap between the two (but still 3,000–4,000 word families; Uden et al., p. 18), and partly based on the results of a small-scale study in which three of four participants “made the jump to the ungraded novels without sacrificing much comprehension, reading speed, or satisfaction” (Uden et al., p. 19). However, these participants were highly motivated readers and their experience is not generalizable to less motivated readers, as the authors of the study themselves noted (Uden et al., p. 20). Thus, given the advocacy of authentic children’s literature, and the need to bridge the vocabulary gap from the upper limits of graded readers to authentic texts, it is worth considering whether a specialized vocabulary for this genre exists and, if it does, whether knowledge of those words would improve the readability of such texts for language learners. In his study, Jenkins (1993) found 216 frequently occurring word families beyond the 1,000 most frequent words of the GSL, and suggested that “as these word families are likely to occur in children’s literature there will be considerable advantage gained by making sure they are known” (p. 108). Macalister (1999) reached a similar conclusion after an analysis of writing for more advanced young readers.

Given the focus of this study, only imaginative prose texts were included in the corpus. This decision was informed by the results of an earlier investigation of a small number of randomly selected imaginative and informative prose texts that suggested that imaginative prose passages would be suitable for extensive reading by second or foreign language learners at least in part because “the unknown words in imaginative prose are more likely to be repeated elsewhere within the corpus … than is the case for unknown words in informative prose passages” (Macalister, 1999, p. 80). This was supported by Gardner (2004, p. 24) who found narrative texts better for incidental vocabulary learning than expository. An informative prose or expository text may include numerous tokens of types specific to that particular content area (cf. findings about the effect of theme on vocabulary repetition in tightly themed expository non-fiction in Gardner, 2008), but non-specialized vocabulary common to story-telling is repeated in multiple imaginative prose texts.

The focus on imaginative prose resulted in a corpus drawn from four years of publication of the School Journal, comprising 174 texts totaling 128,540 tokens. This compares favorably with some other corpora used to investigate writing for children.

The first question that this study set out to investigate was whether it was possible to identify a list of specialist high frequency vocabulary in writing for children and, perhaps unsurprisingly, a corpus of School Journal imaginative prose texts did indeed yield such a list. The list is presented in Appendix B. Furthermore, the 245 word families can be classified into distinct categories.

The second question driving this study concerned the impact of the list on a second language learner's ability to read authentic children's literature. Offering almost 3.4% coverage, the 245 word families offer very substantial benefits in terms of bridging the gap between graded and authentic reading materials, a point that is returned to at the end of this section. This can be seen by considering that similar coverage (3.56%) would be gained by learning the 5,000 word families from the three to seven thousand word levels.
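
Put another way, the list packs roughly twenty times more coverage per word family learned than the mid-frequency bands it substitutes for (my arithmetic on the figures above):

[code]
chhf_per_family = 3.4 / 245      # CH HF list: ~0.0139% coverage per family
band_per_family = 3.56 / 5000    # 3,000-7,000 bands: ~0.0007% per family
print(chhf_per_family / band_per_family)  # ~19.5x more coverage per family
[/code]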

The study also asked whether this high frequency vocabulary was unique to children's literature and, as shown in Table 3, the answer was in the affirmative, reinforced by the similarities found with Jenkins (1993). Thus, for language learners wanting to read beyond the upper level of graded readers, the CH HF wordlist offers a clear pathway to successful reading of children's literature. Indeed, the big difference in coverage between imaginative prose and graded readers (Table 3) shows that these word families will not be learned through reading graded readers, and so learning the word list is essential. Furthermore, learning words from the list may reflect typical L1 vocabulary development, given that they are words likely to be learned early by L1 speakers. All the same, it is the case that 98% coverage is only reached at the 8,000-word level (Table 2), even if a mere 0.01% prevents 98% coverage at the 7,000-word level.

Returning, then, to the need to bridge the vocabulary gap from the upper limits of graded readers to authentic texts, and the contribution that knowledge of a specialized vocabulary can make to improving the readability of such texts for language learners, the CH HF list can clearly make a significant contribution. With knowledge of the CH HF list, a learner with a 2,000-word vocabulary is close to 95% coverage, and comfortably meets it if the learner is at the 3,000-word level. At the 3,000-word level, then, a learner with knowledge of the CH HF list is likely to be somewhere between Laufer and Ravenhorst-Kalovski's (2010) minimally acceptable and optimal comprehension levels. Remembering that the upper levels of graded reader schemes are typically around the 3,000-headword level, this suggests that authentic children's literature may be suitable reading material for such language learners.

Pedagogical Implications

The fact that a 245 word family list provides greater coverage than any lexical frequency band beyond the 2,000-word level, and that it can reduce the vocabulary size needed to achieve 95% coverage in writing for children by one 1,000-word frequency band, suggests that the list deserves attention from language teachers. The amount of attention is likely to be affected by the language learning context, whether it is ESL or EFL. It seems intuitively likely that in an ESL setting learners would already be exposed to some of the word types contained in the high frequency list, such as language relating to school. It is in EFL contexts, therefore, that the CH HF list is likely to be most useful.

Conclusion

The aim of this paper has been to investigate whether it is possible to make the vocabulary load of reading authentic writing for children manageable for language learners with a 2,000-word vocabulary size. Examination of a corpus of imaginative prose for children has identified a relatively small high frequency list of 245 word families specific to this genre, some of which are likely to be familiar to learners through their inclusion in course books and from being encountered in the learners' immediate context. Depending on the way in which "Not in Any List" word families are regarded, this specialist list of high frequency vocabulary in writing for children has the potential to reduce the vocabulary size needed for successful reading of this genre by at least one 1,000-word-family frequency band. It is, therefore, likely to make the vocabulary load of reading authentic writing for children manageable for language learners with a 2,000-word vocabulary size, even more so if they are able to read at the upper levels of graded readers successfully. Even more striking, 95% coverage is achieved at the 3,000-word-family level, much lower than previous estimates. As a result, the CH HF list has the potential to assist learners' transition from the upper levels of graded readers to reading authentic texts, a transition that has challenged reading researchers for a considerable time, particularly in the extensive reading field. Given this potential, this specialist list deserves attention in the English language learning classroom.

Writing for Children High Frequency (CH HF) Wordlist

Adjectives: angry awesome crazy faint fierce fluffy (word family: fluff) gentle lean neat nervous pale silent (word family: silence) smooth sore sticky thirst wild

Animals & Plants: ant bull cage crab creature dragon flea frog goat holly insect jasmine kitten lamb leaf lion mouse paw pet pine pup rat roost seed spider web

Body: ache blonde breath cheek stomach throat wrist

Clothing: sleeve greatcoat helmet jersey jumper shorts sweatshirt togs towel

Colours: silver purple

Family: cousin grandad nanny papa

Food: banana bubblegum coconut cookie honey jelly lemonade lolly mushroom noodle spice watercress

House: basket blanket broom bucket cushion dishwasher doorway jar ladder lawn lid matchbox oven pillow saucer

Roles: burglar (word family: burgle) captain emperor pharaoh pilot princess rabbi soldier vet

School: bat bench cardboard cricket gang glue lunchtime notebook playground skateboard soccer

Story: alien ghost giant magic

Verbs: bounce burst carve chew chirp clap crash crawl creep crouch curl dart dive drift drip flash flick fold frown gasp giggle glance glare glitter glow gobble grin groan gulp hiccup hiss hop hug hum illustrate kiss leap lick moan mow mumble mutter nod paddle pause peer poke protest puff rip roar scatter scoop scramble scratch scream shine shiver shove shrug sigh snap sneak sniff spin spray stare steal strap stroke suck surf swallow sweat sweep swing thump tuck wag wail wander whisper wriggle yell zoom

Other: balloon bandage beach bead boomerang brand bubble bunch bush concert ditch gum hammer heap hedge hippy hut junk lake liquid ms mud olympic paddock pedal planet puddle reward rope rune shadow shelter spaceship storm string taxi tent tide torch trail trailer trap wart yuk

Appendix A

Word Families beyond the GSL 2,000 High Frequency List Identified by Jenkins (1993)

BNC 2,000 words that are present in the School Journal lists but were not among the 2,000 high frequency words in the GSL:
bang birthday biscuit chase chip chocolate chop enormous foot horrible icecream jacket kid mum naughty ok plate pop (v) sack scared silly tiny trousers vegetable

Words identified by Jenkins with fewer than 10 tokens in the School Journal lists:
<10: bark canoe claw delicious fiddle flap fox sausage terror trot tug witch wolf
<1: crocodile fairy fuss hare mosquito pumpkin scrap

https://scholarspace.manoa.hawaii.edu/b ... lister.pdf
3 x

User avatar
luke
Brown Belt
Posts: 1243
Joined: Fri Aug 07, 2015 9:09 pm
Languages: English (N). Spanish (intermediate), Esperanto (B1), French (intermediate but rusting)
Language Log: https://forum.language-learners.org/vie ... 15&t=16948
x 3631

Re: Team Me: Foxing Around

Postby luke » Sun Jan 16, 2022 11:19 am

reineke wrote:In this paper we investigate whether it is possible to identify a specialist high frequency list in writing for children, and the impact of any such list on readability for language learners with a 2,000-word family vocabulary size. We found a list of 245 word families provided almost 3.4% coverage for such learners, thus making the use of L1 children’s literature possible in the English language, and especially the English as a foreign language (EFL), classroom.

That is interesting research. It makes one look forward to the evolution of computational linguistic tools.

Some Tools:
Various corpora (corpuses) that can be selected by categories, authors, domains (speech, written, informal, formal, etc.).
Slider tools one can use to say, "I want to read X and my vocabulary range is Y-Z, so give me the most frequent words in that range and limit the list to N word-families" (see the sketch after this list).
Flexible translation tools that generate "word pairs" and optionally add actual contextual usage, e.g., dog = собака; he's snoring like a dog = он храпит здесь как собака.
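
A rough sketch of that slider/filter idea in Python (my own; the file names, the corpus_frequency_list variable, and the rank thresholds are assumptions, not an existing tool):

[code]
from collections import Counter

def words_to_prestudy(tokens, global_rank, lo=2000, hi=8000, n=200):
    """Top-n most frequent words in `tokens` whose global frequency rank
    falls in [lo, hi): outside a lo-word vocabulary but within the target range."""
    freq = Counter(t.lower() for t in tokens)
    in_range = {w: c for w, c in freq.items()
                if lo <= global_rank.get(w, float("inf")) < hi}
    return Counter(in_range).most_common(n)

# Hypothetical usage for the Crime and Punishment example below:
# tokens = open("crime_and_punishment.txt", encoding="utf-8").read().split()
# rank = {w: i for i, (w, _) in enumerate(corpus_frequency_list, start=1)}
# for word, count in words_to_prestudy(tokens, rank, lo=2000, hi=8000, n=200):
#     print(word, count)
[/code]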

Open source would be ideal. How many nerdcicles (cool nerds - or is that redundant?) have personal itches to scratch and would like to collaborate, or would benefit from collaboration? They could be interested in the computation problem, or in the "real world application", which is broader than language learning.

Example applications:
"I want to read Crime and Punishment. Give me 200 most frequent words in a vocabulary range of 2k-8k".
"I want to read Dostoevsky. Give me 500 most frequent words in a vocabulary range of 4k-14k".
"I want to read 19th century Russian literature. Give me 1000 most frequent words in a vocabulary range of 6k-36k".
4 x
: 124 / 124 Cien años de soledad 20x
: 5479 / 5500 5500 pages - Reading
: 51 / 55 FSI Basic Spanish 3x
: 309 / 506 Camino a Macondo

Lawyer&Mom
Blue Belt
Posts: 988
Joined: Sun Mar 04, 2018 6:08 am
Languages: English (N), German (B2), French (B1)
Language Log: https://forum.language-learners.org/vie ... =15&t=7786
x 3783

Re: Team Me: Foxing Around

Postby Lawyer&Mom » Sun Jan 16, 2022 6:02 pm

reineke wrote:
Lawyer&Mom wrote:That was fun!

Percentile per level:
2k: 97
3k: 83
5k: 97
10k: 57

Turns out the benefits in French of a wide English vocabulary are unevenly distributed? It seems fair that I only know 57% of the 10th thousand words, but curious that my knowledge of the 3rd thousand words was so much lower than the 5th thousand words. How did you do?


2k: 97
3k: 100
5k: 100
10k: 97

I took it twice. The first time I screwed up a couple of answers somewhere and I skipped two answers instead of skipping just one. I noticed that I was confusing the boxes - I would put 1 in the first box from the top instead of matching it to the answer. That's just me being distracted, but I would still invert the columns. I would have liked to see more questions, although I can see how volunteer test takers could easily give up. I didn't know several of the distractor words. I looked them up and I'd say that most of them can be described as useful. In one case I knew all the distractor words and could have guessed correctly, but I skipped that question.
"Only" 57%. That's plenty. Nearly 30 percent of English words (in an 80,000-word dictionary) are of French origin. You still need to understand the target pairs and connect the meanings on both sides.


Very impressive!

I feel like I know a lot more French vocabulary without even trying than that 30% statistic would suggest. I'm reading Le Rouge et le Noir right now, and as a former German lit major I'm just in awe of the vocabulary discount with French. It's so much easier! So much! I've always known this, but reading classic lit is just rubbing it in right now. I think the problem with the 30% statistic is that it treats French separately from Latin. Really, English is about 60% Romance vocabulary, 30% Germanic, 10% other. Learning French as a native English speaker is cheating, and that's why I love it.
3 x
Grammaire progressive du français -
niveau debutant
: 60 / 60

Grammaire progressive du francais -
intermédiaire
: 25 / 52

Pimsleur French 1-5
: 3 / 5

