Selecting extensive reading materials

General discussion about learning languages
reineke
Black Belt - 3rd Dan
Posts: 3570
Joined: Wed Jan 06, 2016 7:34 pm
Languages: Fox (C4)
Language Log: https://forum.language-learners.org/vie ... =15&t=6979
x 6554

Re: Selecting extensive reading materials

Postby reineke » Sat Nov 04, 2017 6:32 pm

HOW MUCH READING?
The short answer

A book a week at their level.

The long answer

It’s very complex.

https://www.er-central.com/contributors ... h-reading/

How much input do you need to learn the most frequent 9,000 words?
Paul Nation

Abstract
This study looks at how much input is needed to gain enough repetition of the 1st 9,000
words of English for learning to occur. It uses corpora of various sizes and composition
to see how many tokens of input would be needed to gain at least twelve repetitions and
to meet most of the words at eight of the nine 1000 word family levels. Corpus sizes of
just under 200,000 tokens and 3 million tokens provide an average of at least 12
repetitions at the 2nd 1,000 word level and the 9th 1,000 word level respectively. In terms
of novels, this equates to two to twenty-five novels (at 120,000 tokens per novel).
Allowing for learning rates of around 1,000 word families a year, these are manageable
amounts of input. Freely available Mid-frequency Readers have been created to provide
the suitable kind of input needed.

http://nflrc.lll.hawaii.edu/rfl/October ... nation.pdf
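The "two to twenty-five novels" figure in the abstract is plain division. A minimal sketch, assuming the round numbers quoted above (200,000 and 3 million tokens, 120,000 tokens per novel); the function name and the rounding up to whole novels are illustrative choices, not something taken from the paper:

```python
import math

# Assumption: the average novel length quoted in the abstract.
TOKENS_PER_NOVEL = 120_000


def novels_needed(corpus_tokens, tokens_per_novel=TOKENS_PER_NOVEL):
    """Round a token count up to a whole number of novels."""
    return math.ceil(corpus_tokens / tokens_per_novel)


# Input quoted for the 2nd and 9th 1,000-word levels respectively.
print(novels_needed(200_000))    # 2
print(novels_needed(3_000_000))  # 25
```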

Mid-frequency readers
This article describes a new free extensive reading resource for learning the
mid-frequency words of English and for reading well known texts with minor
vocabulary adaptation. A gap exists between the end of graded readers
at around 3,000 word families and the vocabulary size needed to read
unsimplified texts at around 8,000 word families. Mid-frequency readers are
designed to fill this gap.

Table 1: Vocabulary sizes needed to get 98% coverage (including proper nouns) of various kinds of texts (Nation, 2006)

Novels: 9,000 word families
Newspapers: 8,000 word families
Spoken English: 7,000 word families
Children’s movies: 6,000 word families

http://eprints.lancs.ac.uk/71637/1/868_4733_5_PB.pdf
Last edited by reineke on Sat Nov 25, 2017 4:12 pm, edited 1 time in total.
3 x

Cavesa
Black Belt - 4th Dan
Posts: 4960
Joined: Mon Jul 20, 2015 9:46 am
Languages: Czech (N), French (C2) English (C1), Italian (C1), Spanish, German (C1)
x 17566

Re: Selecting extensive reading materials

Postby Cavesa » Sat Nov 04, 2017 6:45 pm

LinguaPony wrote:For years I just grabbed any detective novel in English I could get my hands on and sometimes dabbled in other genres. When I started I didn't know anything about the 98% rule, and I'm glad I didn't. The first time I heard about it was a couple of months ago, while listening to one of the Polyglot Conference presentations on YouTube, and my first thought was, "If I had known about it back then, I might never have learned English at all". As it is...

Now, with Italian, I do basically the same thing. Read what I can get, sometimes amuse myself with bilingual reading, but never bother about the 98% thing and never touch any "graded readers" stuff. Give me the real thing, any time!


I wholeheartedly agree and would like to quote a popular Czech saying:
Nikdy neříkej, že něco nejde, protože vždycky se najde nějaký blbec, který neví, že to nejde, a udělá to.

Never say something is impossible. Every time, there is a moron who doesn't know it's impossible, so he goes and does it.
This time, I am definitely not ashamed to be the proverbial moron. :-D

My first book in English, HP and the Order of the Phoenix, was definitely way below 98%. I was somewhere around A1/A2 and hating English. Who knows what the % was. But I was very motivated, so I improved really really fast. It was so much better to quickly improve thanks to determination, focus, and absorbing everything, than to wait half a year for a translation :-D

I am more and more tempted to sum the whole thread up like "Don't overcomplicate it. Get something you like and find accessible."
12 x

Ольга
Green Belt
Posts: 261
Joined: Sun Sep 13, 2015 10:42 am
Languages: English, French, German, Greek, Portuguese
Language Log: https://forum.language-learners.org/vie ... =15&t=6206
x 196

Re: Selecting extensive reading materials

Postby Ольга » Sat Nov 04, 2017 6:59 pm

reineke wrote:HOW MUCH READING?
The short answer

A book a week at their level.

The long answer

It’s very complex.

https://www.er-central.com/contributors ... h-reading/


'A book a week', I like that! I should obviously try this with my English! :)
3 x
Output Challenge 2018
Hours of Recorded Speech: 0 / 50
Words: 4732 / 50000

LinguaPony
Orange Belt
Posts: 141
Joined: Mon Oct 23, 2017 7:50 am
Location: Saratov, Russia
Languages: Russian (N), English (Proficient), Italian (Intermediate), M. Chinese (Beginner), German (Just started), Yiddish (half-cooked A1, long since forgotten, but now queued for revival)
Language Log: https://forum.language-learners.org/vie ... =15&t=7160
x 309

Re: Selecting extensive reading materials

Postby LinguaPony » Sat Nov 04, 2017 7:01 pm

Cavesa wrote:Never say something is impossible. Every time, there is a moron who doesn't know it's impossible, so he goes and does it.


I just love it!
4 x

reineke
Black Belt - 3rd Dan
Posts: 3570
Joined: Wed Jan 06, 2016 7:34 pm
Languages: Fox (C4)
Language Log: https://forum.language-learners.org/vie ... =15&t=6979
x 6554

Re: Selecting extensive reading materials

Postby reineke » Sat Nov 11, 2017 3:40 pm

The Choice of Reading Matter. Professor S. H. Goodnight. University of Wisconsin. Bulletin of the Wisconsin Association of Modern Language Teachers, January, 1916.

"In this most timely and interesting article Dr. Goodnight discusses first the general principles which should govern the choice of reading matter in modern language courses. These he sums up in the sentence, “See to it that the reading matter bears upon the life and character of the foreign people and that it is adapted in degree of difficulty to the ability and stage of maturity of the class.” The types of reading matter for a four-year course in German are discussed under these heads: short stories, novels, lyrics, dramas, historical and descriptive prose. There follows a discussion of practical problems, the question of fiction or nonfiction, modern or classical material, prose or poetry, reader or separate text...

One of the foremost institutions in this movement is the Musterschule at Frankfort-on-the-Main. In the program of that school the object of the course in modern languages is defined:

1. Intelligent reading of the whole literary language. 2. Intelligent understanding of everyday speech. 3. Active use of the language up to two or three thousand words.

There must be a sharp distinction between reading and translation. Learning to read a language is a process of acquiring new symbols for objects and ideas. Translation is a process of comparing two sets of symbols. Class translation is dangerous, for it puts stress on the English, when it should be on the foreign language. Some simple reading should form part of every lesson from the beginning. The recitation should begin with brief questions on required outside work, the questions to be in the foreign language and as far as possible in the words of the text. Select for discussion passages with words and expressions that will be valuable for the pupil to know well. Vary the work in reading. One day assign two or three pages for intensive work in vocabulary and construction, another day five or six pages for reproduction. In the third year of the high school course the class should be able to read with enjoyment and appreciation from the point of view of thought and form, not of vocabulary and grammatical construction. G. L. F."

Modern Language Forum, Volumes 2-3
3 x

reineke
Black Belt - 3rd Dan
Posts: 3570
Joined: Wed Jan 06, 2016 7:34 pm
Languages: Fox (C4)
Language Log: https://forum.language-learners.org/vie ... =15&t=6979
x 6554

Re: Selecting extensive reading materials

Postby reineke » Mon Dec 25, 2017 4:44 pm

Vocabulary Range and Text Coverage:
Insights from the Forthcoming
Routledge Frequency Dictionary of Spanish

Mark Davies
Brigham Young University

1. Introduction

"An important question for natural language researchers, general linguists, and even teachers and students is how much text coverage can be achieved with a certain number of lexemes in a given language. In studies such as Nation (2000), we find that the top 1000 lexemes in English account for about 80% of all tokens in a given text. The second block of 1000 lexemes provides coverage for approximately 5% additional coverage of tokens, and this drops to about 3-4% for the third set of 1000 lexemes. These data are important for language learners (and teachers), as they attempt to address the issue of core vocabulary, and how much time and effort should be spent in extending vocabulary beyond a certain level.

While studies of vocabulary coverage have been carried out for other languages (see, for example, Jones 2003), none has been carried out for Spanish. Most likely, the reason for this is that until very recently, we did not have the raw materials upon which to base such a study. In order to provide an accurate model of the Spanish lexicon, we must first have a representative corpus, including texts and transcripts of conversation from a wide variety of genres and registers. These texts must then be accurately annotated for part of speech and lemma. The present study is an overview of how this process has been carried out in the creation of the Frequency Dictionary of Spanish, which will be published by Routledge in 2005.

2. Previous studies of vocabulary frequency in Spanish

There have already been a number of frequency dictionaries of Spanish, and one might suppose that the data from one or several of these would be sufficient to study text coverage with a given level of vocabulary in Spanish. Unfortunately, this is not the case. The most accurate frequency study of Spanish to date is probably Chang-Rodríguez (1964). While it was a notable achievement for its time, it has become somewhat outdated since that time. The corpus on which the frequency data is based is only one million words, and all of it comes from strictly literary works, and solely from peninsular texts.

Because there is no spoken component to the corpus, the vocabulary is highly skewed. For example, the word poeta is word number 309 in the frequency list, with other cases like lector (453), gloria (566), héroe (601), marqués (653), dama (696), and príncipe (737). This skewing is not limited just to nouns, but also includes what would in a normal corpus be much lower frequency verbs, like acudir (number 498 in the complete frequency list), figurar (503), and juzgar (560) and adjectives like décimo (240) and bello (612). Again, the skewing is due more to the inadequate corpus on which the frequency list is based rather than being a product of the general methodology, and it is simply a function of the difficulty in creating large, representative corpora forty years ago. Such is also the case with the Brown corpus of American English, which -- like the corpus used for Chang-Rodríguez -- was based on just one million words from strictly written texts -- and yet which nonetheless remained the standard corpus of English for more than thirty years.

In addition to Juilland and Chang-Rodríguez (1964), there have been a number of other frequency dictionaries and lists for Spanish (Buchanan 1927, Eaton 1940, Rodríguez Bou 1952, García Hoz 1953, Alameda and Cuetos 1995, Sebastián, Carreiras, and Cuetos 2000), but all of these suffer from significant limitations as well. Most importantly, all of the frequency dictionaries are based exclusively on written Spanish, and contain no data from the spoken register. This leads to the type of unrepresentative vocabulary shown above. In addition, five of the dictionaries (Buchanan 1927, Eaton 1940, Rodríguez Bou 1952, García Hoz 1953, Juilland and Chang-Rodríguez 1964) are now quite outdated and are based on texts from the 1950s or earlier. In addition to being based strictly on written Spanish, the two dictionaries that have been produced in the last ten years both suffer from other important limitations. Alameda and Cuetos (1995) only lists exact forms – rather than lemma – and very few of the written texts on which it is based are from outside of Spain. Finally, Sebastián, Carreiras, and Cuetos (2000) exists only in electronic form and is extremely hard to acquire, since it can only be purchased (at least at the present time) directly from the University of Barcelona.

3. Corpus and methodology

The goal, then, has been to create a representative corpus of Spanish, annotate it for part of speech and lemma, and then use this data to examine lexical coverage with varying levels of lexemes...

4 Vocabulary coverage

With the frequency data from the annotated corpus, we were then able to extract lists of the 6000 most frequent lexemes, which will form the basis of the Routledge Frequency Dictionary of Spanish. However, we can also use this same data to examine the issue of text coverage with differing levels of lexemes, which is the focus of this paper. In the following table -- which represents the main conclusions of this study -- we see the percent coverage of all tokens in three different registers (oral, fiction, and non-fiction) at three different levels of lexemes -- top 1000 words, top 2000 and top 3000.

Table 3. Percent coverage of tokens by groups of types/lemma

table 3.png


As the data indicate, a limited vocabulary of 1000 words would allow language learners to recognize between 75-80% of all lexemes in written Spanish, and about 88% of all lexemes in spoken Spanish (which is due to the higher repetition of basic words in the spoken register). Subsequent extensions of the base vocabulary have increasingly marginal importance. By doubling the vocabulary list to 2000 words, we account for only about 5-8% more words in a given text, and the third thousand words in the list increases this only about 2-4% more. There clearly is a law of “diminishing returns” in terms of vocabulary learning.
The [table 4] indicates how the data from Spanish compares to that of Nation (2000) for English and Jones (2003) for German.

table 4.png


The data from Spanish and English are roughly comparable, but there is an important difference in the way in which the data was obtained. In Nation (2000), the words are grouped by what he calls “word families”, so that [courage, discouragement, encourage] would all be grouped under the headword [COURAGE], and [paint, painted, painter, painting] would all be grouped under the headword [PAINT]. In our study, however, we used the traditional lemma approach, in which pintar, pintura, pintor, and pintoresco would all be assigned to different lemma, and [pintamos, pinto, and pintarás] would all be assigned to the lemma [PINTAR]. Because we separate the nominal, verbal, and adjectival uses, we might expect that the same number of headwords would lead to less text coverage than in English. The fact that this does not happen, however, is probably due to the fact that English has a larger lexical stock than Spanish, due to the influence of native Anglo-Saxon and imported Franco-Norman and Latinate words (e.g. real, royal, regal). The fact that the same amount of lexemes in German leads to lower textual coverage is somewhat more difficult to explain. It may be due to the still-incomplete state of the German tagger (Jones, p.c.). Or again, it may be due to a generally larger lexical stock in German than in Spanish, though this is much more debatable."

10. Conclusion

"Hopefully the preceding discussion provides some useful insight into the issue of vocabulary range and text coverage, and the way in which the extracted data can be used to create a more useful frequency dictionary of Spanish. From the point of view of a language learner, the important point is that text coverage clearly obeys the law of diminishing returns. With about 4000 words, a language learner would be able to recognize more than 90% of the words in a typical native speaker conversation. If s/he learns two thousand more words, however, this will increase coverage by only about 3-4%. We have also seen that the degree of coverage is a function of register and part of speech, and have provided detailed data to support this view. We have also considered the role of vocabulary range, and how factors such as register affect this as well. Finally, we have briefly considered how this methodology can be interfaced with technology to produce the final output -- an accurate frequency listing of words. Hopefully, this information will be of use not just to linguists and natural language researchers, but to teachers and students alike, who are looking for the most productive way to enhance the acquisition of Spanish vocabulary."

http://www.lingref.com/cpp/hls/7/paper1091.pdf
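For anyone who wants to see the "diminishing returns" pattern on their own texts, here is a rough sketch of a coverage calculation. It assumes the text has already been tokenized and lemmatized into a plain list of strings (the paper worked from a tagged corpus; this toy version does no tagging), and `coverage_by_band` is a hypothetical helper written for illustration, not anything from the paper:

```python
from collections import Counter


def coverage_by_band(tokens, band_size=1000, bands=3):
    """Cumulative share of tokens covered by the top band_size, 2*band_size, ... words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Frequencies sorted from most to least common.
    ranked = [n for _, n in counts.most_common()]
    return [sum(ranked[: b * band_size]) / total for b in range(1, bands + 1)]


# Toy data only: a real run would use a large corpus and band_size=1000.
toy = "the cat sat on the mat and the dog sat on the rug".split()
for band, share in enumerate(coverage_by_band(toy, band_size=2, bands=3), start=1):
    print(f"top {band * 2} words: {share:.0%} of tokens")
```

Even on the toy sentence, each extra band of words adds less coverage than the one before it, which is the same shape the paper reports for the first, second and third thousand lexemes.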
1 x

Serpent
Black Belt - 3rd Dan
Posts: 3657
Joined: Sat Jul 18, 2015 10:54 am
Location: Moskova
Languages: heritage
Russian (native); Belarusian, Polish

fluent or close: Finnish (certified C1), English; Portuguese, Spanish, German, Italian
learning: Croatian+, Ukrainian; Romanian, Galician; Danish, Swedish; Estonian
exploring: Latin, Karelian, Catalan, Dutch, Czech, Latvian
x 5179

Re: I choose extensive reading with listening in mind

Postby Serpent » Tue Dec 26, 2017 4:42 pm

coldrainwater wrote:I speak [too] slowly and therefore so does that voice subvocalizing as I read. I figured extensive reading doesn't have to start so slowly and I do not have to make the compromise of intentionally squashing subvocalization to read faster. Picking an audiobook removed a chunk of my own self-imposed mental torpor and kept me at a sustainable native speed. For the curious, it roughly doubles my reading rate from what I can tell.
I also find that audiobooks help a lot. I read slowly even in Russian, and I often reread sentences needlessly :(
1 x
LyricsTraining now has Finnish and Polish :)
Corrections welcome

reineke
Black Belt - 3rd Dan
Posts: 3570
Joined: Wed Jan 06, 2016 7:34 pm
Languages: Fox (C4)
Language Log: https://forum.language-learners.org/vie ... =15&t=6979
x 6554

Re: Selecting extensive reading materials

Postby reineke » Thu Mar 22, 2018 1:54 am



Is it Possible to Learn Enough Vocabulary from Extensive Reading?
2 x

Stefan
Green Belt
Posts: 379
Joined: Sun Dec 20, 2015 9:59 pm
Location: Sweden
Languages: -
x 920

Re: Selecting extensive reading materials

Postby Stefan » Thu Mar 22, 2018 8:28 pm

I apologize if this has been said already but Jeff McQuillan wrote an interesting summary (pdf) on Nation.

The other key assumption made by Nation – that 12 exposures to an unknown word are sufficient to acquire the word – is based on previous studies that produced differing estimates both above and below that figure. In Pellicer-Sanchez and Schmitt (2010), for example, unknown words that occurred at least 10 times in the text were acquired 80% of the time, as measured by a meaning recognition test (Table 1, p. 41). In Waring and Takaki (2003), at least 15 repetitions were required for a similar level of success (72%).

Nation (2014) analyzed a corpus comprised of 25 novels taken from Project Gutenberg (http://www.gutenberg.org). He also provided estimates of how long it would take a reader to read that amount of text, assuming a reading speed of 150 words per minute.

As shown in Table 1, one would need to read approximately 11,000,000 words to reach the 9,000-word-family level, and that this feat would take about 1,200 hours to complete. At one hour per day, this represents a little over three years of reading, very doable for a motivated adult or adolescent acquirer.
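The hours and years in that estimate can be checked with a few lines of arithmetic. A minimal sketch, assuming the figures quoted above (11,000,000 words, 150 words per minute, one hour a day); the function name and the 365-day year are illustrative assumptions:

```python
def reading_time(words, wpm=150, hours_per_day=1.0):
    """Return (total hours, years) needed to read `words` at `wpm`."""
    hours = words / wpm / 60
    days = hours / hours_per_day
    return hours, days / 365  # assumption: reading every day of the year


hours, years = reading_time(11_000_000)
print(round(hours), round(years, 1))  # 1222 3.3
```

Doubling the daily reading time to two hours halves the estimate to roughly a year and eight months.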


The science seems a bit shaky but that's a lot of reading.

Speaking of the comparison between English and German above, aren't those numbers a bit irrelevant due to compounds and how you end up with Rhabarberbarbarabarbarbarenbartbarbierbier?
1 x

reineke
Black Belt - 3rd Dan
Posts: 3570
Joined: Wed Jan 06, 2016 7:34 pm
Languages: Fox (C4)
Language Log: https://forum.language-learners.org/vie ... =15&t=6979
x 6554

Re: Selecting extensive reading materials

Postby reineke » Thu Mar 22, 2018 9:36 pm

"experience changes the quality of lexical representations, and does so differently for different words and different individuals. Some aspects of this relationship are well-described, including the logarithmic relationship between word frequency of occurrence and behavioral correlates of word recognition: ten exposures to an infrequent word may have a similarly strong impact on the quality of that word’s mental representation as 100 exposures to a word that is well entrenched in one’s mental lexicon...
Importantly, it may not be simply the number of exposures to a word – larger for good readers, smaller for poor ones, due to their differences in reading experience – that would give rise to individual variability. It may be that poor readers are not able to use the exposures they do get to create the kind of high quality lexical representations that skilled readers have...

For example, readers who make fewer phonological discriminations due to poor phonological processing skills will not end up with the same quality of lexical representation after 100 exposures than someone without phonological problems would end up with, even if their level of reading experience is matched. The same holds true for readers with a limited learning capacity or a compromised long-term lexical memory, or any other behavioral or organic characteristic that impedes the entrenchment of mental lexical representation: in all these cases the readers would have to have a larger number of exposures to a word than readers without those characteristics to create a representation of the same quality. "

https://forum.language-learners.org/vie ... 22&p=97879
1 x

