Routledge Frequency dictionaries — 5k limit

einzelne
Blue Belt
Posts: 804
Joined: Sat Mar 17, 2018 11:33 pm
Languages: Russian (N), English (Working knowledge), French (Reading), German (Reading), Italian (Reading on Kindle)
x 2882


Postby einzelne » Wed Jun 15, 2022 6:51 pm

Just curious: does anybody know why Routledge decided to limit its frequency dictionaries to 5000 words? Do they explain their rationale somewhere?

According to Nation, 9k word families are needed to cover 98% of a text (at least in English), so why are there no second volumes covering the 5k-10k range?
0 x

Iversen
Black Belt - 4th Dan
Posts: 4768
Joined: Sun Jul 19, 2015 7:36 pm
Location: Denmark
Languages: Monolingual travels in Danish, English, German, Dutch, Swedish, French, Portuguese, Spanish, Catalan, Italian, Romanian and (part time) Esperanto
Ahem, not yet: Norwegian, Afrikaans, Platt, Scots, Russian, Serbian, Bulgarian, Albanian, Greek, Latin, Irish, Indonesian and a few more...
Language Log: viewtopic.php?f=15&t=1027
x 14962

Re: Routledge Frequency dictionaries — 5k limit

Postby Iversen » Wed Jun 15, 2022 7:41 pm

5K words is actually more than necessary - above the first 1000 or so words the frequencies are so low and so close that their order is irrelevant. OK then the point could be to give you 5000 words which you definitely should learn before you die, but if you take your hobbies and special competence areas into account then the relevant selection of words for you is bound to differ from whatever the book tells you. For instance I need words for musical instruments and zoology and astronomy and nuclear physics, whereas it would be less logical for me to focus on words for sport and pop culture and modern slang words. I can learn those later (if ever..)
5 x

luke
Brown Belt
Posts: 1243
Joined: Fri Aug 07, 2015 9:09 pm
Languages: English (N). Spanish (intermediate), Esperanto (B1), French (intermediate but rusting)
Language Log: https://forum.language-learners.org/vie ... 15&t=16948
x 3631

Re: Routledge Frequency dictionaries — 5k limit

Postby luke » Wed Jun 15, 2022 7:50 pm

I agree with the other posters, but an additional angle on 5000 words is set up and discussed at 9:09 in Professor Arguelles's video on Reading Literature.



We could almost use a poll on the utility or non-utility of frequency dictionaries. :o
0 x
: 124 / 124 Cien años de soledad 20x
: 5479 / 5500 5500 pages - Reading
: 51 / 55 FSI Basic Spanish 3x
: 309 / 506 Camino a Macondo

BeaP
Green Belt
Posts: 405
Joined: Sun Oct 17, 2021 8:18 am
Languages: Hungarian (N), English, German, Spanish, French, Italian
x 1990

Re: Routledge Frequency dictionaries — 5k limit

Postby BeaP » Wed Jun 15, 2022 8:25 pm

It's in the foreword. I don't have the book, but I could read the beginning in the Amazon preview. "Nation has shown that the 4-5000 most frequent words account for up to 95 per cent of a written text." Most people probably wouldn't buy a second volume to make it 98.
4 x

DaveAgain
Black Belt - 1st Dan
Posts: 1968
Joined: Mon Aug 27, 2018 11:26 am
Languages: English (native), French & German (learning).
Language Log: https://forum.language-learners.org/vie ... &start=200
x 4050

Re: Routledge Frequency dictionaries — 5k limit

Postby DaveAgain » Wed Jun 15, 2022 9:16 pm

BeaP wrote:It's in the foreword. I don't have the book, but I could read the beginning in the Amazon preview. "Nation has shown that the 4-5000 most frequent words account for up to 95 per cent of a written text." Most people probably wouldn't buy a second volume to make it 98.
I have the French one (ISBN: 9780415775311), the Nation reference is in the "series preface".
einzelne wrote:... why are there no second volumes which would cover the 5k-10k range?
For French, Editions Retz publish a 7,000 word dictionary for learners: Diclé:
This DICtionnaire pour Lire et pour Écrire (DICLÉ, "dictionary for reading and writing") was designed to support learners, both adolescents and adults, as they get to grips with written French. Simple to use, clear and accessible, it contains about 7,000 essential words of everyday French, simply defined.
1 x


Re: Routledge Frequency dictionaries — 5k limit

Postby einzelne » Wed Jun 15, 2022 9:30 pm

BeaP wrote:It's in the foreword. I don't have the book, but I could read the beginning in the Amazon preview. "Nation has shown that the 4-5000 most frequent words account for up to 95 per cent of a written text." Most people probably wouldn't buy a second volume to make it 98.


Thank you for the quote. I thought I had read their rationale but couldn't locate it in the text.

STT44 wrote:Cost-benefit.


Why wouldn't they share a list of these words online, since they have already done all the statistical work and analysis? Just a simple list, without examples and translations, would do.

luke wrote:We could almost use a poll on the utility or non-utility of frequency dictionaries. :o


When used appropriately, I have found them extremely effective and useful.

DaveAgain wrote:For French, Editions Retz publish a 7,000 word dictionary for learners


Thanks, but it doesn't look like a frequency dictionary. They put the words in alphabetical order, which ruins the whole idea.
3 x


Re: Routledge Frequency dictionaries — 5k limit

Postby einzelne » Thu Jun 16, 2022 1:37 am

Iversen wrote:5K words is actually more than necessary - above the first 1000 or so words the frequencies are so low and so close that their order is irrelevant. OK then the point could be to give you 5000 words which you definitely should learn before you die, but if you take your hobbies and special competence areas into account then the relevant selection of words for you is bound to differ from whatever the book tells you. For instance I need words for musical instruments and zoology and astronomy and nuclear physics, whereas it would be less logical for me to focus on words for sport and pop culture and modern slang words. I can learn those later (if ever..)


I believe that 1k is a huge underestimation. As for domain-specific words, I don't think frequency vocabularies are that necessary: you would pick them up organically, since they would occur quite often and context would help you understand them even without a dictionary, given that you already know the topic in your native language.

Now imagine you would like to develop passable reading skills across different genres of writing. As you're aware, the deeper you go into the list, the more sharply your chances of reviewing a particular word through natural exposure drop. I don't worry too much about the first 5k words (if you engage with the language on a regular basis, this core stays with you). But there are other words which educated native speakers know: depending on the estimate, it's 40k or up to 80k. So I would really appreciate it if someone could indicate which 4k out of these 40k-80k would be more useful, statistically speaking (especially if you have other languages to take care of).
2 x


Re: Routledge Frequency dictionaries — 5k limit

Postby Iversen » Thu Jun 16, 2022 6:43 am

The problem is that if you read two texts that each contain X words, the overlap in vocabulary will be limited. Long ago I did some studies on my own contributions to HTLAL. I subdivided my second (2014) corpus into two equal parts, cut the English wordforms therein down to unique headwords and did the statistics below. I found that even though there was just one author in play, about half of the words in each sub-corpus were specific to that sub-corpus, leaving just a third of the total as overlap. Since then I have been very sceptical about lists with just a few thousand words, and my experiences with dictionaries bear this out: for my stronger languages the words I try to find are often not there. Technically speaking, you hope to find the lookup words in the overlap between the dictionary and your source, and looking for that overlap with a tiny dictionary is obviously less efficient than looking for it with a larger one.
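
For anyone who wants to repeat this kind of split-half experiment on their own texts, here is a rough Python sketch. It is not the original procedure: that one used Excel plus a manual reduction of wordforms to headwords, whereas this sketch simply treats lowercased wordforms as a stand-in for headwords. The function name `split_half_overlap` and the sample sentence are invented for illustration.

```python
import re


def split_half_overlap(text: str) -> dict:
    """Split a text into two halves by word count and compare their vocabularies.

    Assumption: lowercased wordforms stand in for headwords; no lemmatisation
    is attempted, so "house" and "houses" count as different items.
    """
    words = re.findall(r"[a-z']+", text.lower())
    mid = len(words) // 2
    first, second = set(words[:mid]), set(words[mid:])
    shared = first & second
    total = first | second
    return {
        "only_first": len(first - second),    # items seen only in half 1
        "only_second": len(second - first),   # items seen only in half 2
        "shared": len(shared),                # the overlap
        "total": len(total),                  # distinct items overall
        "shared_share_of_total": len(shared) / len(total),
    }


# Toy input, purely illustrative
sample = "the cat sat on the mat the dog sat on the log"
result = split_half_overlap(sample)
print(result)
```

On a real corpus the interesting figure is `shared_share_of_total`, which corresponds to the "a third of the total" mentioned above.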

So basically my claims are 1) above something like 1000 words the order in the frequency list is totally irrelevant for any specific learner, and 2) you may try to learn at least all the 5K words in a Routledge list just to cover the ground, but you'll still need to supplement them with words from your personal fields of interest, either from specialized literature or - as I do when I do wordlists - from a reasonably big dictionary. And personally I would prefer learning the words from a dictionary where they are listed in alphabetical order than from a list where most of them are listed in a useless order. But best of all: learn the relevant 'special words' by studying texts where they are plentiful and explained in the text - or at least guessable.

My advice: check at an early stage that you know all the most common words (be it 500 or 1000 items, but not more) and then use the remainder of the Routledge book to fill out your holes after you have built a sizeable vocabulary using other methods.

HTLAL-words-used.jpg
5 x


Re: Routledge Frequency dictionaries — 5k limit

Postby luke » Thu Jun 16, 2022 9:29 am

Iversen wrote:My advice: check at an early stage that you know all the most common words (be it 500 or 1000 items, but not more) and then use the remainder of the Routledge book to fill out your holes after you have built a sizeable vocabulary using other methods.

HTLAL-words-used.jpg

Your process and experiments are very interesting. I'm trying to make sense of the 37000 in the overlap column. It is higher than both of the wordforms 2014.1 and wordforms 2014.2 columns. I would think the overlap would have to be less than those two columns individually, since it's the overlap. (The unique headword overlap column follows this expectation).

Can you say something more about the wordforms overlap column?
0 x


Re: Routledge Frequency dictionaries — 5k limit

Postby Iversen » Thu Jun 16, 2022 12:17 pm

I count wordforms (where not only "is" and "am" count separately, but also "house" and "houses"), and I count unique headwords. Uniqueness is achieved by using a function in MS Excel, but I had to do the clean-up from unique wordforms to headwords manually (i.e. the step that identifies not only "house" and "houses", but also "is" and "am", as forms of one headword, while leaving "go" as a verb separate from "go" as a substantive).

The total sum for wordforms (1+2) is 73,172, and I guess I calculated the overlap by marking all wordforms with 1 or 2 and then checking how many were marked with both 1 and 2. But I did those calculations several years ago, so it seems I have lost them. The situation is different for the unique headwords. Here I first added 1 + 2 with the result 7412, and then I ran the combined list through the uniqueness function in Excel again and ended up with the 5433 in the table. That means I had 'lost' 1979 items, which must be the overlap.
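
The headword arithmetic here is plain inclusion-exclusion: the overlap equals the sum of the two list sizes minus the size of the de-duplicated combined list. A minimal Python sketch follows, using only the figures quoted in this post (7412 and 5433); the function name `overlap_from_totals` and the toy sets are made up for illustration.

```python
def overlap_from_totals(sum_of_list_sizes: int, size_of_union: int) -> int:
    """Inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|."""
    return sum_of_list_sizes - size_of_union


# Figures quoted in the post: headwords in part 1 + part 2, then after de-duplication
overlap = overlap_from_totals(7412, 5433)
share_of_union = overlap / 5433
print(overlap, round(share_of_union, 2))  # 1979 0.36

# Sanity check of the identity on two small, invented headword sets
a = {"house", "be", "go", "word"}
b = {"be", "go", "list", "corpus", "forum"}
assert overlap_from_totals(len(a) + len(b), len(a | b)) == len(a & b)
```

The 0.36 is the "about a third of the total" figure: 1979 shared headwords out of 5433 distinct ones.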

And how can you reconcile this with the statement that, according to Paul Nation, the 5000 most common words cover 95% of all words in a standard text? Well, I haven't read the writings of Paul Nation recently so I might be wrong, but the catch could be that the absolutely MOST common words fill up most of the text simply by occurring a lot of times, so if you simply look at the coverage the 95% might be realistic. In my statistics, by contrast, I ignored repetitions and only checked whether a given headword was represented or not. And then the rare words (those above ½-1 K) will of course dominate.
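
The difference between the two ways of counting can be shown with a toy example. The sketch below is only an illustration on invented data, not a reconstruction of Nation's or my own figures: `token_coverage` counts every occurrence (running words), while `type_share` counts each distinct word once. Both function names and the token list are hypothetical.

```python
from collections import Counter


def token_coverage(tokens: list[str], k: int) -> float:
    """Share of all running words accounted for by the k most frequent words."""
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(tokens)


def type_share(tokens: list[str], k: int) -> float:
    """Share of the distinct words that the k most frequent words represent."""
    distinct = len(set(tokens))
    return min(k, distinct) / distinct


# Invented toy text: three very common words plus ten words that occur once each
tokens = (
    ["the"] * 50 + ["of"] * 30 + ["language"] * 10
    + ["zoology", "astronomy", "oboe", "neutron", "quasar",
       "lemur", "bassoon", "isotope", "nebula", "tapir"]
)

print(token_coverage(tokens, 3))        # 0.9
print(round(type_share(tokens, 3), 2))  # 0.23
```

So the same three words cover 90% of the running text but are less than a quarter of the distinct vocabulary, which is how a high coverage figure and a list dominated by rare words can both be true at once.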

The 2009 figures are lower because I had only been a member of HTLAL for a short time back then. Those from 2014 cover all my writings in English from the day I joined HTLAL to the day (or days?) I spent on plucking my own posts from the forum and cleansing them of everything in languages other than English and everything written by others.

HTLAL-words-used.jpg
3 x

