CEFR Levels and vocabulary size

General discussion about learning languages
Cavesa
Black Belt - 4th Dan
Posts: 4960
Joined: Mon Jul 20, 2015 9:46 am
Languages: Czech (N), French (C2) English (C1), Italian (C1), Spanish, German (C1)
x 17566

Re: CEFR Levels and vocabulary size

Postby Cavesa » Wed Oct 25, 2017 10:12 am

Iversen wrote:I have spent a fair amount of time on counting my words in different languages, and I have taken my precautions against some of the pitfalls you may experience with this kind of activity. For instance I don't count word families - I count headwords because that's what you can find in dictionaries. I have generally skipped proper names because some dictionaries have them, others don't, but for instance in English it would be logical to count "Leghorn" for Italian Livorno because it definitely isn't the form used in the source language. My counts would however suddenly swell if I now decided to include them. And if my English dictionary prints a word combination in bold then I include it, but not if it is printed with the same types as idiomatic expressions. I have in some cases tried to count expressions (separately, of course), but it was cumbersome, boring and pointless, given the very different profile of different dictionaries. So I dropped the idea, and after many word counts I now have a fairly consistent set of rules of thumb which I can apply across languages.

But there is one trick more - and it is probably the most important one: except in my earliest counts I have always included percentages. You can argue that there are more words in a big dictionary and consequently you should be able to tick more words, but apart from the pitifully tiny and the colossal dictionaries the percentages seem to be surprisingly consistent across dictionary sizes and languages. And I do feel that there is something substantial in the claim that I knew 36% of the words in my Afrikaans Prisma dictionary in 2009 and 66% in 2014. But unfortunately it is also a fact that scores can vary wildly for no apparent reason, like when I got 77% and 68% with two different Dutch dictionaries in 2014 - and the lowest percentage was found with the smallest one of the two dictionaries. Maybe I was just tired, maybe the small dictionary has fewer 'international' loanwords, or maybe it was just statistical hapax.

So I know how wary you have to be with vocabulary estimates. But my conclusion is nevertheless clear: vocabulary size IS relevant. My capability to read confidently with only few holes is directly proportional to the percentage of words I know in a language, and I have to know at least half the words from a standard size dictionary to be able to read without feeling strained - and two thirds is better. If I just know some 20-25 % (i.e. 8-10.000 known words out of 40.000 headwords in a typical midsize dictionary) I prefer having a dictionary within reach, unless I'm dealing with a bilingual printout.

But there are three caveats. The first is that reading can be more or less extensive, and if I'm just skimming a newspaper article I may not need to know the meaning of all the abbreviations and institution names. If I'm studying a text intensively then every word should be understood, and it is a much more serious business not to know a certain word.

The second caveat is that just knowing a word may not be enough if it is used in an idiomatic expression.

And finally word counts say very little about your ability to express yourself - especially at the lower levels. I have discussed this a lot with s-allard here and in HTLAL, and the funny thing is that I basically agree with him: knowing 2500 words well is more useful than knowing 25000 words purely passively. And if your native conversation partners aren't too dim they will know that they have to adapt to your dismally low level and not use words or constructions that are too difficult. The argument is NOT that you only get the chance to use 163 headwords in a given conversation - the relevant factor is that these words have been selected from a well-rehearsed subset of words from your target language. And surviving on a subset consisting of 2500 headwords is not unrealistic.


I actually wonder whether anyone counts words any other way than by counting headwords as they appear in a dictionary. I somehow cannot imagine anyone counting a word family as one word, or the opposite, counting a conjugated verb as a dozen words. The only exception, which could confuse some beginning learners, could be tools like readlang, which add any word you click on as a flashcard exactly as it appears, and then just show you the number of flashcards.

Yes, the colossal dictionaries are bound to be more precise. I should definitely try your counting method sometime soon.
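
For anyone who wants to try this kind of count, here is a minimal Python sketch of the arithmetic. It assumes you tick the headwords you know on a random sample of dictionary entries and then scale up to the whole dictionary. The function name, the sample figures, and the normal-approximation interval are illustrative assumptions, not Iversen's actual procedure.

```python
import math


def estimate_known_headwords(known_in_sample, sample_size, total_headwords, z=1.96):
    """Estimate known headwords from a random sample of dictionary entries.

    Returns (share, estimated_count, margin), where margin is the half-width
    of an approximate 95% interval on the share (normal approximation).
    """
    if sample_size <= 0 or not 0 <= known_in_sample <= sample_size:
        raise ValueError("need sample_size > 0 and 0 <= known_in_sample <= sample_size")
    share = known_in_sample / sample_size
    # Standard error of a proportion, scaled by z for the interval half-width.
    margin = z * math.sqrt(share * (1 - share) / sample_size)
    estimated_count = round(share * total_headwords)
    return share, estimated_count, margin


# Hypothetical numbers: 90 known out of 400 sampled entries, 40,000 headwords.
share, count, margin = estimate_known_headwords(90, 400, 40_000)
print(f"{share:.1%} known, about {count} headwords, +/- {margin:.1%}")
```

With samples of a few hundred entries the margin is several percentage points either way, so part of a gap between two dictionaries could simply be sampling noise.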

To the last part: I don't think there has ever been anyone saying the opposite, that you'd need many thousands of words to express yourself at the lower levels. The argument that was repeatedly so wrong was the claim that a C2 learner doesn't need that many words, just to know a small set really, really well. 2500 or 3500 is a ridiculously small number of words for a C1 or C2 learner, even if you can use each of them in a dozen ways. In order to choose the appropriate vocabulary, you need a larger pool, and the choice of appropriate vocabulary is judged during the language exams too. Both a B1 and a C2 learner may only get the chance to use 163 headwords in a given conversation, but it should be obvious that the C2 learner is choosing them from a much larger pile of options, taking the most appropriate one from it.

Surviving on a subset of 2500 headwords is not unrealistic; I've never said the opposite, nor can I remember anyone else saying it. But living like a normally intelligent and educated person on it is unrealistic. It's like the difference between surviving on bread and water and hoping to be well nourished and satisfied long-term on such a diet.

I also wouldn't go for the extreme of saying that a small amount of actively known words is more important than a large amount of passively known ones. We need both; the natives are not going to use our tiny vocabulary. And any interaction, spoken or written, consists of production and reception. And out of the two, comprehension can be more tricky and much more important. I am not pointing at anyone in particular here (definitely not Iversen), but it sometimes seems like many learners are much more concerned with their ability to share their genius thoughts with others, who should be grateful for them no matter how hard they are to understand due to the butchering of the language, than with the ability to listen and perfectly understand the thoughts of others.
7 x

LinguaPony
Orange Belt
Posts: 141
Joined: Mon Oct 23, 2017 7:50 am
Location: Saratov, Russia
Languages: Russian (N), English (Proficient), Italian (Intermediate), M. Chinese (Beginner), German (Just started), Yiddish (half-cooked A1, long since forgotten, but now queued for revival)
Language Log: https://forum.language-learners.org/vie ... =15&t=7160
x 309

Re: CEFR Levels and vocabulary size

Postby LinguaPony » Wed Oct 25, 2017 11:22 am

Cavesa wrote:I actually wonder whether anyone counts words differently than counting headwords, which are in a dictionary.


There are various online tools, which use complex extrapolation algorithms to estimate one's vocabulary size. I realise that they are wildly inaccurate, but they give a general idea - at least, one can always use them to compare one's vocabulary with the next guy's. I use those.

20000 words for C2 was always my idea of it, more or less. 4500 or 5000 English words is not enough. When I knew that many, I could barely get the idea of the plot in a not very complicated book. And I'm quite sure of the number, because I had just completed the computer game I was using to learn words, and it contained exactly 5000 of them.

That's not C2.
5 x

aaleks
Blue Belt
Posts: 884
Joined: Thu Apr 13, 2017 7:04 pm
Languages: Russian (N)
x 1910

Re: CEFR Levels and vocabulary size

Postby aaleks » Wed Oct 25, 2017 1:55 pm

Cavesa, I agree with everything you wrote here but with one exception:
Cavesa wrote:I also wouldn't go for the extreme of saying that a small amount of actively known words is more important than a large amount of passively known ones. We need both; the natives are not going to use our tiny vocabulary. And any interaction, spoken or written, consists of production and reception. And out of the two, comprehension can be more tricky and much more important. I am not pointing at anyone in particular here (definitely not Iversen), but it sometimes seems like many learners are much more concerned with their ability to share their genius thoughts with others, who should be grateful for them no matter how hard they are to understand due to the butchering of the language, than with the ability to listen and perfectly understand the thoughts of others.

I probably am an extreme example, so to speak :) , but it just happened that I wasn't working on my productive skills till the beginning of this year. I could read, watch TV, and had a big enough vocabulary (for a non-native living in a non-English-speaking country), but couldn't produce anything coherent. And to be honest, my first attempt was a complete failure :mrgreen: . And even though there's a noticeable improvement in my writing, I'm still butchering the language now and again :| . I think that a text written by someone with a smaller vocabulary but a better command of English in general would be more readable. I've actually seen such examples.
1 x

Serpent
Black Belt - 3rd Dan
Posts: 3657
Joined: Sat Jul 18, 2015 10:54 am
Location: Moskova
Languages: heritage
Russian (native); Belarusian, Polish

fluent or close: Finnish (certified C1), English; Portuguese, Spanish, German, Italian
learning: Croatian+, Ukrainian; Romanian, Galician; Danish, Swedish; Estonian
exploring: Latin, Karelian, Catalan, Dutch, Czech, Latvian
x 5179

Re: CEFR Levels and vocabulary size

Postby Serpent » Thu Oct 26, 2017 7:08 pm

aaleks wrote:And even though there's a noticeable improvement in my writing I'm butchering the language now and again :| . I think that a text written by someone with smaller vocabulary but better command of English in general would be more readable. I've actually seen such examples.
Honestly, in my opinion that's mostly the Russian obsession with grammatical accuracy, as well as the fairly significant differences between sentence structure in Russian and English. I wouldn't refer to your minor mistakes as butchering, and I think if a non-native is easier to understand than you, it's probably because their L1 is more similar to English (specifically a Germanic language or French, perhaps also something like Bulgarian). They may also be good at keeping it simple, in their L1 too.
6 x
LyricsTraining now has Finnish and Polish :)
Corrections welcome

reineke
Black Belt - 3rd Dan
Posts: 3570
Joined: Wed Jan 06, 2016 7:34 pm
Languages: Fox (C4)
Language Log: https://forum.language-learners.org/vie ... =15&t=6979
x 6554

Re: CEFR Levels and vocabulary size

Postby reineke » Sat Nov 04, 2017 4:41 pm

---
Last edited by reineke on Fri Dec 27, 2019 4:05 am, edited 1 time in total.
3 x

Inst
Orange Belt
Posts: 128
Joined: Thu Feb 07, 2019 9:43 pm
Languages: English (Primary), 普通话 (Mainland Mandarin Chinese, B2)
x 101

Re: CEFR Levels and vocabulary size

Postby Inst » Thu Feb 07, 2019 9:48 pm

I'm curious as to why people assume that a single number can account for all languages. Say, for instance, a standard Chinese undergraduate is said to know around 22,000 words and should know around 3500-5000 characters. This is different from a comparable English speaker who should know 42,000 words or so.

From a similar perspective, wouldn't languages like French or German require smaller lexicons of fluent native speakers? And when it comes to fluent native speakers, what level of education should we set as a litmus test? If you're professionally functional, i.e., have a basic vocabulary comparable to native speakers with a secondary school education, as well as a specialized vocabulary for your work, how is that different from fluent, unless you want to discount people, on a class basis, from being fluent in their native language?
0 x

Cavesa
Black Belt - 4th Dan
Posts: 4960
Joined: Mon Jul 20, 2015 9:46 am
Languages: Czech (N), French (C2) English (C1), Italian (C1), Spanish, German (C1)
x 17566

Re: CEFR Levels and vocabulary size

Postby Cavesa » Thu Feb 07, 2019 11:52 pm

Inst wrote:I'm curious as to why people assume that a single number can account for all languages. Say, for instance, a standard Chinese undergraduate is said to know around 22,000 words and should know around 3500-5000 characters. This is different from a comparable English speaker who should know 42,000 words or so.

From a similar perspective, wouldn't languages like French or German require smaller lexicons of fluent native speakers? And when it comes to fluent native speakers, what level of education should we set as a litmus test? If you're professionally functional, i.e., have a basic vocabulary comparable to native speakers with a secondary school education, as well as a specialized vocabulary for your work, how is that different from fluent, unless you want to discount people, on a class basis, from being fluent in their native language?


No, why?

When I look in a smaller dictionary of around 30,000-50,000 words, I see stuff the natives normally use. Even in a larger dictionary than that. Sure, they may not use all those words all the time, but they certainly use them whenever appropriate. 22,000 words in a European language sounds really odd and too few (but I cannot speak for the non-European languages).

Why should huge languages like French or German be poorer in vocabulary than English? This argument could fit dying languages that serve very few purposes these days. Dead languages, that mostly stopped evolving a few hundred years ago. But not huge languages that serve all the purposes in the lives of the natives and in all the areas of human activity, including literature and science.

I think people tend to underestimate the vocabulary of the natives with not that high education. If you have the same vocabulary as a native with secondary school, you are definitely not bad at all. Even without a university degree, people need tons of vocabulary.

Careful with the word "fluent". It is really vague and doesn't have that much in common with vocab, I'd say. It is about fluidity, not complexity. And while comparing oneself to the natives makes sense sometimes, it is tricky. Neither the CEFR levels nor the word "fluent" are meant to be used for natives.

Yes, there are natives with poor vocabulary (usually people who don't like to read. Hey, there are even medical students who watch their vocabulary outside of the field worsen considerably. I've heard it from several people, and they all thought it was mostly due to not reading normal books). But you cannot say it is just about their official education. And there are also natives who will have a poorer vocabulary than language learners even at the lower levels, usually as part of some neurological pathology. Really, it is tricky to draw parallels.
2 x

Inst
Orange Belt
Posts: 128
Joined: Thu Feb 07, 2019 9:43 pm
Languages: English (Primary), 普通话 (Mainland Mandarin Chinese, B2)
x 101

Re: CEFR Levels and vocabulary size

Postby Inst » Fri Feb 08, 2019 12:42 am

French and German are synthetic languages, and as such, might be expected to have a smaller lexicon than analytic languages. By extension, you could assume the number of words known by an educated French speaker would be smaller than that of an educated English speaker.

And from casual searching, the number that seems to come up is 30,000, whereas I'm getting a number of 42,000 for English. My personal standard would be half native, i.e., you'd be able to have a passive vocabulary equal to half the native speaker vocabulary of a reasonably high education level. In most languages, this would be somewhat higher than the C2 standard for a given language.

Another data point is that Googling seems to suggest that the C2 standards for French and German are about 5000 words / word-stems. TOEFL, in comparison, is around 8000, as is JLPT N1, which suggests that synthetic languages have smaller common vocabularies than analytic languages.

===

As for native speakers with lower formal education, if you read up about the Chinese HSK, even native speakers know it's not C2. They describe it as comparable to the requirements of a lower secondary education (as well as minimum literacy requirements for Chinese speakers), i.e., an upper secondary graduate would know 40-100% more Sinograms than the HSK C2 level, and by HSK's roughly 1:2 ratio between characters and words, 40-100% more words. So from this paradigm, we can deduce that vocabulary is linked to, but not necessarily determined by, the level of education. This is then a useful reference frame; in discussions of necessary French vocabulary, there are differing levels of vocabulary size linked to education level, such as natives who get by with a 5000 word lexicon and others who do 30% of Le Petit Robert.
0 x

Querneus
Blue Belt
Posts: 836
Joined: Thu Dec 01, 2016 5:28 am
Location: Vancouver, Canada
Languages: Speaks: Spanish (N), English
Studying: Latin, French, Mandarin
x 2269

Re: CEFR Levels and vocabulary size

Postby Querneus » Fri Feb 08, 2019 9:32 am

Inst wrote:French and German are synthetic languages, and as such, might be expected to have a smaller lexicon than analytic languages. By extension, you could assume the number of words known by an educated French speaker would be smaller than that of an educated English speaker.

And from casual searching, the number that seems to come up is 30,000, whereas I'm getting a number of 42,000 for English. My personal standard would be half native, i.e., you'd be able to have a passive vocabulary equal to half the native speaker vocabulary of a reasonably high education level. In most languages, this would be somewhat higher than the C2 standard for a given language.

Another data point is that Googling seems to suggest that the C2 standards for French and German are about 5000 words / word-stems. TOEFL, in comparison, is around 8000, as is JLPT N1, which suggests that synthetic languages have smaller common vocabularies than analytic languages.

I am strongly suspicious of this assertion and these numbers. I think they may reflect different cultures and methodologies among the people (researchers) who care about counting the words known by natives more than anything else. My intuition is that English and French/German have almost exactly the same level of analysis-synthesis anyway.

Shouldn't the logic go the other way actually? An analytic language uses phrases of distinct words where a more synthetic language uses derivational affixes on a word (e.g. English go up vs. French monter, English cherry tree vs. French cerisier), so the number of headwords in more synthetic languages should be higher than in more analytic ones. The one thing that would make English natives carry more headwords than French ones is precisely the synthetic side of English: technical vocabulary is largely formed synthetically using morphemes borrowed from Latin, Greek and French!

In fact, I do remember coming across a paper by a linguist who studied the "minutes" (records) of discussions in the Canadian territorial government of Nunavut, carried out in Inuktitut, obviously a highly synthetic ("polysynthetic") North American indigenous language. The language is so, so synthetic that the proportion of words used only once within a document was far, far, far higher than in English, as the weight of meaning is carried by derivational affixes attached to common words, creating word forms (theoretically potential dictionary headwords) that speakers sometimes already all know, but that are more commonly completely novel, just as a sentence in English is novel. This is well known about highly synthetic languages of course, but what surprised the linguist was that the rate of appearance of words that occur only once never really flattens in Inuktitut texts the way it does in English, i.e. the proportion of words used only once in Inuktitut does not get any lower in a document/corpus of 100K words compared to one of 10K words, unlike similar English documents/corpora.
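
To make the "words used only once" measure concrete, here is a minimal Python sketch that computes the share of once-only word forms (hapax legomena) for growing prefixes of a text. The function names, the whitespace tokenization, and the toy English sentence are illustrative assumptions; this is not the method or data from the paper mentioned above.

```python
from collections import Counter


def hapax_share(tokens):
    """Share of tokens whose word form occurs exactly once in the text."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(tokens)


def hapax_curve(tokens, step):
    """Hapax share for growing prefixes of the text: [(n_tokens, share), ...]."""
    if step <= 0:
        raise ValueError("step must be positive")
    return [(n, hapax_share(tokens[:n])) for n in range(step, len(tokens) + 1, step)]


# Toy example with naive whitespace tokenization (an assumption; a real
# comparison would need proper tokenization and corpora of 10K-100K words).
sample = "the cat saw the dog and the dog saw a bird".split()
for n, share in hapax_curve(sample, 5):
    print(n, round(share, 2))
```

In a language where the share flattens, the values returned by `hapax_curve` should fall as the prefix grows; the claim above is that in Inuktitut they stay roughly level.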
3 x

Inst
Orange Belt
Posts: 128
Joined: Thu Feb 07, 2019 9:43 pm
Languages: English (Primary), 普通话 (Mainland Mandarin Chinese, B2)
x 101

Re: CEFR Levels and vocabulary size

Postby Inst » Fri Feb 08, 2019 11:14 am

TBH, it's just my bad memory. What I do recall is the comparison between vocabulary requirements for C2 French vs C2 English (TOEFL), with the latter being significantly larger.

The way I assumed the synthetic / analytic difference worked was that a synthetic language usually had a complex grammar with many inflections. For instance, nouns in English are rarely gendered, and gender has little grammatical effect. French, in contrast, has gender on its nouns, and someone on this board claimed that verbs have up to 30 forms, although many fall into verb classes. German, in contrast, has three genders, with four different cases. I thought the end effect would be that it'd be harder for native speakers to expand their vocabulary; not only would they need to grasp the meaning, pronunciation, and sometimes orthography of the new word, they would also need to learn the grammatical features of the new word.

A better illustration might be Chinese, although it is analytic. A Chinese speaker unfamiliar with a word incorporating unfamiliar characters (typical character recognition is between 3500 and 5000 characters) would have to guess the meaning from context, same as with any other language, but the Chinese speaker would also need to guess the pronunciation. While analytic, it seems as though Chinese vocabularies are relatively sparse in terms of "words", because if they pick up a word from conversation, they'd also need to learn how to write it, and if they pick up a word from written text, they'd also need to figure out how to pronounce it.

Put another way, it's noted that speakers of East Asian languages often have earlier mathematics acquisition than speakers of European languages, given that European languages often retain the remnants of vigesimal systems. That is to say, the systematic acquisition of number names occurs more slowly in speakers of European languages because they need to learn that tenty-one is not a word, while eleven is, and twenty-eleven is not a word either. By the same logic, needing to learn the necessary inflections for words in their native language (and if you say it's trivial, please note the profusion of "bad grammar / spelling" blogs in English) slows down the total rate of vocabulary acquisition.

But this is just a conjecture, anyways. I can't get clear and definitive data on the following topics:

-Rates at which children learn words/word families up to adulthood in a given language
-Vocabulary size of young adults in a given language, varying by education level
0 x

