Routledge Frequency dictionaries — 5k limit

einzelne · Postby **einzelne** » Fri Jun 17, 2022 6:18 pm

BeaP wrote:Also, I wouldn't draw final conclusions from a study of German textbooks that only includes books written by academics at (mostly) American universities.

See the original study. Deutsch heute, Neue Horizonte and Kontakte are all German only textbooks made in Germany.

BeaP wrote:Maybe there are no 10k lists, because the difference between the frequency of each element is very small (there would be 2000 words at the same place), and it's affected by the type of the texts so much that it can't be done in a scientific way.

Indeed, there's a long tail where all the words have pretty much the same weight. Still, for pedagogical purposes, I think it would be useful to know the number of occurrences in the corpus, and I believe you could still do a targeted list for a specific domain (newspapers or fiction, for instance).

For English, there's the COCA project and I don't see the reason why there shouldn't be a similar thing for other major languages.

Ideally, it would be great to have some sort of built-in option in your reading app: you mark a specific topic, or a specific set of books you're planning to read in your target language, and every time you use the pop-up dictionary, it gives the number of occurrences of that word in your own corpus. But, given the copyright issues, I doubt we will see anything like that in the foreseeable future. Then again, who knows, maybe one day Amazon will become a monopolist and will be able to do with eBooks whatever it wants :twisted:

SpanishInput · Postby **SpanishInput** » Fri Jun 17, 2022 10:02 pm

Hey, Einzelne.

I just wanted to clarify that I'm actually in favor of frequency lists/dictionaries, as can be inferred from my other threads. The fact that I have purchased two of Routledge's frequency dictionaries is testimony to this. I was just playing a little bit of devil's advocate in my previous post. In one of my videos I've also mentioned the fact that some beginner's courses show a complete lack of focus on common vocabulary. For example, one course bombards the student with lots of uncommon cognates just for the sake of telling a story. Meanwhile, extremely common word combinations such as "lo que", "el que", "la que", "¿cómo así..." and "Y si..." go unnoticed by students.

Now, coming back to your original post:

Even though there's no curated, lemmatized wordlist (Let's face it, Routledge's frequency "dictionaries" are actually padded wordlists) beyond 5K, if you're mostly interested in literary Spanish there's the CREA frequency list. You can download it here:

https://corpus.rae.es/lfrecuencias.html

It is not lemmatized, so you might see the same verb lots of times while going through it, but you could use it as a quick way to check if a word you encounter in your readings is worth adding to your study list or not. For example, you could set your "cutoff point" at 20K.
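To make the "cutoff point" idea concrete, here is a minimal sketch in Python. It assumes you have already loaded the word forms into a list sorted from most to least frequent; how you parse the downloaded file is left out, since I'm not assuming anything about its layout. The names `build_rank_index` and `worth_studying` and the tiny sample list are made up for illustration.

```python
def build_rank_index(words_by_frequency):
    """Map each word form to its 1-based rank in a frequency-sorted list."""
    ranks = {}
    for position, word in enumerate(words_by_frequency, start=1):
        # Keep the best (lowest) rank if a form appears more than once.
        ranks.setdefault(word.lower(), position)
    return ranks


def worth_studying(word, ranks, cutoff=20_000):
    """Return True if the word is in the list and ranked within the cutoff."""
    rank = ranks.get(word.lower())
    return rank is not None and rank <= cutoff


# Hypothetical miniature list, most frequent first (not real CREA data).
sample_list = ["de", "la", "que", "el", "en", "casa", "patidifuso"]
ranks = build_rank_index(sample_list)

print(worth_studying("casa", ranks, cutoff=6))
print(worth_studying("patidifuso", ranks, cutoff=6))
```

Because the list is not lemmatized, a lookup like this works on the exact form you met in your reading, which is usually what you want for a quick "add it or skip it" decision.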

I'm also attaching my 20K list gathered from Netflix data here. It doesn't come from a big corpus, but research has shown that sorting words by contextual diversity instead of frequency is far more important than corpus size. In one study, a list ordered by CD beat a list ordered by frequency when it comes to correlation to lexical decision times, even though the list ordered by frequency came from a corpus that was 10000 times bigger. Another reason why the smaller corpus beat the bigger one is that the smaller one came from movie/TV subtitles, while the bigger one came from websites. Routledge's dictionary comes mostly from written Spanish. Only a small fraction comes from "habla culta" (educated speech), that is, transcriptions of sermons and lectures, and an even tinier fraction actually comes from conversations, but still within the "culto" register.
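For anyone wondering what the difference between the two orderings looks like in practice, here is a rough sketch. It treats contextual diversity as the number of separate documents (for example, episodes or films) in which a word appears at least once, while frequency is the raw number of occurrences. The tokenizer, the function name `frequency_and_cd` and the three sample "documents" are illustrative assumptions, not the pipeline used for the attached list.

```python
import re
from collections import Counter

# Runs of letters only (Unicode-aware, so accented letters are kept).
TOKEN = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Lowercase the text and split it into word forms."""
    return TOKEN.findall(text.lower())


def frequency_and_cd(documents):
    """Return (frequency, contextual_diversity) counters for the documents."""
    frequency = Counter()
    contextual_diversity = Counter()
    for text in documents:
        tokens = tokenize(text)
        frequency.update(tokens)                  # every occurrence counts
        contextual_diversity.update(set(tokens))  # at most once per document
    return frequency, contextual_diversity


# Hypothetical toy corpus: each string stands for one subtitle file.
docs = [
    "el dragón ataca al dragón y el dragón gana",
    "no sé lo que pasa",
    "lo que quiero es café",
]
freq, cd = frequency_and_cd(docs)

# Order by contextual diversity first, using frequency only to break ties.
by_cd = sorted(cd, key=lambda w: (-cd[w], -freq[w]))
print(by_cd[:2])
```

In this toy corpus "dragón" has the highest raw frequency but shows up in only one document, so it drops below "lo" and "que" once the list is ordered by contextual diversity.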

Of course my list is just a "version 1" and I'm already gathering more data for version 2, especially to expand the number of different contexts, but I'll still keep my focus on spoken Spanish.

einzelne · Postby **einzelne** » Sat Jun 18, 2022 3:27 pm

SpanishInput wrote:Of course my list is just a "version 1" and I'm already gathering more data for version 2, especially to expand the number of different contexts, but I'll still keep my focus on spoken Spanish.

Thank you for sharing! I really appreciate this.

s_allard · Postby **s_allard** » Sun Jun 19, 2022 9:31 pm

The old-timers from HTLAL know that my interest in vocabulary is at the other end of the spectrum from the numbers discussed here. In other words (no pun intended), I believe people use only a very small number of words in ordinary speech. Writing, especially fiction, will probably require more vocabulary, but nothing like those huge numbers bandied around here.

My starting point is the now old observation that 38 word lemmas account for about 50% of all the words of a corpus of modern written and spoken French. These are primarily grammar function words like the three key verbs avoir, être, faire, the articles, pronouns and prepositions.
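A coverage figure of this kind can be checked on any text you have at hand. The sketch below computes what share of all running words is covered by the N most frequent items, and how many items are needed to reach a target share. It counts surface word forms rather than lemmas (lemmatizing would need an extra tool), and the function names and the toy sentence are only illustrative.

```python
from collections import Counter


def coverage_of_top(tokens, n):
    """Share of all tokens accounted for by the n most frequent items."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(n))
    return covered / len(tokens)


def words_needed_for(tokens, target):
    """Smallest number of top-ranked items whose coverage reaches target."""
    counts = Counter(tokens)
    total = len(tokens)
    running = 0
    for needed, (_, count) in enumerate(counts.most_common(), start=1):
        running += count
        if running / total >= target:
            return needed
    return len(counts)


# Toy example: 8 running words, 5 distinct forms.
tokens = "le chat et le chien et le lapin".split()
print(coverage_of_top(tokens, 2))
print(words_needed_for(tokens, 0.5))
```

On a real corpus the curve rises very steeply at first and then flattens, which is the long-tail effect mentioned earlier in the thread.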

Before we even start to speak of vocabulary size, we have to make the distinction between productive and receptive vocabulary. To make things very simple, let’s say that your productive vocabulary consists of those lemmas that you have used in the last six months.

Your receptive vocabulary consists of those lemmas that you recognize and whose meaning you think you know.

How large is your productive vocabulary? Disregarding complications such as idioms and formulaic language and using just lemmas, we would probably observe that in both our spoken and written language, we use quite small vocabularies. For example, across all your posts here, how many different lemmas have you used in the last six months? I bet that nobody here, with the exception of Iversen, has ever attempted to count this.

This is an extremely tedious task but just a little sample of a few texts will demonstrate that we tend to actually write and speak with just a few hundred words. Forget about how many words you think you know and could use. Count just the words that you have actually used. For example, I have not used the words knee, ankle, Achilles heel, tonsils and many parts of human anatomy in the last six months. They are part of my potential vocabulary but I don’t use them.
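The counting itself does not have to be done by hand. As a rough sketch, assuming you have saved your own posts as plain strings, a few lines of Python will give the number of distinct word forms you actually used. Note that this counts forms, not lemmas, so "use" and "used" are counted separately and the result overstates the lemma count; the function name and the two sample posts are invented for illustration.

```python
import re

# Runs of letters only (Unicode-aware, so accented letters are kept).
TOKEN = re.compile(r"[^\W\d_]+")


def productive_vocabulary(texts):
    """Return the set of distinct lowercase word forms used in the texts."""
    seen = set()
    for text in texts:
        seen.update(TOKEN.findall(text.lower()))
    return seen


# Hypothetical sample of one's own posts.
posts = [
    "I have not used the word knee.",
    "I have used the word ankle, not knee.",
]
vocab = productive_vocabulary(posts)
print(len(vocab), sorted(vocab))
```

Running it over a larger sample of your own writing gives a quick first estimate before deciding whether a proper lemma count is worth the effort.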

In Spanish I regularly use the words pletórico and patidifuso, to the amusement of my Mexican tutors because these words are considered somewhat rare. But they are part of my productive vocabulary.

What we also know is that we tend to use common words in many different contexts and therefore different meanings, which is why they are so common. For example, there are around 80 more or less different uses of the verb faire in French. Similarly, a word like coup can be used in countless contexts with meanings that have nothing to do with each other.

The point of all this is that what you need above all is a solid grounding in the most common core grammatical forms and some very basic vocabulary. Then you expand your vocabulary as you need to. Maybe you’ll need 500 or 1000 or 2000 unique words. In my opinion, and this is just one opinion, trying to study a list of the 5000 most common lemmas of the target language is not very efficient.

That’s just for productive vocabulary. As for receptive vocabulary, that will have to wait for another post.

Le Baron · Postby **Le Baron** » Sun Jun 19, 2022 10:11 pm

s_allard wrote:The point of all this is that what you need above all is a solid grounding in the most common core grammatical forms and some very basic vocabulary. Then you expand your vocabulary as you need to. Maybe you’ll need 500 or 1000 or 2000 unique words. In my opinion, and this is just one opinion, trying to study a list of the 5000 most common lemmas of the target language is not very efficient.

That’s just for productive vocabulary. As for receptive vocabulary, that will have to wait for another post.

I'm pretty much on board with that: core grammar framework, though perhaps more than just basic vocabulary, yet not thousands and thousands of words in the beginning. You left the door open regarding receptive vocabulary and yet I think that c'est là que le bât blesse. The necessity to understand what other people are saying a lot of the time is greater. It matters less if one's own production is more rudimentary, even if some sophistication feels more satisfactory.

It will matter what you want to do, but even if a person wasn't planning to just read novels and watch TV, but rather to interact with TL speakers, they still need to understand those speakers and all of these have different vocabularies around that basic core. Every single one of them commonly larger and richer in different ways than the TL 2nd-language user. Which is why natives of different levels can still interact pretty seamlessly whilst employing different registers, but the TL user can crash and burn on just a few slang words. In fact that's where a vocabulary including the common slang words can help a great deal.
It might be fair to add that when diverse native speakers interact it is possibly achieved by shrinking down to the more shared core vocabulary. Which comes down in favour of your position I think.

All that said I still agree with you that much can be achieved with fewer materials than is assumed. The bit below surprised me though:

s_allard wrote:For example, I have not used the words knee, ankle, Achilles heel, tonsils and many parts of human anatomy in the last six months. They are part of my potential vocabulary but I don’t use them.

Anatomical words tend to turn up quite a bit. 'Potential vocabulary' may well mark a tangible difference though. The very fact you could recognise and use them (or the equal words in any other language) lifts your functionality and participation in a language.

SpanishInput · Postby **SpanishInput** » Sun Jun 19, 2022 10:29 pm

Le Baron wrote:The necessity to understand what other people are saying a lot of the time is greater. It matters less if one's own production is more rudimentary, even if some sophistication feels more satisfactory.

Yup. Most courses seem to be focused on "Imagine you're an American man trying to say something to a foreign woman" and less on "Imagine you're trying to figure out what on Earth the foreign woman is saying to you". Sadly, courses created in English and then translated to several different languages following the same basic dialogues/story seem to be the norm.

BeaP · Postby **BeaP** » Mon Jun 20, 2022 5:11 am

SpanishInput wrote: Most courses seem to be focused on "Imagine you're an American man trying to say something to a foreign woman" and less on "Imagine you're trying to figure out what on Earth the foreign woman is saying to you". Sadly, courses created in English and then translated to several different languages following the same basic dialogues/story seem to be the norm.

I think courses are bad or mediocre because it's hard to make a really good one that also brings you money. We know much more about language learning than one would suspect just by looking at the available resources. A textbook that develops oral comprehension well needs to include a lot of audio materials with good exercises (not a CD with 30 minutes of recordings) and a direct play option from the digital version. It hardly ever happens, because making or buying such material is expensive, developing and maintaining the technology is expensive. So what we get is the bland, uninspired 'copy-paste textbooks', easy and cheap to produce.

Le Baron · Postby **Le Baron** » Mon Jun 20, 2022 11:32 am

BeaP wrote:I think courses are bad or mediocre because it's hard to make a really good one that also brings you money. We know much more about language learning than one would suspect just by looking at the available resources. A textbook that develops oral comprehension well needs to include a lot of audio materials with good exercises (not a CD with 30 minutes of recordings) and a direct play option from the digital version. It hardly ever happens, because making or buying such material is expensive, developing and maintaining the technology is expensive. So what we get is the bland, uninspired 'copy-paste textbooks', easy and cheap to produce.

The thing is though (and I think Spanishinput was referencing Pimsleur?) they have spent a lot of research time and money on that and they are primarily taking the position of 'what to say', a little bit more than 'what to understand'. Not entirely though, since they do also focus on understanding whatever the other person is saying, and you can only put so much into a course. However for the most part you need to understand a lot more than you can say, or at least be able to quickly parse sentences more complex than you might be able to construct yourself as output.

luke · Postby **luke** » Mon Jun 20, 2022 3:33 pm

Le Baron wrote:
BeaP wrote:We know much more about language learning than one would suspect just by looking at the available resources. ... So what we get is the bland, uninspired 'copy-paste textbooks', easy and cheap to produce.

(and I think Spanishinput was referencing Pimsleur?) However for the most part you need to understand a lot more than you can say, or at least be able to quickly parse sentences more complex than you might be able to construct yourself as output.

Just trying to understand here. Ms. BeaP, you're saying that language learning materials are often about marketing and sales, rather than producing effective products, right?

On the Pimsleur angle, as well as how much one needs to understand versus what one needs to say, I'm reminded of a mature polyglot who worked for an NGO and praised Pimsleur and other things she found effective. For someone working in a powerful NGO, being able to make some small talk was (probably) combined with the ability to present positions that the NGO thinks are favorable to the audience. In that particular circumstance, the need seems to be a bit like what Taleb would call a "barbell strategy". I.e., on one side she can make people feel comfortable in a brief encounter, and on the other she can deliver a comprehensive talk on the NGO's policy, suggestions, or strategy. E.g., not necessarily a lot of give and take in conversation. I'm not at all suggesting this individual couldn't handle all aspects of her target languages, but rather that having full competence in the areas that were necessary for her job was what really mattered.

Where I'm going with that is, she probably didn't need to understand popular television, conversation in a night club, TL fiction, etc.

In the context of the thread, frequency dictionaries do seem to tilt more in the direction of "non-fiction", e.g., news, reports, popular science, etc., as opposed to flowery fiction.

Le Baron · Postby **Le Baron** » Mon Jun 20, 2022 3:50 pm

luke wrote: E.G., not necessarily a lot of give and take in conversation. I'm not at all suggesting this individual couldn't handle all aspects of her Target Languages, but rather having full competence in the areas that were necessary for her job was what really mattered.

Where I'm going with that is, she probably didn't need to understand popular television, conversation in a night club, TL fiction, etc.

Though she does need to broadly know how receivers of messages provided in X-language expect the messages to be constructed. This is the same problem as when someone learns a language, then goes to speak to the natives and meets the infuriating spectacle of: "excuse me? What did you say?".

So I'd say that for an L2 speaker, comprehension generally exceeds output ability, and that this is normal and perhaps desirable. As time moves on, the output gets closer to the comprehension ability, and it always moves in that direction, not comprehension towards output. I wouldn't want anyone to misinterpret this, because I think it's good to start communicating when you can with what you have, which will be increasing comprehension and some output ability.

The frequency dictionary is probably one tool to assist with word recognition, since such dictionaries are designed to make one familiar with commonly encountered words in as many contexts as possible without being 5000 pages long or, as Badger said, just a dictionary. The aim is to implant some recognition for comprehension rather than to provide you with some kind of ready arsenal for output: something which has a learning curve, rather than being just provided and used in the field.
