The importance of n-grams or why you still don't understand spoken Spanish even though you've learned it for years

General discussion about learning languages
User avatar
luke
Brown Belt
Posts: 1243
Joined: Fri Aug 07, 2015 9:09 pm
Languages: English (N). Spanish (intermediate), Esperanto (B1), French (intermediate but rusting)
Language Log: https://forum.language-learners.org/vie ... 15&t=16948
x 3632

Re: The importance of n-grams or why you still don't understand spoken Spanish even though you've learned it for years

Postby luke » Thu Nov 11, 2021 4:35 pm

AllSubNoDub wrote:I've heard of this chunking/n-gram/whatever method advertised in terms of speaking, but never framed as helpful for listening (and why). I think it's a great idea.

I agree. The list SpanishInput provided is helpful because it's based on the frequency of actual content, and since it's only 50 items, it's not too daunting. That's also why I think practicing them could be helpful.

Vocal reductions are interesting. I notice that Google is often better at picking up some of them than I am ;)

But if a learner can get these n-grams to flow freely out of his/her mouth in a well-understood set of phrases, the typical vocal reductions will start to appear, and that should help listening comprehension. E.g., lo que pasa es que = lokepaseske.

But Google is often not as smart when it comes to the bigger context, i.e., it misinterprets a word that seems quite clear to me. For instance, if a speaker drops in an English word like "undergrad", Google doesn't realize what the speaker is doing, even though it's apparent from the context.
4 x
: 124 / 124 Cien años de soledad 20x
: 5479 / 5500 5500 pages - Reading
: 51 / 55 FSI Basic Spanish 3x
: 309 / 506 Camino a Macondo

BeaP
Green Belt
Posts: 405
Joined: Sun Oct 17, 2021 8:18 am
Languages: Hungarian (N), English, German, Spanish, French, Italian
x 1990

Re: The importance of n-grams or why you still don't understand spoken Spanish even though you've learned it for years

Postby BeaP » Thu Nov 11, 2021 6:49 pm

s_allard wrote:All this is based on the laws of Zipf and Pareto that tell us that in spoken language a very small number of elements represent a very large proportion of total usage. We have all seen statistics saying that in spoken English, something like 30 words make up around 50% of all words used. Or with 1000 words, you will cover over 90% of all spoken language. These are not exact figures of course. With today’s technology we can better observe and analyze these phenomena.

Are the numbers (1000/90) general, or are they related to separate idiolects? I wonder how much overlap there is between different people's everyday vocabulary. I sometimes recognise that I use a lot of phrases recurrently, but people around me tend to use others.
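BeaP's overlap question can actually be put into numbers. Here is a minimal Python sketch (the function name and the toy sentences are mine, purely for illustration) that computes the Jaccard overlap between the sets of word types two speakers use:

```python
def vocab_overlap(text_a, text_b):
    """Jaccard overlap between the sets of word types two speakers use."""
    a = set(text_a.lower().split())
    b = set(text_b.lower().split())
    return len(a & b) / len(a | b)

speaker1 = "what happens is that I always say the same things"
speaker2 = "I always say that the weather is awful"
print(round(vocab_overlap(speaker1, speaker2), 2))  # → 0.5
```

Run over real transcripts instead of toy sentences, this would give a rough measure of how much two idiolects share.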
3 x

User avatar
Iversen
Black Belt - 4th Dan
Posts: 4787
Joined: Sun Jul 19, 2015 7:36 pm
Location: Denmark
Languages: Monolingual travels in Danish, English, German, Dutch, Swedish, French, Portuguese, Spanish, Catalan, Italian, Romanian and (part time) Esperanto
Ahem, not yet: Norwegian, Afrikaans, Platt, Scots, Russian, Serbian, Bulgarian, Albanian, Greek, Latin, Irish, Indonesian and a few more...
Language Log: viewtopic.php?f=15&t=1027
x 15049

Re: The importance of n-grams or why you still don't understand spoken Spanish even though you've learned it for years

Postby Iversen » Thu Nov 11, 2021 8:10 pm

BeaP wrote:I sometimes recognise that I use a lot of phrases recurrently, but people around me tend to use others.

That's also my impression. Methinks such differences can be ascribed to differing sources of input - or just to differing stylistic predilections, which implies some kind of wilful censorship applied to our shared input.
3 x

Cainntear
Black Belt - 3rd Dan
Posts: 3533
Joined: Thu Jul 30, 2015 11:04 am
Location: Scotland
Languages: English(N)
Advanced: French,Spanish, Scottish Gaelic
Intermediate: Italian, Catalan, Corsican
Basic: Welsh
Dabbling: Polish, Russian etc
x 8809
Contact:

Re: The importance of n-grams or why you still don't understand spoken Spanish even though you've learned it for years

Postby Cainntear » Sat Nov 13, 2021 3:01 pm

SpanishInput wrote:@Cainntear: Yup, sometimes these common n-grams can only be used with far more complex structures.

So you agree with my statement... but haven't said anything about my conclusion, which I believe undermines your main point.

I think what you've done here is simplify your message to the point where you weren't saying what you actually believe. We all do that at times, but the problem is that if you're not arguing what you believe, then anyone arguing against you is simultaneously wrong and right...!

SpanishInput wrote:Funny enough, videos by him are what inspired me to create a corpus of subtitles to be able to extract actual data instead of relying on instinct and personal opinions. I had a student who watched his channel and this student kept coming with idioms that I had not heard in my entire life. I showed my student, with data, that those idioms were exclusive to Spain and weren't even that common within Spain.

I agree with you on this (as a general principle -- I'm not familiar with the series in question):
All too often, "how people speak" is used as a justification for an arbitrary series of phrases that are of limited use, specific to certain geographies, or just plain out of date.

However, I think the biggest misunderstanding is treating "how people speak" as something more important than grammar, when in fact grammar is how people speak. Try going a day in any European language without using the conditional mood or a single subordinate clause... it's not easy!

As stated in the paper by Bert Cappelle and Natalia Grabar, "Towards an n-grammar of English" (Constructionist Approaches to Second Language Acquisition and Foreign Language Teaching, 2016, hal-01426700):
"Corpus-based vocabulary teaching prevents certain ‘pet’ expressions in ESL/EFL, such as raining cats and dogs, from being taught too vigorously, and common but less favorite ones, such as right up your (or his, her, etc.) alley, from being ignored altogether."


So more research on n-grams should prevent the problem of teachers and course creators relying too much on their local dialects and personal biases and could lead to the creation of Spanish courses that better reflect a more universal view of the language.

Indeed, and what was missing from your first post is that you're talking about teachers, not about learners. I've got nothing against discussing what teachers should do, but as this is first and foremost a forum for learners, you do need to be explicit when you're talking about teachers.

I must clarify that n-grams are just raw data. Of course no learner should approach them without guidance. Ideally, course creators would use them to inform what to include in the content, particularly in listening exercises. Not necessarily incorporating them as "things to learn", but as things that are just there, in the content of the course.

Exactly -- raw data, and quite a naïve type of data at that.

Because n-grams only capture adjacency relationships, they don't properly capture all collocations: not all collocations are n-grams, at least not in all their realisations.

For example, we can look at English's separable multi-part verbs -- "pick it up", "pick the box up", "pick up (something else)" -- a single collocation that is represented by a near-infinite number of n-grams. And while the bigrams are common (an n-gram model treats "pick up", "picks up" and "picked up" as different n-grams), each bigram on its own understates the frequency of the full collocation.

The problem is arguably worse with Spanish verbos pronominales (or in fact any transitive verbs) because now we've got multiple unigrams (cuidarse, cuidarme, etc., cuídate, cuídateme, cuídese, etc.), multiple bigrams (me cuido, te cuidas, etc.) and multiple longer and rarer n-grams that together still represent a single construct.
For example, No te me vayas a dormir, hijita has three words between the reflexive pronoun and the verb it qualifies, and it's still the same construct as dormirse.
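The point about inflected and discontinuous realisations is easy to demonstrate. In this minimal Python sketch (the toy corpus and the `ngrams` helper are invented for illustration), a naive bigram counter scatters one collocation across several separate counts:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token sequences in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

corpus = ("pick it up . she picks up the phone . "
          "he picked the box up . pick up the pace .").split()

bigrams = Counter(ngrams(corpus, 2))

# One collocation, but the counts are scattered across realisations:
print(bigrams[("pick", "up")])     # 1 -- only the exact adjacent form
print(bigrams[("picks", "up")])    # 1 -- inflection counted separately
print(bigrams[("picked", "the")])  # 1 -- "picked ... up" missed entirely
```

No single n-gram count reflects the true frequency of the "pick up" construct in this tiny corpus.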

N-grams definitely have their uses (for years, Google Translate got passable results from an approach that was almost entirely based on them!) but it just seems to me that you're overselling them here. They are part of the story in informing teachers in what to teach and in what order, but only a part.
1 x

User avatar
Iversen
Black Belt - 4th Dan
Posts: 4787
Joined: Sun Jul 19, 2015 7:36 pm
Location: Denmark
Languages: Monolingual travels in Danish, English, German, Dutch, Swedish, French, Portuguese, Spanish, Catalan, Italian, Romanian and (part time) Esperanto
Ahem, not yet: Norwegian, Afrikaans, Platt, Scots, Russian, Serbian, Bulgarian, Albanian, Greek, Latin, Irish, Indonesian and a few more...
Language Log: viewtopic.php?f=15&t=1027
x 15049

Re: The importance of n-grams or why you still don't understand spoken Spanish even though you've learned it for years

Postby Iversen » Sat Nov 13, 2021 3:31 pm

Teachers should definitely be aware of recurring patterns, and basing that knowledge on empirical research would be a good idea. But I have been my own teacher for most of my life, so I'm also interested.

Collecting raw data is obviously necessary, but as Cainntear rightly points out, some kind of analysis might be a good idea. The main problems are a) patterns that only differ on the morphological level, and b) discontinuous patterns. A gathering machine should be able to catch "take up", "took up" and "took the box up" and refer them all to the same n-gram (or maybe the first two to the same n-gram and the third to a related one).

The first task could be accomplished by a machine that has access to all forms of all common verbs (something like Verbix, but in the form of a database) plus all forms of other words (which would have to be constructed first). The second task is slightly more complicated because the element between "take" and "up" can be just about any physical object you can imagine - but it is still something that can be done by statistical analysis alone, i.e. by a machine.

One problem to take into account during this process would be keeping n-grams of the form A + B separate from both A + B + C and A + (something irrelevant) + B - but even that should in principle be possible without human interference. I do, however, see the need for some human supervision when it comes to separating or not separating n-grams, because there may be some semantics involved.
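The two tasks described above can be roughed out in a few lines. In this Python sketch, the tiny lemma table and the gap window are my own assumptions, standing in for the full morphological database the post describes; the code normalises verb forms to a lemma and then matches the discontinuous pattern "take ... up":

```python
# Toy lemma table, standing in for a full verb-form database (à la Verbix).
LEMMAS = {"takes": "take", "took": "take", "taken": "take", "taking": "take"}

def count_take_up(text, max_gap=3):
    """Count realisations of 'take ... up' with 0..max_gap words between."""
    tokens = [LEMMAS.get(t, t) for t in text.lower().split()]
    hits = 0
    for i, tok in enumerate(tokens):
        # lemma match, then look for "up" within the allowed gap window
        if tok == "take" and "up" in tokens[i + 1 : i + 2 + max_gap]:
            hits += 1
    return hits

text = "She took the offer up while he takes up golf and I take it up too"
print(count_take_up(text))  # → 3
```

The `max_gap` cutoff is exactly the kind of A + (something irrelevant) + B decision that, as the post notes, may ultimately need human supervision.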


PS: I get sinister connotations from the word "n-gram" - it reminds me of Ron Hubbard's "engrams". :roll:
4 x

User avatar
AllSubNoDub
Orange Belt
Posts: 172
Joined: Thu Aug 26, 2021 10:44 pm
Languages: English (N)
Speaks: Spanish (B1+), German (B2 dormant)
Learns: Japanese (Kanji only)
Language Log: https://forum.language-learners.org/vie ... 15&t=17191
x 475

Re: The importance of n-grams or why you still don't understand spoken Spanish even though you've learned it for years

Postby AllSubNoDub » Sat Nov 13, 2021 4:01 pm

Iversen wrote:Collecting raw data is obviously necessary, but as Cainntear rightly points out, some kind of analysis might be a good idea. [...]

We're missing the forest for the trees here. If something isn't an n-gram, or doesn't fall out of the sieve during the n-gram analysis, that doesn't mean it's not important. But if something still falls out of the analysis as extremely common despite all the limitations @Cainntear and @Iversen have pointed out, then to my mind it should be learned today.
1 x

s_allard
Blue Belt
Posts: 985
Joined: Sat Jul 25, 2015 3:01 pm
Location: Canada
Languages: French (N), English (N), Spanish (C2 Cert.), German (B2 Cert)
x 2373

Re: The importance of n-grams or why you still don't understand spoken Spanish even though you've learned it for years

Postby s_allard » Sat Nov 13, 2021 4:29 pm

BeaP wrote:
s_allard wrote: [...]

Are the numbers (1000/90) general, or are they related to separate idiolects? I wonder how much overlap there is between different people's everyday vocabulary. I sometimes recognise that I use a lot of phrases recurrently, but people around me tend to use others.


The figures I guesstimated are based on my recollections of papers by Paul Nation. His work on vocabulary size is heavily skewed towards written English, but the observations are all the same: a relatively small number of units, or what he calls word-families, make up the majority of language usage.

People all speak or write a language in a unique manner, but there is obviously considerable overlap if people are speaking the same language. The way all these vocabulary size studies work is that they take a set of texts or recordings and sum all the unique words in each text or recording. This is how we end up with these often misleading statistics saying that you need something like 6,000 words to read contemporary novels, when in fact a given novel may only contain 1,500 unique words.
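The unique-word and coverage arithmetic behind these statistics can be sketched in Python (the toy sentence and the `coverage` function are illustrative only):

```python
from collections import Counter

def coverage(tokens, top_k):
    """Fraction of running words covered by the top_k most frequent types."""
    counts = Counter(tokens)
    top = sum(c for _, c in counts.most_common(top_k))
    return top / len(tokens)

tokens = "the cat sat on the mat and the dog sat by the door".split()
print(len(set(tokens)))               # 9 unique words among 13 running words
print(round(coverage(tokens, 2), 2))  # → 0.46: two types cover almost half
```

Even in a thirteen-word toy text, the two commonest types already cover nearly half the running words, which is the Zipf concentration being discussed.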
3 x

User avatar
AllSubNoDub
Orange Belt
Posts: 172
Joined: Thu Aug 26, 2021 10:44 pm
Languages: English (N)
Speaks: Spanish (B1+), German (B2 dormant)
Learns: Japanese (Kanji only)
Language Log: https://forum.language-learners.org/vie ... 15&t=17191
x 475

Re: The importance of n-grams or why you still don't understand spoken Spanish even though you've learned it for years

Postby AllSubNoDub » Sat Nov 13, 2021 5:26 pm

s_allard wrote: [...] This is how we end up with these often misleading statistics saying that you need something like 6000 words to read contemporary novels when in fact a given novel may only contain 1500 unique words.

Common does not mean easy. The more common a word is, the more meanings it can carry. Speaking of carrying: as an example, try defining the Spanish word "llevar". It has well over a dozen distinct meanings, some just nuanced differences, others completely unrelated. This is less of a problem for lower-frequency words, which typically carry only one meaning.

This is what 90% vocabulary coverage looks like:
In the morning, you start again. You shower, get dressed, and walk pocklent. You move slowly, half- awake. Then, suddenly, you stop. Something is different. The fribs are fossit. Really fossit. There are no people. No assengles. Nothing. “Where is dowargle?” you ask yourself. Suddenly, there is a loud quapen—a befourn assengle. It speeds by and almost hits you. It vickarns into a store across the frib! Then, another befourn assengle farfoofles. The befourn officer sees you. “Off the frib!” he shouts. “Go home, lock your loopity!” “What? Why?” you shout back. But it’s too late. He is gone.

(added underlines so as not to confuse ESL learners, but the effect is even stronger without them)
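A demonstration like the one above can be generated mechanically. In this Python sketch (the function name, the pseudoword list and the toy text are all mine), the most frequent word types are kept until the running-word coverage target is reached, and every other type is swapped for a stable pseudoword:

```python
from collections import Counter

PSEUDO = ["frib", "fossit", "assengle", "quapen", "loopity", "befourn"]

def simulate_coverage(text, known_fraction):
    """Keep the commonest word types up to the target running-word
    coverage; replace every other type with a stable pseudoword."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    known, covered = set(), 0
    for word, c in counts.most_common():
        if covered / len(tokens) >= known_fraction:
            break
        known.add(word)
        covered += c
    # map each unknown type to a fixed pseudoword so repeats stay consistent
    unknown = [w for w in counts if w not in known]
    pseudo = {w: PSEUDO[i % len(PSEUDO)] for i, w in enumerate(unknown)}
    return " ".join(pseudo.get(t, t) for t in tokens)

print(simulate_coverage("the cat sat on the mat and the cat saw the dog", 0.5))
```

Note that "N% coverage of running words" can still leave most of the distinct content words unknown, which is exactly why the passage above feels so opaque.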



Edit: Also, please tell me where 1500 words brings you to 90%? https://www.lextutor.ca/cover/papers/nation_2006.pdf
3 x

s_allard
Blue Belt
Posts: 985
Joined: Sat Jul 25, 2015 3:01 pm
Location: Canada
Languages: French (N), English (N), Spanish (C2 Cert.), German (B2 Cert)
x 2373

Re: The importance of n-grams or why you still don't understand spoken Spanish even though you've learned it for years

Postby s_allard » Sat Nov 13, 2021 9:21 pm

AllSubNoDub wrote: [...] Edit: Also, please tell me where 1500 words brings you to 90%? https://www.lextutor.ca/cover/papers/nation_2006.pdf



I’ll let others comment on the rest of this post, and I’ll concentrate on the last line, since it refers to something I wrote. Here is a quote from the end of the Paul Nation article linked above:

“If we take 98% as the ideal coverage, a 8,000–9,000 word-family vocabulary is needed for dealing with written text, and 6,000–7,000 families for dealing with spoken text. Clearly, spoken language makes slightly greater use of the high-frequency words of the language than written language does. In contrast, we need to consider that text coverage greater than 98% may be needed to cope effectively with the transitory nature of spoken language. The data we have looked at in this article suggest the following conclusions.

1. The greatest variation in vocabulary coverage is most likely to occur in the first 1,000 words, and in the proper nouns. The first 1,000 plus proper nouns cover 78%–81% of written text, and around 85% of spoken text.
2. The fourth 1,000 and fifth 1,000 words provide around 3% coverage of most written text, and 1.5%–2% coverage of spoken text.
3. The four levels of the sixth to ninth 1,000 provide around 2% coverage of written text and around 1% coverage of spoken text.
4. The five levels of tenth to fourteenth 1,000 provide coverage of less than 1% of written text and 0.5% of spoken.”

Now, I’ll admit that Nation does not explicitly say that 1,500 word families give you 90% coverage, but what he says in conclusion 1 seems close enough for me. Furthermore, I’m certainly not saying that 90% coverage is good for anything. Nation, like all observers in the field, says that around 98% text coverage is necessary for good understanding.

What I have said, here and on many other occasions, is that a statement like "If we take 98% as the ideal coverage, a 8,000–9,000 word-family vocabulary is needed for dealing with written text, and 6,000–7,000 families for dealing with spoken text" can be misleading, because it gives one the impression that you need 6,000–7,000 word-families to understand every given text. This is simply not the case. On page 16 of the very same article, Nation writes:

" How many word-families do you need to know to be familiar with most words in a children’s movie?

The popular children’s movie Shrek was chosen for analysis. The script, excluding stage directions, is almost 10,000 tokens long, and uses a total of almost 1,100 word-families "

As you can see, you only need around 1,100 word families to understand the movie Shrek.

Speaking of children’s works, I should point out that one needs only 50 words for 100% coverage of the entire book Green Eggs and Ham by Dr Seuss.

I believe this line of discussion is tangential to this thread. The central idea here is the value of focusing on n-grams or micro-structures as part of an approach to improving understanding (and speaking) spoken Spanish.
4 x

User avatar
AllSubNoDub
Orange Belt
Posts: 172
Joined: Thu Aug 26, 2021 10:44 pm
Languages: English (N)
Speaks: Spanish (B1+), German (B2 dormant)
Learns: Japanese (Kanji only)
Language Log: https://forum.language-learners.org/vie ... 15&t=17191
x 475

Re: The importance of n-grams or why you still don't understand spoken Spanish even though you've learned it for years

Postby AllSubNoDub » Sat Nov 13, 2021 10:40 pm

I'm not sure why you're only interested in the last part of my response; it was directed at you.

Anyway, you mentioned literature, not Shrek or Green Eggs and Ham, so you would be at a much lower coverage than the figure you picked from the article. Also, coverage grows logarithmically as vocabulary grows, not linearly. Therefore, you can't say you'd get 90% coverage from 1500 word families (not actual words, btw) based on what you've presented.

This is what 80% coverage looks like, which would be much closer to reality for actual novels:
"Bingle for help!” you shout. “This loopity is dying!” You put your fingers on her neck. Nothing. Her flid is not weafling. You take out your joople and bingle 119, the emergency number in Japan. There’s no answer! Then you muchy that you have a new befourn assengle. It’s from your gutring, Evie. She hunwres at Tokyo University. You play the assengle. “…if you get this…” Evie says. “…I can’t vickarn now… the important passit is…” Suddenly, she looks around, dingle. “Oh no, they’re here! Cripett… the frib! Wasple them ON THE FRIB!…” BEEP! the assengle parantles. Then you gratoon something behind you…

This is not comprehensible input imo, and trying to read extensively at this level of understanding is a very inefficient way to grow your vocabulary. It can be made more comprehensible, but that's not extensive reading. I agree, we're getting a bit off topic.
2 x

