The importance of n-grams or why you still don't understand spoken Spanish even though you've learned it for years

General discussion about learning languages
User avatar
luke
Brown Belt
Posts: 1243
Joined: Fri Aug 07, 2015 9:09 pm
Languages: English (N). Spanish (intermediate), Esperanto (B1), French (intermediate but rusting)
Language Log: https://forum.language-learners.org/vie ... 15&t=16948
x 3632

Re: The importance of n-grams or why you still don't understand spoken Spanish even though you've learned it for years

Postby luke » Sat Nov 13, 2021 11:41 pm

AllSubNoDub wrote:Anyway, you mentioned literature, not Shrek or Green Eggs and Ham.

This is what 80% coverage looks like, which would be much closer to reality for actual novels:

One of the interesting things Professor Arguelles mentions in his lecture on reading is that the cumulative effect of 80% or 90% or 95% comprehension is that one gradually gets lost as the novel continues. He attributed his dislike of a certain book by Umberto Eco in Italian, one he had expected to find interesting, to his word coverage at the time being insufficient (below 98%).

You made another good point about the many meanings of high frequency words.

I'm going to go out on a limb here and suggest that the reason one may not understand spoken Spanish even after years of learning is that they haven't left their comfort zone for long enough to understand spoken Spanish. It's not easy and it's not always clear. Does one not like certain content because it doesn't reflect much of what they value, or because of the sometimes contradictory ways certain common words are used, or a lack of experience with vocal reductions? Is it semantics, is it n-grams?
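The coverage percentages discussed in this thread are easy to pin down concretely. As a rough sketch (token-based, with an invented sample sentence and known-word set), lexical coverage is just the share of running words the reader knows:

```python
def coverage(text, known_words):
    """Fraction of running words (tokens) in `text` that the reader knows."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    return sum(1 for t in tokens if t in known_words) / len(tokens)

sample = "One of the interesting things about reading is that gaps compound"
vocab = {"one", "of", "the", "things", "about", "reading", "is", "that"}
print(round(coverage(sample, vocab), 2))   # 8 of 11 tokens known -> 0.73
```

Note that coverage counts tokens, not distinct words, which is why a fairly small vocabulary of high-frequency words already yields high percentages, and why the last few points toward 98% are so expensive.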
3 x
: 124 / 124 Cien años de soledad 20x
: 5479 / 5500 5500 pages - Reading
: 51 / 55 FSI Basic Spanish 3x
: 309 / 506 Camino a Macondo

User avatar
Iversen
Black Belt - 4th Dan
Posts: 4782
Joined: Sun Jul 19, 2015 7:36 pm
Location: Denmark
Languages: Monolingual travels in Danish, English, German, Dutch, Swedish, French, Portuguese, Spanish, Catalan, Italian, Romanian and (part time) Esperanto
Ahem, not yet: Norwegian, Afrikaans, Platt, Scots, Russian, Serbian, Bulgarian, Albanian, Greek, Latin, Irish, Indonesian and a few more...
Language Log: viewtopic.php?f=15&t=1027
x 15020

Re: The importance of n-grams or why you still don't understand spoken Spanish even though you've learned it for years

Postby Iversen » Sat Nov 13, 2021 11:53 pm

AllSubNoDub wrote:This is what 80% coverage looks like, which would be much closer to reality for actual novels:
"Bingle for help!” you shout. “This loopity is dying!”. (...)


The problem with this kind of argument is that the supposedly rare words are often international technical, scientific, or cultural terms (at least in non-fictional works), and then they may, against all statistically founded odds, be comprehensible. This lexical quirk still won't take you to the coveted 98% coverage, but it takes you closer to the goal.
3 x

Cainntear
Black Belt - 3rd Dan
Posts: 3527
Joined: Thu Jul 30, 2015 11:04 am
Location: Scotland
Languages: English(N)
Advanced: French, Spanish, Scottish Gaelic
Intermediate: Italian, Catalan, Corsican
Basic: Welsh
Dabbling: Polish, Russian etc
x 8793
Contact:

Re: The importance of n-grams or why you still don't understand spoken Spanish even though you've learned it for years

Postby Cainntear » Sun Nov 14, 2021 12:01 am

Iversen wrote:A gathering machine should be able to catch both "take up" and "took up" and "took the box up" and refer them to the same n-gram (or maybe 1+2 to the same n-gram and no.3 to a related one).

Those aren't n-grams, though -- n-grams are specifically a string of n consecutive tokens. What you're talking about is just plain old collocations, which are far more useful for human analysis, but a lot more work for automated systems to attempt to detect.
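For reference, an n-gram in the strict sense is nothing more than a sliding window over the token stream. A minimal Python sketch (illustrative only) makes both the cheapness and the limitation visible:

```python
def ngrams(tokens, n):
    """All windows of n consecutive tokens: one pass, no linguistic analysis."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "she took the box up the stairs".split()
print(ngrams(tokens, 2))
# ("took", "up") never appears as a contiguous bigram here, which is exactly
# why a fixed-window n-gram model misses the collocation "took ... up".
```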

n-grams are just a computationally cheap option, so they're still occasionally used where the size of the task means more accurate tools are not viable, something which is true in fewer and fewer situations as computing power gets cheaper and cheaper.

In fact, n-grams are essentially the oldest tool in computational linguistics -- they're basically a direct development of Russian mathematician Andrey Markov's work on trying to statistically analyse the probabilities of letter sequences in Pushkin's poetry. Markov's big thing, Markov chains, is about the probability of certain events occurring in sequence without analysing the whole sequence, only the most recent events.

If you use Markov chains with whole-word tokens (instead of the individual letters he used) and a decent-sized training set (far more data than Markov could have processed by hand), you get to the point where your computer can generate meaningless text that looks natural as long as you don't attempt to read it closely. That holds if you focus on trigrams, but by the time you hit 5-grams, the system starts spitting out large chunks of the training data verbatim.
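A toy version of the word-level Markov generator described above (the corpus here is deliberately tiny and invented for illustration; a real model would be trained on vastly more data):

```python
import random
from collections import defaultdict

def train(tokens, n):
    """Map each (n-1)-token context to the words observed after it."""
    model = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        context, nxt = tuple(tokens[i:i + n - 1]), tokens[i + n - 1]
        model[context].append(nxt)
    return model

def generate(model, seed, length):
    """Extend `seed` word by word, conditioning only on the last context."""
    out = list(seed)
    for _ in range(length):
        choices = model.get(tuple(out[-len(seed):]))
        if not choices:
            break
        out.append(random.choice(choices))
    return " ".join(out)

corpus = ("the cat sat on the mat and the dog sat on the rug "
          "and the cat saw the dog").split()
model = train(corpus, 3)                 # trigram model: condition on 2 words
print(generate(model, ("the", "cat"), 8))
```

With a corpus this small most contexts have only one continuation, so the output is largely the training text regurgitated, which is the verbatim-reproduction problem at higher n in miniature.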
6 x

User avatar
SpanishInput
Yellow Belt
Posts: 97
Joined: Sun Sep 26, 2021 3:11 pm
Location: Ecuador
Languages: Spanish (N), English (C2), Mandarin (HSK 5)
x 469

Re: The importance of n-grams or why you still don't understand spoken Spanish even though you've learned it for years

Postby SpanishInput » Sun Nov 14, 2021 2:59 am

Cainntear wrote:you get to the point where your computer can generate meaningless text that looks natural as long as you don't attempt to read it closely. That holds if you focus on trigrams, but by the time you hit 5-grams, the system starts spitting out large chunks of the training data verbatim.


Hi, Cainntear! Crazy idea: If old computer systems can get very good at predicting text and even generating text when fed statistics of raw n-grams, wouldn't the same apply to the human brain?

I'm no linguist and no computer scientist, just a nerd. But if a computer can learn to identify a string of sounds as "how to recognize speech" instead of "how to wreck a nice peach" thanks to knowing which sequences of words are more probable, it seems plausible to me that we could train the human brain in a similar way. Maybe with enough exposure to how the most common n-grams sound in the real world, with reductions, /s/ aspirated to [h], dropped /d/ and the like... maybe this kind of training would help learners with "speech recognition" in their target language? After all, when we listen to our native language, we're actually doing a great deal of prediction. We fill in the gaps of everything that wasn't properly pronounced.
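The disambiguation idea can be sketched with raw bigram counts standing in for probabilities (the corpus below is invented for illustration and stands in for a lifetime of heard language, not real speech-recognition machinery):

```python
from collections import Counter

# Toy corpus: the listener's accumulated experience of word sequences.
corpus = ("how to recognize speech is a common question "
          "how to recognize speech in noise "
          "to recognize speech you need context").split()

bigrams = Counter(zip(corpus, corpus[1:]))

def score(sentence):
    """Sum of bigram counts: a crude stand-in for sequence probability."""
    words = sentence.split()
    return sum(bigrams[(a, b)] for a, b in zip(words, words[1:]))

print(score("how to recognize speech"))    # 8: a familiar sequence
print(score("how to wreck a nice peach"))  # 2: only "how to" is familiar
```

Given two phonetically similar transcriptions, the one whose word sequence scores higher wins, which is the prediction-and-gap-filling behaviour described above in its simplest possible form.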
2 x

BeaP
Green Belt
Posts: 405
Joined: Sun Oct 17, 2021 8:18 am
Languages: Hungarian (N), English, German, Spanish, French, Italian
x 1990

Re: The importance of n-grams or why you still don't understand spoken Spanish even though you've learned it for years

Postby BeaP » Sun Nov 14, 2021 7:41 am

luke wrote:I'm going to go out on a limb here and suggest the reason one may not understand spoken Spanish even though they've learned it for years is they haven't left their comfort zone for long enough to understand spoken Spanish. It's not easy and it's not always clear. Does one not like certain content because it doesn't reflect much of the things they value, or because of the sometimes contradictory ways certain common words are used, or lack of experience with vocal reductions, or is semantic, is it n-grams?

I totally agree with the first part of your statement. I started to improve my listening comprehension by watching series, in Spanish with Spanish subtitles. Listening and reading at the same time helped me learn how to pair up the heard sound sequences with vocabulary elements in my head. After a while I understood every movie or TV series, so I ventured out and started to watch talk shows (not scripted) and listen to podcasts (no visual clues, no mouth movement). And now I'm looking for channels totally unrelated to language learning that people make for a native audience about their hobbies. I search for channels run by intelligent people who think and speak very quickly. Even natives don't have the same language skills: some are better models than others for an advanced learner, with a wider vocabulary and the ability to make language jokes. There are some dialects, mainly from the south of Spain, that are quite difficult for me to understand.

I don't think the answer to this problem is a complicated linguistic one. This is a process of continuous development. For the majority of learners there is no such thing as 'understanding spoken Spanish' in general; that's the end of the road. Understand what? Understand whom? These are the questions you need to ask yourself. And yes, one has to come out of the comfort zone and look for new types of content after a while. But it requires work and research.
2 x

Cainntear
Black Belt - 3rd Dan
Posts: 3527
Joined: Thu Jul 30, 2015 11:04 am
Location: Scotland
Languages: English(N)
Advanced: French, Spanish, Scottish Gaelic
Intermediate: Italian, Catalan, Corsican
Basic: Welsh
Dabbling: Polish, Russian etc
x 8793
Contact:

Re: The importance of n-grams or why you still don't understand spoken Spanish even though you've learned it for years

Postby Cainntear » Sun Nov 14, 2021 11:26 am

SpanishInput wrote:
Cainntear wrote:you get to the point where your computer can generate meaningless text that looks natural as long as you don't attempt to read it closely. That holds if you focus on trigrams, but by the time you hit 5-grams, the system starts spitting out large chunks of the training data verbatim.


Hi, Cainntear! Crazy idea: If old computer systems can get very good at predicting text and even generating text when fed statistics of raw n-grams, wouldn't the same apply to the human brain?

Whether it would or not isn't a particularly useful question, as we already have better ways of training the human brain that give better results. The old n-gram based version of Google Translate (they've ditched n-grams for a deep learning model now -- as I said, computing power is getting cheaper) often failed on long-distance dependencies because it had literally no way of tracking them.

For example, Scottish Gaelic makes use of double negatives in complex sentences for politeness by indirection: "I don't believe he's not wrong" = I think he's right, but you can also say the double positive (which is traditionally seen as too direct and therefore impolite). Gaelic is VSO, so the verb always starts the clause, and if you have more than a few words in the first clause, there is literally no link between the verbs in the first and the second clause unless you use very long n-grams (but that would break the model, as any use of long n-grams results in the system regurgitating extended extracts of the training material). This meant that Google Translate's Scottish Gaelic was no better than tossing a coin in terms of getting the positive/negative polarity of such a sentence right, and half the time it would give literally the opposite meaning of the original Gaelic when translating to English.
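The window problem can be shown directly: with placeholder tokens standing in for the two dependent verbs (an invented schematic sentence, not real Gaelic), no n-gram of realistic length contains them both:

```python
def ngrams(tokens, n):
    """All windows of n consecutive tokens."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

# VERB1 and VERB2 stand for the negated verbs of the two clauses;
# seven tokens apart, they never share a window until n reaches 8.
tokens = "VERB1 w1 w2 w3 w4 w5 that VERB2 w6".split()
for n in (3, 5, 8):
    shared = any("VERB1" in g and "VERB2" in g for g in ngrams(tokens, n))
    print(n, shared)   # 3 False / 5 False / 8 True
```

Any feature that can only be captured by windows that long runs into the regurgitation problem mentioned above, so the polarity link is simply invisible to the model.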

Humans can do better than that.

Except that, OK, Google Translate could have done better than that on Gaelic if there was more input data available to give it, because it has handled similar difficulties in other languages before.

Which brings us back to the question of whether humans can do it.

The amount of data in the corpuses computers use for this sort of task is unimaginably vast.

Way back at the beginning of Translate, Google released a corpus of English language data that filled 6 DVDs, which is a lot of plain text. It contained over a trillion words, which is equivalent to over 25 million novels (assuming a word count of 40,000 per novel). If you read one novel a day, that would still be nearly 70,000 years' worth of reading.

That was Google's own dataset, and not only is it likely that they were using other people's data at that point (which wouldn't be in the dataset they released, as they didn't own it), but it is a known fact that their dataset has expanded continually since then. Google Translate's n-gram model (which has been abandoned because it was a technological dead end and had reached the limits of its usefulness) relied on more data than a human could process in an entire lifetime in order to make a passable translation for a relatively easy language pair like EN<->ES, so that's not something we'd want to replicate as humans.

Moreover, unlike computers, human memory is imperfect. The computer gets to compare everything it has ever seen, whereas humans are only likely to extract patterns when they are presented in temporal proximity. That means the human attempting to replicate the machine systems is going to be worse at it and is going to need more time and more data.

I'm no linguist and no computer scientist, just a nerd.

Or in other words, you don't understand what n-grams are. Surely it would be better to ask us about n-grams rather than tell us about them, on the grounds that there's a good chance that there's at least one person in a group like this who actually knows what they are and how they work? (And I know I'm not the only one here who has practical experience of working with n-grams.)

But, if a computer can learn to identify a string of sounds as "how to recognize speech" instead of "how to wreck a nice peach" thanks to knowing which sequences of words are more probable, it seems plausible to me that we could train the human brain in a similar way.

The human brain is intelligent and wired for language. A computer is a stupid, brute force machine. Computer language processing is a compromise between trying to model how humans process language and what is feasible given the limitations of our understanding of human language processing and the limitations of computation.

Maybe with enough exposure to how the most common n-grams sound in the real world, with reductions, /s/ aspirated to [h], dropped /d/ and the like... maybe this kind of training would help learners with "speech recognition" in their target language? After all, when we listen to our native language, we're actually doing a great deal of prediction. We fill in the gaps of everything that wasn't properly pronounced.

Yes, and that is the part of language that happens naturally, without requiring much conscious direction.

As others have alluded to in this thread, if you engage with the language you get better at it, and part of that is that your brain naturally adapts to the language you're exposed to. Your brain can do that better when it's dealing with grammatical structures than with simple sequences of word tokens, because it can recognise that "no te vayas a dormir" and "me duermo" are built on the same verb, something that n-grams fail to capture.
3 x

s_allard
Blue Belt
Posts: 985
Joined: Sat Jul 25, 2015 3:01 pm
Location: Canada
Languages: French (N), English (N), Spanish (C2 Cert.), German (B2 Cert)
x 2370

Re: The importance of n-grams or why you still don't understand spoken Spanish even though you've learned it for years

Postby s_allard » Sun Nov 14, 2021 1:01 pm

AllSubNoDub wrote:I'm not sure why you're only interested in the last part of my response, it was directed at you.

Anyway, you mentioned literature, not Shrek or Green Eggs and Ham. So you would be at a much lower coverage than what you were able to pick from the article. Also, the graph grows logarithmically as vocabulary grows, not linearly. Therefore, you can't say you'd have 90% coverage from 1500 word families (not actual words, btw) by what you've presented.

This is what 80% coverage looks like, which would be much closer to reality for actual novels:
"Bingle for help!” you shout. “This loopity is dying!” You put your fingers on her neck. Nothing. Her flid is not weafling. You take out your joople and bingle 119, the emergency number in Japan. There’s no answer! Then you muchy that you have a new befourn assengle. It’s from your gutring, Evie. She hunwres at Tokyo University. You play the assengle. “…if you get this…” Evie says. “…I can’t vickarn now… the important passit is…” Suddenly, she looks around, dingle. “Oh no, they’re here! Cripett… the frib! Wasple them ON THE FRIB!…” BEEP! the assengle parantles. Then you gratoon something behind you…

This is not comprehensible input imo and trying to extensively read at this level of understanding is a very inefficient way to grow your vocabulary. It can be made more comprehensible, but that's not extensive reading. I agree, we're getting a bit off topic.


I didn’t respond to the first part of the previous post because I simply couldn’t understand its relevance to the thread, especially the example of 90% text coverage. Ditto for the post at hand here and the example of 80% coverage.

So, instead of heading off on wild-goose chases, I suggest we keep our eyes on the ball and focus on learning strategies for improving our comprehension of spoken Spanish. In this regard, it seems to me that the central question of the thread is whether this idea of using computer-generated n-grams from a corpus of Netflix Spanish as a learning tool is of any value. I understand that some people can get hung up on the computational model and want to argue that to death. I prefer to look at the results, as simple and incomplete as they may be, and see if there is anything of utility.

I happen to think that all this is an old idea, jazzed up with new terminology and new technology. Foreign language learning and polyglottery have been happening centuries before the rise of corpus linguistics, computational linguistics and cognitive linguistics. What the OP is proposing here is a formalization of something I believe all good language learners have been doing intuitively.

In the very first post SpanishInput gives us, for starters, 43 high-frequency five-word n-grams that I prefer to call micro-structures. One sees immediately that these units of meaning are made up of a small number of high-frequency words. You notice, for example, that words like lo, la, el, que, a, nada are used repeatedly in different contexts. Some high-frequency verbs are there in their most common conjugated forms: ser, estar, ir, pasar, tener, hacer, dar, decir, etc. You notice that these word forms appear in a specific order in each phrase. Some words have endings that are linked to other words as part of a syntax system. In other words, there are patterns here.

If you master the underlying grammar and the lexicon of the examples given here, you will have acquired a basic understanding of how spoken Spanish works. At this point, I know some readers will think that I’m about to claim, as I’m wont to do, that 43 phrases are all you need to speak Spanish. No, for heaven’s sake, I’m not saying that. Obviously there is a lot more to understanding spoken Spanish than just these phrases. What I will say, however, is that these phrases, which you will hear often, pack a lot of useful content on which to build.

I also want to emphasize the importance of acquiring large amounts of content words or sheer vocabulary through massive and diverse input.
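For what it's worth, a frequency list like the one in the first post could be produced with a few lines of Python; the subtitle filename below is purely hypothetical, and the inline token list is a tiny invented stand-in:

```python
from collections import Counter

def top_ngrams(tokens, n, k):
    """Count every window of n consecutive tokens; return the k most frequent."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts.most_common(k)

# In practice the tokens would come from a large subtitle dump, e.g.:
# tokens = open("subtitles_es.txt", encoding="utf-8").read().lower().split()
tokens = ("no tiene nada que ver no tiene nada que ver "
          "lo que pasa es que").split()
for gram, freq in top_ngrams(tokens, 5, 3):
    print(freq, " ".join(gram))
```

The interesting work, as this post argues, is not in producing such a list but in reading the grammatical patterns out of it.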
3 x

Cainntear
Black Belt - 3rd Dan
Posts: 3527
Joined: Thu Jul 30, 2015 11:04 am
Location: Scotland
Languages: English(N)
Advanced: French, Spanish, Scottish Gaelic
Intermediate: Italian, Catalan, Corsican
Basic: Welsh
Dabbling: Polish, Russian etc
x 8793
Contact:

Re: The importance of n-grams or why you still don't understand spoken Spanish even though you've learned it for years

Postby Cainntear » Sun Nov 14, 2021 5:52 pm

s_allard wrote:In this regard, it seems to me that the central question of the thread is whether this idea of using computer-generated n-grams from a corpus of Netflix Spanish as a learning tool is of any value. I understand that some people can get hung up on the computational model and want to argue that to death. I prefer to look at the results, as simple and incomplete as they may be, and see if there is anything of utility.

It's not a matter of being "hung up" on anything, it's that the term is technical jargon and has a very specific, precisely defined meaning. If we discard that precise meaning and take the term to be a synonym for another long-established idea, then the term becomes valueless, because we already have perfectly good words to talk about that other idea.

I happen to think that all this is an old idea, jazzed up with new terminology and new technology.

Then you are categorically wrong. This is a relatively recent statistical language processing concept that is deliberately an approximation of the idea from linguistics that you would prefer to talk about -- collocation.

What the OP is proposing here is a formalization of something I believe all good language learners have been doing intuitively.

I agree with this to a reasonable extent.
But the problem is that "bad" language learners are the ones that don't have the intuition or insight that the good ones do, so there's definitely value in identifying the things that good ones do intuitively in order that others may consciously mimic it.

s_allard wrote:If you master the underlying grammar and the lexicon of the examples given here you will have acquired a basic understanding of how spoken Spanish works.

Absolutely, which is why this is something that needs to be looked at by a teacher who's already familiar with the language, not the student who's about to start learning, because this sort of analysis isn't going to tell us what the patterns are, but rather highlight their relative importance.
s_allard wrote:I also want to emphasize the importance of acquiring large amounts of content words or sheer vocabulary through massive and diverse input.

Exactly, and playing around with concordancers takes time away from doing that.

(Edit: formatting error)
1 x

