SpanishInput wrote: Cainntear wrote: you get to the point where your computers can generate meaningless text that looks natural (as long as you don't attempt to actually read it) if you focus on trigrams, but by the time you hit 5-grams, the system starts spitting out large chunks of the training data verbatim.
Hi, Cainntear! Crazy idea: If old computer systems can get very good at predicting text and even generating text when fed statistics of raw n-grams, wouldn't the same apply to the human brain?
Whether it would or not isn't a particularly useful question, as we already have better ways of training the human brain that give better results. The old n-gram based version of Google Translate (they've ditched n-grams for a deep learning model now -- as I said, computing power is getting cheaper) often failed on long-distance dependencies because it had literally no way of tracking them.
For example, Scottish Gaelic makes use of double negatives in complex sentences for politeness by indirection: "I don't believe he's not wrong" = I think he's right, but you can also say the double positive (which is traditionally seen as too direct and therefore impolite). Gaelic is VSO, so the verb always starts the clause, and if you have more than a few words in the first clause, there is literally no link between the verbs in the first and the second clause unless you use very long n-grams (but that would break the model, as any use of long n-grams results in the system regurgitating extended extracts of the training material). This meant that Google Translate's Scottish Gaelic was no better than tossing a coin in terms of getting the positive/negative polarity of such a sentence right, and half the time it would give literally the opposite meaning of the original Gaelic when translating to English.
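To make the windowing problem concrete, here's a minimal sketch (Python; the English sentence is invented to stand in for the Gaelic, since the issue is identical): by the time the model is choosing the polarity of the second clause's verb, its context is only the previous n-1 tokens, and the first clause's negation simply isn't in it.

```python
# A minimal sketch of why a fixed n-gram context loses the link between
# the two verbs. The sentence is invented purely for illustration.

tokens = "I do not believe for one single moment that he is not wrong".split()

N = 3  # a trigram model conditions each word on only the previous N-1 words

# Context available when the model has to decide the polarity of the
# second clause's verb phrase:
target = tokens.index("is")
window = tokens[max(0, target - (N - 1)):target]
print("context the model sees:", window)          # ['that', 'he']

# The first clause's negation is nowhere in that window:
first_not = tokens.index("not")                   # position 2
print("first 'not' inside the window?", first_not >= target - (N - 1))  # False
```

Make the first clause any longer than a couple of words and no realistic n-gram size ever spans both verbs, which is exactly why the polarity came out as a coin toss.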
Humans can do better than that.
Except that, OK, Google Translate could have done better than that on Gaelic if there had been more input data to give it, because it had already handled similar difficulties in other languages.
Which brings us back to the question of whether humans can do it.
The amount of data in the corpuses computers use for this sort of task is unimaginably vast.
Way back at the beginning of Translate, Google released a corpus of English language data that filled 6 DVDs, which is a lot of plain text. It contained over a trillion words -- the equivalent of around 25 million novels (assuming a word count of 40,000 per novel). Even reading one novel a day, that would be nearly 70,000 years' worth of reading.
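A quick back-of-the-envelope check of those figures (the 40,000-word novel is just an assumed average):

```python
# Rough check of the corpus figures quoted above.
total_words = 1_000_000_000_000     # "over a trillion words"
words_per_novel = 40_000            # assumed average novel length
novels = total_words / words_per_novel
years = novels / 365                # reading one novel per day
print(f"{novels:,.0f} novels, roughly {years:,.0f} years of reading")
# -> 25,000,000 novels, roughly 68,493 years of reading
```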
That was Google's own dataset, and not only is it likely that they were using other people's data as well at that point (which wouldn't be in the dataset they released, as they didn't own it), but it is a known fact that their dataset has expanded continually since then. Google Translate's n-gram model (which has since been abandoned because it was a technological dead end and had reached the limits of its usefulness) relied on more data than a human could process in an entire lifetime just to make a passable translation of a relatively easy language pair like EN<->ES, so that's not something we'd want to replicate as humans.
Moreover, unlike computers, human memory is imperfect. The computer gets to compare everything it has ever seen, whereas humans are only likely to extract patterns when they are presented in temporal proximity. That means the human attempting to replicate the machine systems is going to be worse at it and is going to need more time and more data.
I'm no linguist and no computer scientist, just a nerd.
Or in other words, you don't understand what n-grams are. Surely it would be better to ask us about n-grams rather than tell us about them, on the grounds that there's a good chance that there's at least one person in a group like this who actually knows what they are and how they work? (And I know I'm not the only one here who has practical experience of working with n-grams.)
But if a computer can learn to identify a string of sounds as "how to recognize speech" instead of "how to wreck a nice peach" thanks to knowing which sequences of words are more probable, it seems plausible to me that we could train the human brain in a similar way.
The human brain is intelligent and wired for language. A computer is a stupid, brute force machine. Computer language processing is a compromise between trying to model how humans process language and what is feasible given the limitations of our understanding of human language processing and the limitations of computation.
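To be clear about what the quoted trick amounts to, here's a toy sketch (Python; the mini corpus and the add-one smoothing are mine, purely for illustration): the machine isn't "hearing" anything, it's just scoring which of two acoustically similar word sequences has the more probable transitions in its training data.

```python
# Toy demonstration: prefer the candidate transcription whose word
# transitions are more probable. The tiny "corpus" is invented and
# deliberately contains the relevant phrases.
from collections import Counter
from math import log

corpus = (
    "it is easy to recognize speech with enough data "
    "speech recognition systems recognize speech every day "
    "do not wreck a nice peach at the picnic"
).split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def log_prob(sentence, alpha=1.0):
    """Add-alpha smoothed bigram log-probability of a word sequence."""
    words = sentence.split()
    vocab = len(unigrams)
    score = 0.0
    for prev, word in zip(words, words[1:]):
        score += log((bigrams[(prev, word)] + alpha) /
                     (unigrams[prev] + alpha * vocab))
    return score

for candidate in ("how to recognize speech", "how to wreck a nice peach"):
    print(candidate, "->", round(log_prob(candidate), 2))
# "how to recognize speech" comes out more probable on this toy corpus.
```

That's the whole mechanism: brute-force counting plus a probability comparison, nothing resembling comprehension.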
Maybe with enough exposure to how the most common n-grams sound in the real world, with reductions, aspirated /h/ instead of /s/, dropped /d/ and the like... maybe this kind of training would help learners with "speech recognition" in their target language? After all, when we listen to our native language, we're actually doing a great deal of prediction. We fill in the gaps of everything that wasn't properly pronounced.
Yes, and that is the part of language that happens naturally, without requiring much conscious direction.
As others have alluded to in this thread, if you engage with the language you get better at it, and part of that is that your brain naturally adapts to the language you're exposed to. Your brain can do that better when it's dealing with grammatical structures than with bare sequences of word tokens, because it can recognise that "no te vayas a dormir" and "me duermo" are built on the same verb -- something that n-grams over surface word forms simply can't see.
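For what it's worth, here's a tiny illustration of that last point (Python; the lemma table is hand-written just for the demo): the two phrases share no surface n-grams at all, and it takes something structure-aware, even a crude lemma lookup, to link them.

```python
# Surface word n-grams of the two Spanish phrases have zero overlap,
# even though both are built on the verb "dormir(se)".

def word_ngrams(text, n):
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

a = "no te vayas a dormir"   # "don't go to sleep"
b = "me duermo"              # "I'm falling asleep"

for n in (1, 2):
    print(f"{n}-gram overlap:", word_ngrams(a, n) & word_ngrams(b, n))  # empty

# A hand-written lemma lookup (illustrative only) is what it takes to see
# that both phrases revolve around the same verb:
lemma = {"vayas": "ir", "dormir": "dormir", "duermo": "dormir"}
shared = ({lemma[w] for w in a.split() if w in lemma}
          & {lemma[w] for w in b.split() if w in lemma})
print("shared verb lemmas:", shared)   # {'dormir'}
```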