A couple of suggestions to Google Translate from a heavy user

User avatar
Iversen
Black Belt - 4th Dan
Posts: 4782
Joined: Sun Jul 19, 2015 7:36 pm
Location: Denmark
Languages: Monolingual travels in Danish, English, German, Dutch, Swedish, French, Portuguese, Spanish, Catalan, Italian, Romanian and (part time) Esperanto
Ahem, not yet: Norwegian, Afrikaans, Platt, Scots, Russian, Serbian, Bulgarian, Albanian, Greek, Latin, Irish, Indonesian and a few more...
Language Log: viewtopic.php?f=15&t=1027
x 15020

Postby Iversen » Sat Aug 13, 2016 11:19 pm

Slide001_godawdo.jpg

At the Polyglot Gathering in Berlin (May 2016) I gave a talk about Google Translate, but as far as I can see there won't be a wholesale publication of videos of the talks on YouTube this year, so I have decided to write something about the topic here, using some of the illustrations I used in Berlin.

When anybody mentions Google Translate it is mostly to criticize it and make fun of it, and it has to be said that this is fairly easy to do. But consider the facts: it has more or less pushed every other translation system on the internet out of existence, it adds new languages faster than any human learner can keep up with, and some of its translations are actually as good as human-made translations. We have to live with the beast, so my goal here is to discuss a few things that could make it better.

To my mind - and I'm quite serious here - the methods of Google Translate represent a paradigm shift of the same magnitude as the introduction of historical linguistics by the brothers Grimm, Rask and others, the insistence on synchronic and structural language descriptions by people like Saussure, and the Chomskyan revolution that forced linguists to think in terms of deep structure and transformations. It has certainly had an impact on the world of ordinary language users which matches that of academic linguistics, and that in itself is a reason to take it seriously.

My personal reason to take Google Translate deeply seriously is that I use it to make bilingual texts for intensive study - always from a weak target language to a language I know well, never in the other direction. Isn't GT too unreliable for that? Well, maybe, but in my view there is no real alternative. Ideally I would want interlinear bilinguals based on hyperliteral translations, as in the small language guides by Assimil, Kauderwelsch and a number of linguistic monographs, but my favorite sources aren't published with such translations. And literary translations are sometimes written in a way that suggests that the translator would have preferred to write a whole new book instead of translating the old one. Even official translations from the EU, the UN and other organisations are full of arbitrary inconsistencies.

I do not advise anybody to use GT to produce texts in languages they don't know. The reason that I dare use it to produce bilingual texts for closer scrutiny is that I normally can see when it has committed a gross blunder - precisely because I use it to translate from a weak language into a better known one. But to correct an unreliable translation you need to know the destination language well, and then you would spend your time better by writing the text from scratch.

So how does GT translate?

When I was young the internet didn't exist yet, but the idea of machine translation wasn't new: it had been discussed since the 1940s, and a rule- and dictionary-based system called Systran was developed in the late 1960s. When I learnt about Chomsky and his transformational grammar I was certain that this approach would eventually be used for machine translation. Actually I have always had my doubts about its suitability for human language learning, but it seemed to be just the right thing for computers.

So it came as something of a shock when Google around 2007 switched from Systran to something called ‘statistical machine translation’. The central figure in this was a German research scientist named Franz Och, who already before his employment at Google had expressed his belief that statistics in the long run would produce better results than rule-based translation systems.

I imagine that a bunch of Google employees sat down at a table to discuss. Somebody said:

Houston, we've got a problem. Chomsky promised us generative grammars in 1957, but there is not a single one for a complete language yet - not even for English, and it would take us forever to have humans develop them for 117 languages or more PLUS their combinations. And Chomsky told us to write NP VP before even thinking of which words the sentence is going to contain. We are on a collision course with reality.

And then someone, maybe the coming leader Och (until 2014), said:

OK, let's skip formal grammar. If somebody says that two sentences have the same meaning, then they are probably right, and it's not our business to find out what those sentences mean. But there must be parallels between elements in those sentences that mean the same. Skip academic linguistics! Let's start a collection of translations and let statistical algorithms find those parallels.

And that's what they did, even though it was a brave move into the unknown.

And that's why I say that Google Translate has brought about a new paradigm that has precious little to do with anything that went before it. But that doesn't mean that it is perfect.
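
To make the idea of letting statistics find the parallels a bit more concrete, here is a minimal Python sketch. It is only an illustration under loud assumptions: the four sentence pairs are invented, the names `PAIRS`, `dice_table` and `best_match` are hypothetical, and the Dice coefficient used for scoring is a simple stand-in, not the alignment model Google actually uses.

```python
from collections import Counter
from itertools import product

# Tiny invented parallel corpus (French, English); not real training data.
PAIRS = [
    ("le cheval court", "the horse runs"),
    ("le cheval dort", "the horse sleeps"),
    ("le chien dort", "the dog sleeps"),
    ("un chien court", "a dog runs"),
]


def dice_table(pairs):
    """Score every (source word, target word) pair by the Dice
    coefficient of their co-occurrence in aligned sentences."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s_words, t_words = set(src.split()), set(tgt.split())
        src_count.update(s_words)
        tgt_count.update(t_words)
        joint.update(product(s_words, t_words))
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in joint.items()
    }


def best_match(word, scores):
    """Return the target word with the highest score for a source word."""
    candidates = {t: v for (s, t), v in scores.items() if s == word}
    return max(candidates, key=candidates.get)
```

With these toy data `best_match("cheval", dice_table(PAIRS))` picks "horse", although the program has no idea what a horse is. It only knows that the two words tend to turn up in the same sentence pairs.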

Postby Iversen » Sat Aug 13, 2016 11:23 pm

Let's have a look at a non-fictional example, namely the booklet you got in 2015 when buying the Brussels Card. Below you can see a few sentences in the three languages used in this booklet: English, French and Flemish. I don't know which one actually is the original one, but let's add a couple of Google translations based on the English version:


Original English: An extraordinary journey across the worlds' continents and warm seas.The Brussels aquarium contains a fantastic universe of small aquatic species from all over the world.

Original French: Un voyage extraordinaire à travers tous les continents et les mers chaudes. L'Aquarium de Bruxelles présente un univers fantastique de petites espèces aquatiques du monde entier.

Google French: Un voyage extraordinaire à travers les continents du monde et des mers chaudes. L'Aquarium de Bruxelles contient un univers fantastique de petites espèces aquatiques de partout dans le monde..

Original Dutch: Een wonderbaarlijke reis langs de verschillende continenten en zeeën. Het aquarium van Brussels laat u kennismaken met een fascinerende verzameling kleine waterdieren van overal ter wereld.

Google Dutch: Een buitengewone reis over de continenten van de wereld en warme zeeën. Het Aquarium van Brussel bevat een fantastische universum van kleine waterdieren soorten uit de hele wereld.

The average tourist will be well served by any of the human-made translations, but they are not totally parallel. The seas are not warm in the Dutch version, "wonderbaarlijk" is NOT the same as "extraordinary" and "extraordinaire", and "contains" is different from both "présente" and "laat u kennismaken". Besides I would have expected a "van" in front of "kleine", but otherwise nothing in any version indicates that it is different from the two others. If you compare the human-made versions with the machine translations, then the latter generally are more loyal to the original - but they are also more likely to present grammatical and lexical errors.

So far GT has passed the test with flying colours, but let's add a few more languages.

Original English: An extraordinary journey across the worlds' continents and warm seas.The Brussels aquarium contains a fantastic universe of small aquatic species from all over the world.

Google Indonesian: Sebuah perjalanan yang luar biasa di seluruh benua di dunia dan laut yang hangat. Brussels Aquarium berisi semesta fantastis spesies air kecil dari seluruh dunia..

Google Russian: Необыкновенное путешествие по континентам и теплых морей. Брюссельская аквариум содержит фантастический вселенную небольших водных видов со всего мира.

Google Latin: Extraordinarius itinere mundi, orbis, maria, et calida. Brussels continet in Aquarium a fantastic mundo parum aquatilium ex totus super orbis terrarum.


There are a number of incontestable errors here. Some are so gross that you can easily spot them, while others are just as sneaky as those in the human-made translations. But one thing is obvious, namely that Google really tries to be loyal to the original - even when it fails. For instance the seas are warm everywhere, although the Latin version totally bungles the construction of the sentence.

I'm not an expert on Bahasa Indonesia, but the Indonesian version is actually quite good - it even uses the connector element "yang", which doesn't exist in English, and the seas (or sea) are warm.

The Russian version pinpoints one main problem of GT, namely concord. It is a mystery why the mechanisms that lead to concord between Необыкновенное and the neuter путешествие can't rescue фантастический вселенную. And understanding why GT in its present state can commit such errors is central to any effort to improve it.

The Latin version is so bad that it can't even be used as a loose reference. In fact it exemplifies just about every error type I'll discuss later. And the reason? Well, a combination of few and mostly old bilingual sample texts plus a free word order and a very complicated morphology in the language itself, which the statistical methods of Google apparently have been able neither to deduce nor to exploit. The irony of this is that each content word in isolation more often than not is correctly translated.

Postby Iversen » Sat Aug 13, 2016 11:25 pm

The translations that GT proposes are based on parallels between the two versions in bilingual texts. In principle there is no reason that one of them should be in English, and it has been claimed that some languages in fact are translated via a related one (like Catalan through Spanish), but there is a simple test that shows that English in practice functions as intermediary language: simply insert an English word (that doesn't have a homonym) in a text in another language and let Google translate it. In most cases the English word will be seamlessly translated along with the other words in that text. As a general rule this won't work with other languages. So there is a real risk that English language patterns influence the results of the translations, and also that English words sometimes pop up in the middle of a translation into a completely different language.

The English Wikipedia article about GT gives this piece of dubious advice:

Google Translate does not translate from one language to another (L1 → L2). Instead, it often translates first to English and then to the target language (L1 → EN → L2). However, because English, like all human languages, is ambiguous and depends on context, this can cause translation errors. For example, translating vous from French to Russian gives vous → you → ты OR Вы/вы. If Google were using an unambiguous, artificial language as the intermediary, it would be vous → you → Вы/вы OR tu → thou → ты. Such a suffixing of words disambiguates their different meanings.

I write "dubious" because it seems that you have to tweak the original text to produce something that will be translated into correct sentences at the destination. But that means that you have to know what the result should be - and then you could just as well write the correct version yourself.

Let's see how GT right now deals with the second person singular problem:

Vous avez suivi un lien qui n'existe pas encore ---> Вы перешли по ссылке, которая пока не существует
Tu as suivi un lien qui n'existe pas encore ---> Вы перешли по ссылке, которая пока не существует

The second French sentence has "tu", which is second person singular. But as you can see the end result is second person plural in Russian. And this is typical, except with the archaic "thou" in English, which gets translated into ты (which is unequivocally singular). It would be tempting to add a specific rule to solve this problem, but then you would have to study all the relevant language combinations, and that's precisely the thing we shouldn't accept in a mechanism that takes pride in not being rule-based.

The solution could be to add invisible tags to words - in this case something like 2p sing. And where should these tags come from? Well, it may be necessary to add them to the system by hand, but how big is the task? In how many ways can we see in a concrete sentence that the subject is in the second person singular? In the languages I know the information is buried in a few pronouns plus some verbal endings - but they may have to be treated in different ways. For the pronouns I suggest a translation into an intermediary stage, something like "tu" ---> "you 2ps", which would be decoded into Russian as "ты".
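
A minimal Python sketch of such a tagged intermediary stage could look like this. Everything in it is an assumption made for illustration: the dictionaries are hand-made toys, the function names are hypothetical, and the tag "2pp" for "vous" is my own simplification (it ignores the polite singular use). Nothing here describes how GT is actually built.

```python
# Plain pivot: both French pronouns collapse into the same English word,
# so the number distinction is lost before Russian is reached.
FR_TO_PIVOT_PLAIN = {"tu": "you", "vous": "you"}
PIVOT_TO_RU_PLAIN = {"you": "вы"}

# Tagged pivot: a hidden tag travels along with the English word.
# "2ps" comes from the proposal above; "2pp" is a hypothetical counterpart.
FR_TO_PIVOT_TAGGED = {"tu": ("you", "2ps"), "vous": ("you", "2pp")}
PIVOT_TO_RU_TAGGED = {("you", "2ps"): "ты", ("you", "2pp"): "вы"}


def via_plain_pivot(french_pronoun):
    """French -> English -> Russian without any hidden tag."""
    return PIVOT_TO_RU_PLAIN[FR_TO_PIVOT_PLAIN[french_pronoun]]


def via_tagged_pivot(french_pronoun):
    """French -> (English, tag) -> Russian, keeping person and number."""
    return PIVOT_TO_RU_TAGGED[FR_TO_PIVOT_TAGGED[french_pronoun]]
```

In this toy setup `via_plain_pivot("tu")` gives "вы", just like the GT output above, while `via_tagged_pivot("tu")` gives "ты".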

But there is more to it than this (which could be accomplished by adding some 'privileged' translations which after their introduction would be used according to the usual statistical apparatus). We have already seen that Russian concord is quite unreliable, and even this could be solved if the verbal form were chosen on the basis that the subject 'really' was second person singular. But how can this be implemented?

My suggestion is that there should be a separate session where the usual statistical apparatus is used on sentences where the subject is a pronoun provided with the extra tag, which indicates its number, person and case. Then the tag could be adopted by the different verb forms on a fairly secure basis, instead of being discovered more or less randomly - or not at all - and the tag that was transferred from an original text would be used to choose both the correct pronominal form and the correct verbal form.

NB: the same transfer mechanism would function differently with for instance gender and case. You can't expect that a masculine French word has a masculine Russian word as its most likely translation. But if a Russian substantive always came along with a tag showing its gender, then adjectives and past tense verbs could 'read' the tag and assume the same gender. As for case, the situation is slightly more complicated. Modern French substantives don't have cases, but even if you translated from for instance German - where substantives have four cases - into Russian, you couldn't be sure that you should use the same case in both languages, even insofar as both have it. Like gender, case has to be decided on the basis of information from within each language - but it would be a great help if you for instance had a system with case tags attached to prepositions.

How can you get that information? Again, prepositions are a closed word class, and it wouldn't be too much of an effort to add prepositions with tags to the system and then let the usual Bayesian statistical methods do the rest. The main obstacle could be if the solution with hidden tags proved to be unpalatable for the employees at Google.
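
As a rough illustration of how such tags could be 'read', here is a small Python sketch. It is a toy under stated assumptions: the tables are hand-made and cover only a handful of Russian forms, and the names `NOUN_TAGS`, `ADJECTIVE_FORMS`, `PREPOSITION_CASE`, `agree` and `after_preposition` are hypothetical. In a real system the tags would have to be combined with the statistical machinery rather than looked up directly.

```python
# Hypothetical hidden tags: (gender, case) for a few Russian noun forms.
NOUN_TAGS = {
    "вселенную": ("f", "acc"),
    "аквариум": ("m", "nom"),
    "путешествие": ("n", "nom"),
}

# Adjective forms indexed by the tag they have to agree with.
ADJECTIVE_FORMS = {
    "фантастический": {
        ("m", "nom"): "фантастический",
        ("f", "acc"): "фантастическую",
        ("n", "nom"): "фантастическое",
    },
    "брюссельский": {
        ("m", "nom"): "брюссельский",
        ("f", "acc"): "брюссельскую",
        ("n", "nom"): "брюссельское",
    },
}

# Prepositions are a closed class, so a case tag per preposition is feasible.
PREPOSITION_CASE = {"без": "gen", "для": "gen", "к": "dat"}
NOUN_CASE_FORMS = {"друг": {"nom": "друг", "gen": "друга", "dat": "другу"}}


def agree(adjective_lemma, noun_form):
    """Pick the adjective form that matches the noun's hidden tag."""
    tag = NOUN_TAGS[noun_form]
    return f"{ADJECTIVE_FORMS[adjective_lemma][tag]} {noun_form}"


def after_preposition(preposition, noun_lemma):
    """Pick the noun form demanded by the preposition's case tag."""
    case = PREPOSITION_CASE[preposition]
    return f"{preposition} {NOUN_CASE_FORMS[noun_lemma][case]}"
```

With these tables `agree("фантастический", "вселенную")` returns "фантастическую вселенную", which is the form the Russian aquarium example should have had, and `after_preposition("к", "друг")` returns "к другу".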

Postby Iversen » Sat Aug 13, 2016 11:27 pm

Let's leave grammar for a moment and return to a simpler problem: unknown words. And for this I have found the following Icelandic example, where the original text comes from the homepage of Reykjavík Park and Zoo:

Sauðburður er hafinn. (...) Ærin Surtla bar tveimur hrútlömbum og ærin Melkorka bar þremur gimbrum snemma morguns þann 4.maí.
GT: Lambing is underway. (...). Aerials acid bar two ram lambs, and there are clear Melkorka bar three ewe early on 4.maí.
correct: Lambing is underway. (...) The ewe Surtla bore two ram lambs and the ewe Melkorka bore three ewe lambs early in the morning on 4 May.

"Aerials acid bar two ram lambs" should arouse the suspicion of any language user - and it exemplifies the idea that most errors of Google are relatively innocuous precisely because they are too gross to go unnoticed. In fact "ærin" means "the (mother) sheep" - it's an irregular substantive. Surtla is a proper name, as evidenced by the capital letter - it is related to the name of Surt, the fire giant of Old Norse mythology, not to súr (sour) - and as a proper name it should not be translated. And then "bar". In the original it is a form of the irregular verb "bera" - neither a place to have a drink nor an oblong object of some kind.

In my opinion a few more lines of programming plus maybe the addition of a suitable human-made list of Icelandic irregular verbs might result in translations where many of those blatant errors are eliminated. It may be against the spirit of Google or be too costly, but it would be just perfect if the Google people added a complete Icelandic - English dictionary with genders and word classes to the data storage system of GT. If the dictionary entries entered the system with a weight that made a real difference to the statistical system that produces the translations, then we would in one fell swoop get rid of most of the errors that are caused by missing word definitions - and we would do it without sacrificing the statistical system, since the intervention would take place at the stage where raw data are fed into the system.
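
The idea of dictionary entries with a weight can be sketched in a few lines of Python. This is only a sketch under assumptions: the corpus counts are invented, the weight of 10 is arbitrary, and `pick_translation`, `CORPUS_COUNTS` and `DICTIONARY` are hypothetical names, not part of any real system.

```python
from collections import Counter

# Invented counts of how often a translation was seen in bilingual texts.
CORPUS_COUNTS = {
    "bar": {"bar": 3},   # the word was mostly left as it stands
    "ærin": {},          # never seen in the bilingual data
}

# Hand-made dictionary entries that are fed in as extra raw data.
DICTIONARY = {
    "bar": ["bore"],
    "ærin": ["the ewe"],
}


def pick_translation(corpus_counts, dictionary_entries, dictionary_weight):
    """Add the dictionary entries to the corpus counts with a fixed
    weight and return the candidate with the highest total."""
    scores = Counter(corpus_counts)
    for entry in dictionary_entries:
        scores[entry] += dictionary_weight
    if not scores:
        return None  # nothing known at all
    return scores.most_common(1)[0][0]
```

With a weight of 10 the dictionary wins for "bar" (`pick_translation(CORPUS_COUNTS["bar"], DICTIONARY["bar"], 10)` gives "bore"), while a weight of 1 would leave the statistical choice untouched. So the weight decides how much the human-made data are allowed to matter.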

Right now GT seems to have at least three different reactions to unknown words: skip them, leave them untranslated or invent a translation. And one class of idiocies in particular has to be eradicated, namely the substitution of one city or currency or some other entity by another which is more commonly used in the destination language. As in this example, where the Malagasy original and the French translation are taken from the textbook "Parler Malgache", while the English version was provided by Google on a bad day:

"Parler Malgache": tia sakafo malagasy aho.
"Parler Malgache": j'aime la nourriture malgache
Google: I love Thai food

A good dictionary might have translations for some of these things - like "Leghorn" in English for the Italian town Livorno - and any proper name that didn't appear among these should be left untranslated. And I mean untranslated, not skipped.
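
The rule "untranslated, not skipped" is easy to state in code. The following Python sketch is a word-for-word toy and nothing more: the lexicon covers only part of the Icelandic example above, and `EXONYMS`, `LEXICON` and `translate_tokens` are hypothetical names.

```python
# Names that do have an established translation.
EXONYMS = {"Livorno": "Leghorn"}

# Tiny hand-made lexicon for the Icelandic example (illustration only).
LEXICON = {
    "ærin": "the ewe",
    "bar": "bore",
    "tveimur": "two",
    "hrútlömbum": "ram lambs",
}


def translate_tokens(sentence):
    """Translate word by word. Unknown tokens, including proper names,
    are copied unchanged. They are never dropped and never replaced."""
    output = []
    for token in sentence.split():
        if token in EXONYMS:
            output.append(EXONYMS[token])
        elif token.lower() in LEXICON:
            output.append(LEXICON[token.lower()])
        else:
            output.append(token)
    return " ".join(output)
```

Here `translate_tokens("Ærin Surtla bar tveimur hrútlömbum")` returns "the ewe Surtla bore two ram lambs": Surtla is not in any list, so she simply passes through.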

One word class that for unknown reasons often is skipped is the negations - which results in sentences that have the opposite meaning of the intended one:

És un greuge molt gran que se’ns fa: no se’ns deixa escollir lliurament com volem cuidar i educar els nostres infants. (quote: Joan M. Girona)
Google: It is an offense that makes us great: it lets us choose how we care delivery and educate our children.
Literal: it is a very great injustice that is done to us: (one) doesn't let us choose freely how we want to care for and educate our children.

No word should be skipped, and "no" here shouldn't even be unknown to GT - so why didn't it make it to the translation?
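
A dropped negation is at least something that could be checked mechanically after the translation. Here is a minimal Python sketch of such a sanity check. The negation lists are short hand-made assumptions, the function names are hypothetical, and a check of this kind can only raise a flag, not repair the translation.

```python
import re

# Hypothetical, far from complete lists of negation words per language.
NEGATIONS = {
    "ca": {"no", "mai"},
    "en": {"no", "not", "never"},
}


def tokens(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[\w']+", text.lower())


def has_negation(text, language):
    """True if the text contains a negation word (or English n't)."""
    return any(
        t in NEGATIONS[language] or (language == "en" and t.endswith("n't"))
        for t in tokens(text)
    )


def negation_lost(source, source_lang, target, target_lang):
    """Flag translations where the source is negated but the target is not."""
    return has_negation(source, source_lang) and not has_negation(target, target_lang)
```

Applied to the Catalan sentence and the Google output above, `negation_lost(...)` returns True, whereas a version containing "doesn't" passes the check.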

Postby Iversen » Sat Aug 13, 2016 11:30 pm

The parallels which the secret software discovers aren't limited to single words - if two expressions in different languages often appear as translations of each other then they may be added to the repertoire of GT. But to achieve this both expressions have to be quite common, which explains why some expressions are translated as wholes, while others are translated word by word - as in this example:

Slide025_Bilingual-no-cow-on-the-ice.jpg

As I mentioned earlier, the production of bilingual texts is definitely the main reason for my use of Google Translate. And this use of the system differs significantly from the one where you want it to produce pretty, idiomatic phrases in the language you are translating to. As I have written before, most translations are made for people who can't read the original texts and aren't interested in learning how to do it. For them the ideal would be to get the meaning, even if this means that a passage has to be completely reformulated. For me as a language learner the literal meanings of the components of an idiomatic expression are as important as the 'added' meaning of the whole thing.

This means that I'm more tolerant of unidiomatic word-for-word translations than those who judge the translations according to their qualities as utterances in the translation language. Ultimately you will of course need to learn the loose, purely semantically motivated idiomatic expressions of your target language and which expressions they can be coupled to in your own language, but I see the first and most important task as telling people what the speakers of the target language really say - that will make it easier to learn the 'added', unpredictable idiomatic meaning afterwards. So if Google gives me a word-for-word translation, I'm not in tears.

In the example above, the Danish expression to the left means "there is no need to worry", but if you don't speak Danish or Swedish then you would not have guessed it from the English translation. But precisely because it is such a weird expression it is easy to remember the intended meaning, as long as somebody tells you what it is. And I would like GT to show the 'global' meaning in the form of a parallel expression if it has found one - but even then I would also like to be shown the word-by-word translation, for instance in the small drop-down menu with alternative translations.
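
What I have in mind can be sketched like this in Python. It is a toy under assumptions: the word list and the phrase table are hand-made, the count attached to the idiom is invented, and `translate` with its `min_count` threshold is a hypothetical stand-in for whatever frequency criterion the real system uses.

```python
# Word-for-word lexicon for the Danish idiom (illustration only).
WORDS = {
    "der": "there", "er": "is", "ingen": "no",
    "ko": "cow", "på": "on", "isen": "the ice",
}

# Phrase table: whole expression -> (idiomatic translation, invented count).
PHRASES = {
    ("der", "er", "ingen", "ko", "på", "isen"): ("there is no need to worry", 3),
}


def translate(sentence, min_count):
    """Return both the literal rendering and, if the expression has been
    seen at least min_count times, the idiomatic one."""
    tokens = tuple(sentence.lower().split())
    literal = " ".join(WORDS.get(t, t) for t in tokens)
    entry = PHRASES.get(tokens)
    idiomatic = entry[0] if entry and entry[1] >= min_count else None
    return {"literal": literal, "idiomatic": idiomatic}
```

`translate("Der er ingen ko på isen", 2)` returns both "there is no cow on the ice" and "there is no need to worry". With a threshold of 10 the expression counts as too rare, and only the literal version is left.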

Postby Iversen » Sat Aug 13, 2016 11:31 pm

Finally I would like to mention the problems with word order. Since computers aren't good at pattern recognition these problems are much harder to solve without human intervention than the simple identification of potential single-word translations. That being said, I have problems understanding how examples like the following can occur:

physics4u.gr/articles
Το βάρος του Homo habilis έφτανε τα 45 κιλά περίπου, το ύψος του τα 1,5 μέτρα, άτριχος και αρκετά μελαμψός

Google: The weight of Homo habilis reaches approximately 45 kg, height of 1,5 meters, and quite hairless melampsos.

corrected: The weight of Homo habilis reaches approximately 45 kg, the height 1,5 meters, hairless and quite dark

Google should simply let things that stand before the 'kai' stay there, and things that stand after it stay there. The thing I don't understand is why the algorithms change the word order in the first place - "και" in "άτριχος και αρκετά μελαμψός" already stands in the correct position (the same as "and" in the corrected version above) so why change it? It is as if somebody had introduced some word order rules into the system and now they malfunction, but no one knows how to remove them without the whole system being compromised.
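
Respecting the border marked by a conjunction could be sketched as follows in Python. This is only an illustration under assumptions: the lexicon is a hand-made toy, the names are hypothetical, and `translate_chunk` stands for any translation routine, which is free to reorder words inside a chunk but never gets to see across the conjunction.

```python
# Toy lexicon for the end of the Greek example (illustration only).
LEXICON = {"άτριχος": "hairless", "αρκετά": "quite", "μελαμψός": "dark"}
CONJUNCTIONS = {"και": "and"}


def word_for_word(chunk):
    """A trivial chunk translator; a real one might reorder the chunk."""
    return [LEXICON.get(word, word) for word in chunk]


def translate_respecting_conjunctions(tokens, translate_chunk, conjunctions):
    """Translate the stretches between conjunctions separately, so that
    nothing can move from one side of a conjunction to the other."""
    output, chunk = [], []
    for token in tokens:
        if token in conjunctions:
            output.extend(translate_chunk(chunk))
            output.append(conjunctions[token])
            chunk = []
        else:
            chunk.append(token)
    output.extend(translate_chunk(chunk))
    return output
```

For the tokens "άτριχος και αρκετά μελαμψός" this gives "hairless and quite dark". Whatever the chunk translator does, "quite" cannot end up in front of "hairless".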

But sometimes word order has to be changed. How is this done? Again I suspect that somebody added some rules, and I have an example that could support that suspicion:

inismagazine.ie/features/entry/bearla-na-leabhar
Pléann Anna Heussaff roinnt de na saincheisteanna a bhaineann le haistriúchán ar litríocht na n-óg.

Google: Irish to English:
Anna Heussaff discusses some of the issues related to the translation of youth literature.

Google: English to Irish:
Anna Heussaff Pléann cuid de na saincheisteanna a bhaineann le aistriúchán ar litríocht óige.


If Google can move Anna Heussaff to the first position in the English sentence, why can't another rule then move her back to the second position in the Irish sentence? The change from Irish to English is due to the rule that says that there should be a subject in any English sentence, and that it should be put in front of the verb. But it can't be statistical analysis that led to this result - otherwise the same system would have discovered that practically all Irish sentences have the verb in the initial position! Only conjunctions and some verbal particles can normally stand in front of the verb. My guess is that somebody wrote a rule for English, but didn't bother to produce one for Irish.

In German I have found that finite verbs mostly end up at the end of subordinate clauses, as they should, but there are cases where this rule isn't observed.

To explain word order you would resort to identification rules for the parts in sentences and other types of phrases in traditional grammar. But precisely these notions are taboo in a statistically based system, so it is hard to see how to solve the problems and do the necessary transformations in such a system. But the addition of word class and case markers which I proposed earlier would be a great step towards a solution where the necessary rules could be identified through statistical analysis.

User avatar
tarvos
Black Belt - 2nd Dan
Posts: 2889
Joined: Sun Jul 26, 2015 11:13 am
Location: The Lowlands
Languages: Native: NL, EN
Professional: ES, RU
Speak well: DE, FR, RO, EO, SV
Speak reasonably: IT, ZH, PT, NO, EL, CZ
Need improvement: PO, IS, HE, JP, KO, HU, FI
Passive: AF, DK, LAT
Dabbled in: BRT, ZH (SH), BG, EUS, ZH (CAN), and a whole lot more.
Language Log: http://how-to-learn-any-language.com/fo ... PN=1&TPN=1
x 6094

Postby tarvos » Sun Aug 14, 2016 9:31 am

Iversen wrote: Besides I would have expected a "van" in front of "kleine", but otherwise nothing in any version indicates that is different from the two others. If you compare the human-made versions with the machine translations, then the latter generally are more loyal to the original - but they are also more likely to present grammatical and lexical errors.


You don't need a van actually. That's an English construction - Dutch routinely disposes of this and will say things like "een verzameling stenen", "een hoop dieren", "een veld bloemen" etc.

Cainntear
Black Belt - 3rd Dan
Posts: 3527
Joined: Thu Jul 30, 2015 11:04 am
Location: Scotland
Languages: English(N)
Advanced: French,Spanish, Scottish Gaelic
Intermediate: Italian, Catalan, Corsican
Basic: Welsh
Dabbling: Polish, Russian etc
x 8794

Postby Cainntear » Sun Aug 14, 2016 9:50 am

First up, I'll say this: your usage of Google Translate is atypical, in that you are doing exactly what it was designed for. Machine translation should only ever be used by the recipient of a text in order to understand it (I wish they'd called it "Google Gist" or something...)

The reason that GT gets slagged off so much is that most people don't know this, and end up using it to translate an email before sending or some equally bad idea. (I remember once trying to decipher a GT email from the owner of a self-catering cottage in Galicia for a friend. Typos and missing spaces had messed up the translation and the original Spanish was not available. Lots of fun.)

Iversen wrote: But sometimes word order has to be changed. How does this happen? Again I suspect that somebody added some rules, and I have an example that could support that suspicion:

inismagazine.ie/features/entry/bearla-na-leabhar
Pléann Anna Heussaff roinnt de na saincheisteanna a bhaineann le haistriúchán ar litríocht na n-óg.

Google: Irish to English:
Anna Heussaff discusses some of the issues related to the translation of youth literature.

Google: English to Irish:
Anna Heussaff Pléann cuid de na saincheisteanna a bhaineann le aistriúchán ar litríocht óige.


If Google can move Anna Heussaff to the first position in the English sentence, why can't another rule then move her back to the second position in the Irish sentence? The change from Irish to English is due to the rule that says that there should be a subject in any English sentence, and that it should be put in front of the verb. But it can't be statistical analysis that led to this result - otherwise the same system would have discovered that practically all Irish sentences have the verb in the initial position! Only conjunctions and some verbal particles can normally stand in front of the verb. My guess is that somebody wrote a rule for English, but didn't have time to produce one for Irish.

I don't think that's the case. In the early days of statistical translation, word order errors affected practically every language pair. This was because the systems hadn't seen enough material to generalise the rules and were really only working on chunks of language that were seen in the bilingual database. The breakthrough that made Google Translate what it is today was that somebody realised that the statistical engine didn't need to be based entirely on a bilingual database and started feeding monolingual material in as well. In this way, the English language engine has a statistical model that could (if the programmers asked) take an ill-formed sentence and rejig it to create a well-formed one. The problem is that the ill-formed sentence may have incomplete semantics, and so the well-formed sentence won't necessarily be a correct translation.
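
To give an idea of how a purely monolingual model can rejig word order, here is a minimal Python sketch. It is a toy under assumptions: the three-sentence corpus is invented, the names are hypothetical, and summing raw bigram counts is a crude stand-in for the probability models a real system would use.

```python
from collections import Counter
from itertools import permutations

# Tiny invented monolingual corpus (English only, no translations).
CORPUS = [
    "anna discusses the issues",
    "anna reads the book",
    "the author discusses the book",
]


def bigram_counts(sentences):
    """Count adjacent word pairs, with markers for sentence start and end."""
    counts = Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        counts.update(zip(words, words[1:]))
    return counts


def best_order(words, counts):
    """Try every ordering of the words and keep the one whose bigrams
    were seen most often in the monolingual corpus."""
    def score(order):
        sequence = ("<s>",) + order + ("</s>",)
        return sum(counts[pair] for pair in zip(sequence, sequence[1:]))
    return max(permutations(words), key=score)
```

Given the scrambled words "discusses anna the issues", `best_order` returns them as "anna discusses the issues" without any rule about subjects and verbs. But it only knows which order looks like English. It cannot know whether the reordered sentence still says what the original said, and with as little data as in the Irish case the counts would be too sparse to decide anything.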

English is GT's strongest language, because the engine has been fed with more data than any other language. The difference between the size of the Irish and English databases is astronomical -- compare them numerically and you'll lose count of all the zeros. The Irish language engine does not have enough data to create a model of the language, so it cannot correct word order. The results for Irish are about as good as the results I saw in English<->French in the early days of statistical translation before the introduction of monolingual materials.

It's this monolingual database model that leads to weird things like cuisines having their nationality altered. When something isn't translated often or even at all (and in English "malagache" food is so uncommon that I'm not sure what the appropriate adjective in English is) then the statistical engine looks for equivalents based on context, frequency etc. Clearly, there is some similarity in how Francophone culture sees food from Malaysia and Anglophone culture sees Thai food. GT's generalisation here is wrong, but it's almost an idiomatic translation, and if GT didn't make errors like this, it would be unable to correctly deal with idiomatic translations.

You could be right -- they may have moved to a hybrid of statistical and rule-based translation without telling anyone, but nothing you've presented is inconsistent with pure statistical translation.

Re: A couple of suggestions to Google Translate from a heavy user

Postby Iversen » Sun Aug 14, 2016 11:44 am

The introduction of monolingual source texts doesn't lend itself as readily to statistical methods as the bilingual texts do UNLESS you also introduce some kind of grammatical analysis function - which is one step away from reintroducing grammatical rules, although not necessarily in the form they have in classical grammar.

If the system notices that the word "cheval" often occurs in French sentences where the word "horse" occurs in the corresponding English sentences then it should tentatively assume that they mean the same thing (but GT doesn't care what kind of thing that is - except insofar as it can discover some other words that also tend to appear more frequently when one of the two is present). And it may in principle be within the capabilities of a statistical machine to make a list of possible conjunctions based on their position in sentences, but to really do even a simple task like this you need to be able to find the borders between phrases. The frequency of errors even with the comparatively simple coordinating conjunctions shows that this is a problem for Google. If you remember the mistranslation from "An extraordinary journey across the worlds' continents and warm seas" into Latin "Extraordinarius itinere mundi, orbis, maria, et calida" then GT could just have kept "(the) worlds' continents" and "warm seas" together, separated by "and" during the translation process, and the result would have been OK. So what made it fiddle around with the word order in the first place, and when it did (because it might have to invert the order of substantives and adjectives), why didn't it at least respect the border marked by the conjunction? Probably because it hasn't been told what substantives and adjectives and conjunctions are, and therefore it can't understand the structure of a typical nominal phrase, and because of that it doesn't respect its borders.
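
The "cheval"/"horse" reasoning can be sketched in a few lines. This is only a toy under my own assumptions (an invented four-sentence parallel corpus and the Dice coefficient as the association measure); real systems use far more elaborate alignment models, and nothing here is claimed to be Google's method.

```python
from collections import Counter

# Invented French/English sentence pairs, for illustration only.
PARALLEL = [
    ("le cheval mange", "the horse eats"),
    ("le cheval dort", "the horse sleeps"),
    ("le chien mange", "the dog eats"),
    ("un cheval blanc", "a white horse"),
]

def dice_table(pairs):
    """Dice coefficient for every source/target word pair that co-occurs."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s_words, t_words = set(src.split()), set(tgt.split())
        src_count.update(s_words)
        tgt_count.update(t_words)
        joint.update((s, t) for s in s_words for t in t_words)
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in joint.items()
    }

def best_match(word, table):
    """Target word most strongly associated with the given source word."""
    candidates = {t: score for (s, t), score in table.items() if s == word}
    return max(candidates, key=candidates.get)

table = dice_table(PARALLEL)
print(best_match("cheval", table), best_match("chien", table))
```

Note that the sketch treats each sentence as an unordered bag of words. It can pair up words, but it knows nothing about phrase borders, which is the weakness discussed above.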

It is pretty clear that this is basically a question of giving the thing the tools it needs to discover generative patterns, and I believe that it would help to tell the beast about word classes and cases and other morphological distinctions, combined with a rule that says "leave word order alone until you know how to manipulate it properly".

The unwanted substitutions can be minimized if only GT learns to respect proper names UNLESS it has been informed of a proper translation - and as Cainntear rightfully points out this last thing can't be done purely by performing statistical analyses within each of the two languages. But in many languages proper names can be identified through the convention that they start with a capital letter (although this wouldn't help in German), and otherwise it is a thing that could be stuffed down the throat of the system by providing it with a dictionary that indicates word classes. Or maybe the index of a world map and a couple of other thematic lists. Adjectival derivations might still be a problem, and other culture-bound words like institution names, currencies and names of TV programs would also have to be put on a black list of words that should never be substituted and only translated if they can be checked against a green list of verified translations. But this markup should all happen at the input level, and if the correct information gets a sufficiently high statistical weight then the existing transformation rules should come up with less error-ridden translations.
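
As a rough illustration of markup at the input level, the sketch below masks capitalised words with placeholders before translation and puts them back afterwards, consulting a green list of verified translations. Everything in it is hypothetical: the function names, the placeholder format and the two-entry green list are my own, and the capital-letter heuristic has the weaknesses already mentioned (German nouns, sentence-initial words).

```python
# Hypothetical green list of verified translations (French -> English).
GREEN_LIST = {"Londres": "London", "Allemagne": "Germany"}

def protect_names(sentence, green_list):
    """Replace capitalised, non-initial words by placeholders.

    Returns the masked sentence and a table mapping each placeholder to
    the text that should appear in the output: the verified translation
    if the name is on the green list, otherwise the name unchanged.
    """
    protected = {}
    out = []
    for i, word in enumerate(sentence.split()):
        core = word.strip(".,;:!?")
        if i > 0 and core[:1].isupper():
            key = f"__NAME{len(protected)}__"
            protected[key] = green_list.get(core, core)
            out.append(word.replace(core, key))
        else:
            out.append(word)
    return " ".join(out), protected

def restore_names(text, protected):
    """Put the protected names back into the translated text."""
    for key, value in protected.items():
        text = text.replace(key, value)
    return text

masked, names = protect_names("Je vais de Londres à Antananarivo.", GREEN_LIST)
print(masked)
# Stand-in for the output of a translation engine that left the placeholders alone.
print(restore_names("I go from __NAME0__ to __NAME1__.", names))
```

The point of the design is that the translation engine never sees the name at all, so it has no chance to swap it for something statistically similar.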

Does this affect the ability to identify idiomatic expressions? No, not really. It is still possible to perform monolingual analyses that identify the cases you can expect after specific prepositions or the typical prepositions that follow specific verbs so finding fixed expressions won't be a problem. But since GT doesn't understand the meaning of anything directly it is only possible to identify corresponding expressions in different languages if somebody tells the system that they effectively mean the same thing, or if they occur frequently enough in tandem in bilingual texts to be accepted by the system after a purely statistical analysis.
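
A monolingual analysis of the kind mentioned above might look like this sketch, which simply counts which preposition follows which word. The corpus and the preposition list are invented for the example, and note that the list of prepositions has to be supplied by hand, which is precisely the kind of word class information I think the system needs.

```python
from collections import Counter, defaultdict

# Invented English sentences, for illustration only.
CORPUS = [
    "we depend on the weather",
    "they depend on him",
    "i believe in ghosts",
    "you believe in luck",
    "we depend upon them",
]

# Supplied by hand: the counting itself cannot know what a preposition is.
PREPOSITIONS = {"on", "in", "upon", "at", "to"}

def preposition_profile(sentences, prepositions):
    """For each word, count the prepositions that immediately follow it.

    The sketch does not check that the preceding word is a verb; it
    records whatever word stands directly before a preposition.
    """
    profile = defaultdict(Counter)
    for s in sentences:
        tokens = s.split()
        for word, nxt in zip(tokens, tokens[1:]):
            if nxt in prepositions:
                profile[word][nxt] += 1
    return profile

profile = preposition_profile(CORPUS, PREPOSITIONS)
print(profile["depend"].most_common())
print(profile["believe"].most_common())
```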

Re: A couple of suggestions to Google Translate from a heavy user

Postby Iversen » Wed Aug 17, 2016 3:17 pm

I can't just stop thinking about GT because I have reached the end of the things I had to say in Berlin so here are some more experiments:

Some languages have words that indicate questions, others use inversion and English may use an auxiliary verb to avoid the inversion. How does GT treat these differences? I have omitted all question marks in order to make certain that GT doesn't use them, but I don't really see any evidence that it does.

Let's start out with Esperanto:

ĉu --> whether
ĉu ekzistas (/estas) pli en la sako --> whether there is (/ are) more in the bag
ĉu GT povas vidi ke tiu estas demando --> either GT can see that this is a question
ĉu la Luno estas farita el verda fromaĝo --> whether the moon is made of green cheese
ĉu GT skribis tion --> either GT wrote it

In a subordinate clause "whether" would be an excellent translation - but not in main clauses as here. And there is simply no excuse for proposing "either". It is clear that the translation of "ĉu" as "whether" is used way beyond the situations where it is correct, but GT has apparently not been able to see that "whether" is absent from English main clauses - maybe because there isn't really one single word that is used in its stead. The most logical solution would have been to use inversion: "Is there more in the bag (or not)", "Can GT see that this is a question", "Is the Moon made of green cheese".

Danish --> English
Mon ---> Mon
Mon der er mere i posen ---> Surely there is more in the bag
Mon GT kan se at det her er et spørgsmål ---> Mon GT can see that this is a question
Mon Månen er lavet af grøn ost ---> Mon moon is made of green cheese
Mon GT skrev dette ---> Mon GT wrote this

The Danish word "mon" (which actually is a remnant of an old modal verb "at monne") clearly gives GT some problems. It is essentially a question marker like "ĉu", but also contains the information that you are in doubt about something. Mostly it is left untranslated - in spite of the fact that the only English word that looks the part is the name of an Asian tribe - but in all cases the alternatives given contain something like "I wonder whether" - and that would actually be the best possible translation! Why wasn't it chosen by the algorithms?

The translation of no. two is patently wrong - it is NOT sure that there is more in the bag - but several alternatives with "I wonder" are offered, and it is hard to see why precisely the one really ghastly alternative is chosen. Btw this experiment reminds me of a good old English word that without being interrogative in nature does hint at the doubt which is implied in the Danish sentences with "mon", namely the excellent term "methinks". Or you could use a normal question with an extra "by chance" inserted somewhere. But leaving a common word like "mon" untranslated just because English doesn't have a precise counterpart is unpardonable.

Second BTW: I tried to translate the Danish examples into Esperanto to see what would happen, and the result was a disaster:

Mon ---> mon
Mon der er mere i posen ---> Certe estas pli en la sako
Mon GT kan se at det her er et spørgsmål ---> Mon GT povas vidi ke tiu estas demando
Mon Månen er lavet af grøn ost ---> Mon luno estas farita el verda fromaĝo
Mon GT skrev dette ---> Mon GT verkis

You could use "ĉu" in all these cases, although it wouldn't have the ring of doubt and second thoughts that the Danish "mon" implies. But it is clear that the vicissitudes of the bad English translations above are carried on directly into the Esperanto versions - and that's worse.

Finally I made one more series with "mon" and English, but this time I placed "mon" later in the Danish questions, and then there is inversion in the Danish examples. But it doesn't help - with the possible exception of no. 1 the translations are still terrible:

Er der mon mere i posen ---> Is there more in the bag
Kan GT mon se at det her er et spørgsmål ---> Can GT mon see that this is a question
Er Månen mon lavet af grøn ost ---> Is Moon mon made of green cheese
Skrev GT mon dette ---> Wrote GT wonder this