Two articles about software localization

Postby **emk** » Wed Apr 17, 2019 2:29 pm

(This is slightly off-topic for anybody but professional software translators and programmers, but I thought some people here might find it interesting!)

Software localization is the process of translating the human-visible text in a software program. (It also includes fixing number and date formats, and other things along those lines.)

As you might guess, the problem is that languages are tricky. In this older article, the linguists Sean M. Burke and Jordan Lachler show all the headaches involved in translating the following bits of text:

I scanned 12 directories.

Your query matched 10 files in 4 directories.

I thought it was a pretty amusing essay:

Then your Russian translator calls on the phone, to personally tell you the bad news about how really unpleasant your life is about to become:

...

He elaborates: In "I scanned %g directories", you'd expect "directories" to be in the accusative case (since it is the direct object in the sentence) and the plural number, except where $directory_count is 1, then you'd expect the singular, of course. Just like Latin or German.

But! Where $directory_count % 10 is 1 ("%" for modulo, remember), assuming $directory_count is an integer, and except where $directory_count % 100 is 11, "directories" is forced to become grammatically singular, which means it gets the ending for the accusative singular...

You begin to visualize the code it'd take to test for the problem so far, and still work for Chinese and Arabic and Italian, and how many gettext items that'd take, but he keeps going...

But where $directory_count % 10 is 2, 3, or 4 (except where $directory_count % 100 is 12, 13, or 14), the word for "directories" is forced to be genitive singular -- which means another ending...

The room begins to spin around you, slowly at first...

But with all other integer values, since "directory" is an inanimate noun, when preceded by a number and in the nominative or accusative cases (as it is here, just your luck!), it does stay plural, but it is forced into the genitive case -- yet another ending...

And you never hear him get to the part about how you're going to run into similar (but maybe subtly different) problems with other Slavic languages like Polish, because the floor comes up to meet you, and you fade into unconsciousness.
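For the programmers reading along, the rule the translator describes can be sketched in a few lines of Python. This is only an illustration under some assumptions, not code from the article: the function name and the category labels "one", "few" and "many" are my own choices (the labels follow the naming commonly used for plural categories), and I picked "каталог" as the Russian word for "directory" just to have something concrete to inflect. It only handles non-negative integers, as the quoted passage does.

```python
def russian_plural_category(n: int) -> str:
    """Pick the noun form a Russian count requires (non-negative integers only)."""
    # Ends in 1 but not in 11: the noun is forced to be grammatically singular.
    if n % 10 == 1 and n % 100 != 11:
        return "one"
    # Ends in 2, 3 or 4 but not in 12, 13 or 14: genitive singular.
    if n % 10 in (2, 3, 4) and n % 100 not in (12, 13, 14):
        return "few"
    # Every other integer: genitive plural.
    return "many"


# Illustrative noun forms (an assumption, not taken from the article).
DIRECTORY_FORMS = {
    "one": "каталог",     # accusative singular
    "few": "каталога",    # genitive singular
    "many": "каталогов",  # genitive plural
}


def count_directories(n: int) -> str:
    """Return the count followed by the matching form of the noun."""
    return f"{n} {DIRECTORY_FORMS[russian_plural_category(n)]}"


if __name__ == "__main__":
    for n in (1, 2, 5, 11, 12, 21, 22, 112):
        print(count_directories(n))
```

Note that three forms are needed for a single English plural, and the test depends on the last two digits of the number, which is why a simple "singular or plural" switch in the source code cannot cover this case.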

(Thank you to zoul who linked to this article on another site.)

Mozilla has just released a specification for a new translation format that tries to address these issues. An example:

This is designed to support grammatically correct translations for a wide range of languages, including languages with really unusual grammatical constraints. And it's designed to work even when the original software authors don't understand Russian grammar.

Postby **eido** » Wed Apr 17, 2019 2:56 pm

I don't know if you necessarily have to be a developer to understand. I've only ever messed with the HTML in Tumblr themes and I understood.

It's fascinating they came up with a code for that.

Postby **zenmonkey** » Wed Apr 24, 2019 11:15 am

That's a cute read. Thanks!

But it's also part of the struggle to get quality localisations. I've turned down more than one agency that tried to sell me on just "how complex" localisation was and that I wouldn't be able to get our products out to the right audience without them.

Whistles and bells, sometimes.

Edit: Looking at the Fluent project, if I understand correctly, this is definitely NOT the way to get a mid- to long-term solution, since each message string needs to be grokked into complex grammatical rules for multiple languages. I would suspect that the set of inputs with variables and the resulting outputs of the 13,000 localised phrases can be analysed for a set of rules across languages that would identify where a system needs to be disambiguated from a source language versus where the machine translation is precise.

"...see her 3 photos."

One probably should not need to qualify that in this type of phrase photos are possessed objects of "her" versus

"... see her in 3 photos" where here "her" is object seen.

to German

"... siehst du ihre 3 Fotos" vs "... siehst du sie auf 3 Fotos" as an example. Google Translate does this well enough; having to explicitly define it in the source phrase to be translated seems like a step in the wrong direction (except where the source language is ambiguous).

Postby **Iversen** » Wed Apr 24, 2019 3:22 pm

I haven't studied this at the rock bottom software level, but I once discussed how some errors in Google Translate could be avoided if words were tagged with morphological markers which could then be exploited by the automatic discovery processes. The people behind Google Translate concluded long ago that writing down the grammatical rules for scores of languages by hand would be an impossible task in the long run - there are simply not enough linguists in the world to do it. So the best developers can do now is to give the automatic processes some better tools to do the necessary mapping.

And I still don't understand why there can be unknown words in a program like Google Translate when it could just have 'read' a couple of dictionaries and culled the missing words from there.

Or maybe you are speaking about something totally different ...
