Machine Translation

General discussion about learning languages
lichtrausch
Blue Belt
Posts: 511
Joined: Thu Jul 23, 2015 3:21 pm
Languages: English (N), German, Japanese, Mandarin, Korean
x 1381

Machine Translation

Postby lichtrausch » Mon May 16, 2022 2:13 pm

You might have noticed some new languages on Google Translate such as Quechua and Lingala. This is the backstory.

Unlocking Zero-Resource Machine Translation to Support New Languages in Google Translate
Machine translation (MT) technology has made significant advances in recent years, as deep learning has been integrated with natural language processing (NLP). Performance on research benchmarks like WMT has soared, and translation services have improved in quality and expanded to include new languages. Nevertheless, while existing translation services cover languages spoken by the majority of people worldwide, they include only around 100 languages in total, just over 1% of those actively spoken globally. Moreover, the languages that are currently represented are overwhelmingly European, largely overlooking regions of high linguistic diversity, like Africa and the Americas.

There are two key bottlenecks to building functioning translation models for the long tail of languages. The first arises from data scarcity; digitized data for many languages is limited and can be difficult to find on the web due to quality issues with Language Identification (LangID) models. The second arises from modeling limitations; MT models usually train on large amounts of parallel (translated) text, but without such data, models must learn to translate from limited amounts of monolingual text alone, which is a novel area of research. Both of these challenges need to be addressed for translation models to reach sufficient quality.
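
To make the LangID bottleneck concrete, here is a minimal sketch of confidence-thresholded filtering for assembling a monolingual corpus from crawled text. Everything in it is a stand-in: predict_language is a toy stub for a real sentence-level LangID model, and the 0.9 threshold is an illustrative assumption, not a figure from the paper.

```python
# Sketch: confidence-thresholded LangID filtering for building a
# monolingual corpus from crawled web text.

def predict_language(sentence):
    """Toy stand-in for a real sentence-level LangID model, which would
    return (language_code, confidence) for each input."""
    english_markers = {"the", "is", "and", "this"}
    if set(sentence.lower().split()) & english_markers:
        return "eng", 0.95
    return "quy", 0.95  # pretend everything else is Quechua

def filter_monolingual(sentences, target_lang, min_confidence=0.9):
    """Keep sentences confidently assigned to target_lang; low-confidence
    predictions are a major source of corpus noise for long-tail languages."""
    kept = []
    for s in sentences:
        lang, conf = predict_language(s)
        if lang == target_lang and conf >= min_confidence:
            kept.append(s)
    return kept

crawled = ["Imaynallam kachkanki?", "This sentence is English noise."]
print(filter_monolingual(crawled, target_lang="quy"))
# -> ['Imaynallam kachkanki?']
```

The "novel filtering approaches" the post mentions go further than a single threshold, but confidence-based filtering is the core shape of the step.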

In “Building Machine Translation Systems for the Next Thousand Languages”, we describe how to build high-quality monolingual datasets for over a thousand languages that do not have translation datasets available and demonstrate how one can use monolingual data alone to train MT models. As part of this effort, we are expanding Google Translate to include 24 under-resourced languages. For these languages, we created monolingual datasets by developing and using specialized neural language identification models combined with novel filtering approaches. The techniques we introduce supplement massively multilingual models with a self-supervised task to enable zero-resource translation. Finally, we highlight how native speakers have helped us realize this accomplishment.
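
For a feel of how monolingual text can feed the same seq2seq model, here is a minimal sketch of one such self-supervised task, MASS-style masked-span reconstruction: a monolingual sentence becomes a (corrupted input, masked span) training pair, so a zero-resource language still contributes examples. The span length and masking details below are illustrative assumptions, not the paper's exact recipe.

```python
import random

def mass_style_example(tokens, mask_token="<mask>", span_frac=0.5, seed=0):
    """Corrupt a monolingual sentence by masking a contiguous span; the
    encoder reads the corrupted sentence and the decoder learns to emit
    the missing span. Proportions here are illustrative assumptions."""
    rng = random.Random(seed)
    span_len = max(1, int(len(tokens) * span_frac))
    start = rng.randrange(len(tokens) - span_len + 1)
    source = tokens[:start] + [mask_token] * span_len + tokens[start + span_len:]
    target = tokens[start:start + span_len]
    return source, target

src, tgt = mass_style_example("allinllam kachkani hinaspa qamri".split())
print(src)  # e.g. ['allinllam', '<mask>', '<mask>', 'qamri']
print(tgt)  # e.g. ['kachkani', 'hinaspa']
```

In a massively multilingual setup, pairs like these are mixed into the same batches as genuine parallel data for higher-resource languages, with a target-language tag telling the single model which language to produce.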

[Image: new languages added to Google Translate]

[Image: the amount of monolingual data per language versus the amount of parallel (translated) data per language]

Paper: Building Machine Translation Systems for the Next Thousand Languages

lichtrausch
Blue Belt
Posts: 511
Joined: Thu Jul 23, 2015 3:21 pm
Languages: English (N), German, Japanese, Mandarin, Korean
x 1381

Re: Machine Translation

Postby lichtrausch » Wed Jun 22, 2022 7:55 pm

Advancing direct speech-to-speech modeling with discrete units (follow link for examples of the method in action, using Spanish and English)
To make it possible for people to easily understand each other while speaking different languages, we need more than just text-based translation systems. But the conventional approach to building speech-to-speech translation systems has faced two significant shortcomings. It uses a cascaded series of steps (speech recognition, then text-to-text translation, and finally conversion of the translated text back to speech), so computational cost and inference latency accumulate at each stage. In addition, more than 40 percent of the world’s languages have no writing system, making this approach infeasible for extending translation to every spoken language.
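
As a shape-of-the-problem sketch, the cascade described above looks like the following; asr, translate_text, and tts are hypothetical stubs, and the sleep calls only stand in for each stage's inference latency.

```python
import time

# Hypothetical stubs for the three cascaded stages; each sleep stands in
# for that stage's inference latency.
def asr(audio):
    time.sleep(0.05)                 # speech -> source-language text
    return "hola mundo"

def translate_text(text):
    time.sleep(0.05)                 # source text -> target text
    return "hello world"

def tts(text):
    time.sleep(0.05)                 # target text -> synthesized speech
    return b"\x00" * 16000           # fake waveform bytes

def cascaded_s2st(audio):
    """Conventional pipeline: cost and latency add up stage by stage, and
    the text-based middle step assumes the languages are written."""
    return tts(translate_text(asr(audio)))

start = time.perf_counter()
cascaded_s2st(audio=b"")
print(f"total latency: {time.perf_counter() - start:.2f}s")  # ~sum of stages
```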

To enable faster inference and support translation between unwritten languages, Meta AI is sharing new work on our direct speech-to-speech translation (S2ST) approach, which does not rely on text generation as an intermediate step. Our method outperforms previous approaches and is the first direct S2ST system trained on real-world, open-sourced audio data rather than synthetic audio, across multiple language pairs.
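
To make "discrete units" concrete: in Meta's setup, frame-level speech representations (HuBERT features) are quantized with k-means, and runs of repeated units are collapsed, giving the direct model a discrete target sequence instead of text. The sketch below fakes the features with random numbers; the feature size and 100-cluster vocabulary are arbitrary assumptions.

```python
import numpy as np
from itertools import groupby
from sklearn.cluster import KMeans

# Stand-in for frame-level speech features (e.g. HuBERT outputs):
# 200 frames x 64 dims of random numbers, purely for illustration.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 64))

# Quantize frames into a small discrete vocabulary via k-means;
# 100 clusters here is an arbitrary choice, not the system's setting.
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(features)
units = kmeans.predict(features)        # one unit ID per frame

# Collapse runs of repeated units into the "reduced" sequence, which is
# the target the direct S2ST model is trained to predict.
reduced = [u for u, _ in groupby(units)]
print(len(units), "frames ->", len(reduced), "units")
```

A separate unit-to-waveform vocoder then synthesizes speech from the predicted units, so text never appears anywhere in the loop.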

[Image: illustration of the direct S2ST model with discrete units]

