NLP database model (rare languages)

lingzz_langzz
Yellow Belt
Posts: 64
Joined: Sat Apr 25, 2020 4:57 pm
Location: Barcelona
Languages: N: Polish, C: English, Spanish, Hungarian, Catalan, B: Italian, Turkish, A: Czech
Language Log: https://forum.language-learners.org/vie ... 15&t=14260
x 118

Postby lingzz_langzz » Wed Jan 25, 2023 9:57 am

Hey!

I'm writing because I'm just starting my journey with AI, and more specifically with Natural Language Processing. My goal is to use it to build materials for Aragonese, and as you can imagine, there are no datasets for this language (and if there is something, I doubt it covers the dialect I'm interested in, Ribagorzano), so here's the question:

Does anybody here know of people who have built a dataset for a minority language and documented the process?
Or has anybody here maybe even built one themselves?

I'd appreciate any answers, thanks! :D

Deinonysus
Brown Belt
Posts: 1216
Joined: Tue Sep 13, 2016 6:06 pm
Location: MA, USA
Languages:
• Native: English
• Advanced: French
• Intermediate: German, Spanish, Hebrew
• Beginner: Italian, Arabic
x 4620

Postby Deinonysus » Wed Jan 25, 2023 10:33 am

I would suggest looking at this recent thread for a cautionary tale about careless AI use with minority languages, where native speakers were not consulted before the finished product was released.

Bad translations into native Alaskan languages

For another horror story of a non-native speaker releasing gibberish materials in a minority language (albeit without AI), look up the Scots Wikipedia scandal. The rogue editor "has possibly done more damage to the Scots language than anyone else in history. They engaged in cultural vandalism on a hitherto unprecedented scale."

https://www.theguardian.com/uk-news/202 ... -wikipedia

By all means have fun experimenting with machine learning, but make sure you take every precaution before releasing something to the public.
/daɪ.nə.ˈnaɪ.səs/

Iversen
Black Belt - 4th Dan
Posts: 4768
Joined: Sun Jul 19, 2015 7:36 pm
Location: Denmark
Languages: Monolingual travels in Danish, English, German, Dutch, Swedish, French, Portuguese, Spanish, Catalan, Italian, Romanian and (part time) Esperanto
Ahem, not yet: Norwegian, Afrikaans, Platt, Scots, Russian, Serbian, Bulgarian, Albanian, Greek, Latin, Irish, Indonesian and a few more...
Language Log: viewtopic.php?f=15&t=1027
x 14962

Postby Iversen » Wed Jan 25, 2023 12:04 pm

I faintly remember that there was a similar issue with the Wikipedia for one of the South Slavic languages. Such things can occur if native speakers from a language community don't care to do the job themselves, and when a Wikipedia looks suspicious, native speakers may not even want to look at it.

In the case of the Scots Wikipedia, as another base-level non-native user I of course can't judge exactly how bad it is, but I'm going to read through some articles to see whether I can spot the errors (insofar as Amaryllis wrote the articles). However, similar cases have occurred with other languages. For instance, I have seen Irish articles on the internet that didn't even put the finite verbs in initial position, which is one of the first rules you learn in that language. I haven't checked Google Translate lately, but I tested it thoroughly a few years ago and noticed that it mostly placed the verbs wrongly when translating from English into Irish, but sometimes got the placement right when translating in the other direction. And then there are of course all the words in different languages where you can see that the route via English has caused misunderstandings. So beware of bad materials and look for symptoms that something has been through a machine (or a newbie learner).

As for finding materials in Aragonese: well, I have not searched, but I guess that it will be an uphill project... There's a lot of stuff in Catalan, but even in Occitan (including the dialect allegedly spoken in the Aran Valley) the choice is limited. And to the best of my knowledge there has never really been an active revival movement aimed at keeping Aragonese alive, but the OP probably knows about any available sources already.

Some languages are described by a single linguist at the point where they have almost died out and only a few elderly people still speak them, if they can find anybody to speak to. And their language may have been polluted with elements from surrounding languages, as is claimed for instance for the last speaker of the extinct Dalmatian. But even a description that would satisfy most linguists might not be enough to learn to USE a language. You generally need a living community to cover all aspects of human life with words and expressions, and you also need such a community to keep the language (or dialect) alive and interesting.

But apart from that: you may build a dataset, but how comprehensible (and comprehensive too) does it have to be to make the use of AI feasible? Even Google Translate has problems with small and/or old languages, and it has its tentacles down into the whole of the internet...

leosmith
Brown Belt
Posts: 1341
Joined: Thu Sep 29, 2016 10:06 pm
Location: Seattle
Languages: English (N)
Spanish (adv)
French (int)
German (int)
Japanese (int)
Korean (int)
Mandarin (int)
Portuguese (int)
Russian (int)
Swahili (int)
Tagalog (int)
Thai (int)
x 3102

Postby leosmith » Thu Jan 26, 2023 9:03 am

lingzz_langzz wrote: Does anybody here know of people who have built a dataset for a minority language and documented the process? Or has anybody here maybe even built one themselves?
What do you mean by "dataset" here? I have commissioned native speakers to create natural conversations. The closest thing to a minority language I've done this with is Quechua, which is currently in progress. Full disclosure: I have a donor who is bearing over 90% of the cost.
https://languagecrush.com/reading - try our free multi-language reading tool

