Is there a way to remove all words which have English cognates from this Spanish list?

Hashimi
Green Belt
Posts: 291
Joined: Sun Jan 10, 2016 12:45 pm
x 394

Postby Hashimi » Sat Mar 09, 2019 1:30 pm

This is a list published by the Real Academia Española (RAE) from analysis of more than 160 million word forms found in the Corpus of Current Spanish. I want to keep only the words that have no cognates in English. What is the best way to do this?

http://corpus.rae.es/frec/10000_formas.TXT

Doitsujin
Orange Belt
Posts: 179
Joined: Sat Jul 18, 2015 6:21 pm
Languages: German (N)
x 331

Re: Is there a way to remove all words which have English cognates from this Spanish list?

Postby Doitsujin » Sat Mar 09, 2019 4:00 pm

Hashimi wrote:What is the best way to do this?
That depends on your technical skills and your budget. If I had to do this, I might try one of the following methods:

a) Use approximate string matching

1. Translate the whole list with DeepL to English and save the Spanish and English words in a spreadsheet file.
2. Use the Levenshtein distance or a proper fuzzy-matching algorithm to calculate the similarity between the two words. E.g., manera <=> manner; Levenshtein distance = 2; similarity ≈ 67%. (Words that start with the same letter and have a similarity of ≥ 65% are most likely cognates.)
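
A minimal Python sketch of step 2. It assumes similarity is defined as 1 minus the edit distance divided by the length of the longer word, which is one definition that reproduces the ≈ 67% quoted for manera/manner. The function names and the default threshold argument are illustrative, not part of any particular library.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed definition: 1 - distance / length of the longer word."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def looks_like_cognate(spanish: str, english: str, threshold: float = 0.65) -> bool:
    """Same first letter and similarity at or above the threshold."""
    s, e = spanish.lower(), english.lower()
    return bool(s) and bool(e) and s[0] == e[0] and similarity(s, e) >= threshold
```

With this definition, looks_like_cognate("manera", "manner") is true, while a pair such as bolsa/bag falls well below the threshold.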

b) Use an existing cognate list

1. Several textbooks contain English-Spanish and/or Spanish-English cognate lists, for example, Resnick, Seymour - Essential Spanish Grammar.
There are also lots of cognate lists on the Internet, for example, this one.
2. It should be relatively easy to OCR them and to check the entries in the RAE list against them.

c) Outsource the task to low-cost websites, e.g. Amazon Mechanical Turk or fiverr.
Last edited by Doitsujin on Sat Mar 09, 2019 4:49 pm, edited 1 time in total.

Ser
Green Belt
Posts: 355
Joined: Thu Dec 01, 2016 5:28 am
Location: Vancouver, British Columbia, Canada
Languages: Spanish (N), English (feels like another mother tongue but it's not), French (intermediate), Latin/Ancient Greek/Mandarin (still sucking at them)
Language Log: https://forum.language-learners.org/vie ... =15&t=8737
x 833

Re: Is there a way to remove all words which have English cognates from this Spanish list?

Postby Ser » Sat Mar 09, 2019 4:34 pm

I'd just like to mention that doing it programmatically using strategy b) is actually harder than you outline, Doitsujin, as the RAE's file lists the same word multiple times in different shapes, e.g. the singular noun parte at #64 and its plural partes at #665, or the verb cubrir in various conjugation forms at #2930, #3618, #4566, #5323, #7415, #7492, #9427 and #9619. For strategy b), a morphological analyzer of Spanish should therefore be used within the program in order to recognize the citation form of words, which then could be matched against the cognate database created from various sources.
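
To illustrate, here is a rough Python sketch that maps each inflected form to a citation form before looking it up. The small LEMMAS table is a hand-made stand-in for a real morphological analyzer, and COGNATE_LEMMAS is a hypothetical cognate database; neither comes from the RAE file.

```python
# Hand-made stand-in for a real Spanish morphological analyzer (assumption).
LEMMAS = {
    "partes": "parte",
    "cubre": "cubrir",
    "cubren": "cubrir",
    "cubierto": "cubrir",
}

# Hypothetical cognate database keyed by citation form (assumption).
COGNATE_LEMMAS = {"parte", "cubrir"}


def lemma_of(form: str) -> str:
    """Return the citation form, falling back to the form itself."""
    form = form.lower()
    return LEMMAS.get(form, form)


def drop_cognates(forms):
    """Keep one entry per lemma, skipping lemmas found in the cognate database."""
    kept = []
    seen = set()
    for form in forms:
        lemma = lemma_of(form)
        if lemma in COGNATE_LEMMAS or lemma in seen:
            continue
        seen.add(lemma)
        kept.append(lemma)
    return kept
```

In practice the lookup table would be replaced by the analyzer's output; the merging and matching logic stays the same.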

It goes without saying that both strategy a) and strategy b) will also have plenty of false positives (cognates wrongly kept as non-cognates), as strategy a) relies on DeepL choosing a cognate (it may not even if there is one) and the Levenshtein similarity may be just under 65% for what are actually cognates, and as strategy b) has the problem that the RAE's list will have derived cognates not included in the database. The RAE's list also includes personal given names and family names such as Adolfo, Juanito, Ramírez and Menem, and things like Roman numerals (e.g. XVI at #3708).
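
A possible pre-filter for the names and numerals, sketched in Python. It assumes Roman numerals appear in upper case (as XVI does above) and that a separate list of names is available; KNOWN_NAMES here is a hypothetical placeholder holding only the four names mentioned.

```python
import re

# Matches upper-case Roman numerals from I to MMMCMXCIX; the lookahead rejects the empty string.
ROMAN = re.compile(
    r"^(?=[MDCLXVI])M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"
)

# Hypothetical name list (assumption); a real one would be much longer.
KNOWN_NAMES = {"adolfo", "juanito", "ramírez", "menem"}


def is_noise(token: str) -> bool:
    """True for upper-case Roman numerals and for listed personal names."""
    return bool(ROMAN.match(token)) or token.lower() in KNOWN_NAMES


def clean(tokens):
    """Drop numerals and names, keep everything else in order."""
    return [t for t in tokens if not is_noise(t)]
```

Note that single capital letters such as I, V or D also match the numeral pattern, so this only makes sense on a case-preserving list.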

Re: Is there a way to remove all words which have English cognates from this Spanish list?

Postby Hashimi » Sat Mar 09, 2019 11:26 pm

Doitsujin wrote:b) Use an existing cognate list


I have a list of more than 30,000 cognates but they are in English. I got it from this website:

http://www.cognates.org/h/index.html

You can enter any English text (or wordlist), and it will highlight the cognates in Spanish and other Romance languages. It uses this script:

http://www.cognates.org/h/ecl.js

But I want it the other way around, to highlight (or remove) the cognates in a Spanish text or wordlist.


Ser wrote:I'd just like to mention that doing it programmatically using strategy b) is actually harder than you outline, Doitsujin, as the RAE's file lists the same word multiple times in different shapes, e.g. the singular noun parte at #64 and its plural partes at #665, or the verb cubrir in various conjugation forms at #2930, #3618, #4566, #5323, #7415, #7492, #9427 and #9619. For strategy b), a morphological analyzer of Spanish should therefore be used within the program in order to recognize the citation form of words, which then could be matched against the cognate database created from various sources.

It goes without saying that both strategy a) and strategy b) will also have plenty of false positives (cognates wrongly kept as non-cognates), as strategy a) relies on DeepL choosing a cognate (it may not even if there is one) and the Levenshtein similarity may be just under 65% for what are actually cognates, and as strategy b) has the problem that the RAE's list will have derived cognates not included in the database. The RAE's list also includes personal given names and family names such as Adolfo, Juanito, Ramírez and Menem, and things like Roman numerals (e.g. XVI at #3708).


That's right. For the RAE's list, the first step is to lemmatize everything and remove proper names etc.

However, this list is only an example; I have other lists that contain lemmatized forms and no proper names.

zenmonkey
Black Belt - 2nd Dan
Posts: 2012
Joined: Sun Jul 26, 2015 7:21 pm
Location: Germany and France
Languages: Spanish, English, French trilingual - actively studying German (B2/C1), Hebrew, Tibetan, Setswana.
Some knowledge of Italian, Portuguese, Ladino, Yiddish ...
Want to tackle Tzotzil, Nahuatl
Language Log: viewtopic.php?f=15&t=859
x 4952

Re: Is there a way to remove all words which have English cognates from this Spanish list?

Postby zenmonkey » Sat Mar 09, 2019 11:42 pm

So this is intended to eliminate words like "bolsa", "avergonzado", "en" and "bolsillo"? Their English cognates are, respectively, "bourse", "verecund", "in" and, for bolsillo, the cognate of bolsa again, since it is a derivative of bolsa...

Keeping snow and nieve? Both are cognates of Latin nix (nom.; acc. nivem).

Hashimi wrote:But I want it the other way around, to highlight (or remove) the cognates in a Spanish text or wordlist.


If you have a cognate list already, pop it into Excel and use the VLOOKUP() function: mark each word TRUE if it is in the list, then keep only the FALSE rows...
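
The same lookup can be sketched outside a spreadsheet. This Python version assumes both lists are already in the same language and only normalizes case and surrounding whitespace; the function names are illustrative.

```python
def flag_cognates(words, cognate_list):
    """Pair each word with True if it appears in the cognate list, else False."""
    lookup = {w.strip().lower() for w in cognate_list}
    return [(w, w.strip().lower() in lookup) for w in words]


def keep_non_cognates(words, cognate_list):
    """Keep only the words flagged False, in their original order."""
    return [w for w, is_cognate in flag_cognates(words, cognate_list) if not is_cognate]
```

For example, keep_non_cognates(["manera", "bolsa"], ["manera"]) leaves only bolsa.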
Please feel free to correct me in any language, critique my posts, challenge my thoughts.
I am inconsistency incarnate.
Go study! Publisher of Syriac, Aramaic, Hebrew alphabet apps at http://alphabetsnow.zyntx.com

Re: Is there a way to remove all words which have English cognates from this Spanish list?

Postby Hashimi » Sun Mar 10, 2019 11:55 am

zenmonkey wrote:Keeping snow and nieve? Both are cognates of Latin nix (nom.; acc. nivem).


No. I'm talking about cognates that come directly from Latin or French, and not found in Old English or Proto-Germanic.

zenmonkey wrote:If you have a cognate list already, pop it into Excel and use the VLOOKUP() function: mark each word TRUE if it is in the list, then keep only the FALSE rows...


The problem is that list2 (the cognate list) is in English, so its entries are not identical in form to the Spanish words:

http://www.cognates.org/h/ecl.js

Re: Is there a way to remove all words which have English cognates from this Spanish list?

Postby zenmonkey » Sun Mar 10, 2019 6:39 pm

You can get the translations in Google Sheets with the GOOGLETRANSLATE() function.
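
Once the sheet holds Spanish/English pairs, the English column can be checked against the English cognate list. A Python sketch, assuming the sheet was exported as CSV with the hypothetical column names spanish and english:

```python
import csv
import io


def non_cognates_from_csv(csv_text, english_cognates):
    """Return Spanish words whose English translation is not in the cognate list."""
    cognates = {w.strip().lower() for w in english_cognates}
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row["spanish"]
        for row in reader
        if row["english"].strip().lower() not in cognates
    ]
```

This inherits the weakness already mentioned in the thread: a word is only caught if the translation happens to pick its cognate.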

Re: Is there a way to remove all words which have English cognates from this Spanish list?

Postby Hashimi » Sun Mar 10, 2019 8:58 pm

zenmonkey wrote:You can get the translations in Google Sheets with the GOOGLETRANSLATE() function.


Great! Today I learned something new. The best thing this week. Thank you so much, zenmonkey!

DaraghM
White Belt
Posts: 30
Joined: Wed Aug 17, 2016 7:57 am
x 63

Re: Is there a way to remove all words which have English cognates from this Spanish list?

Postby DaraghM » Tue Mar 12, 2019 10:36 am

Can I ask why you want to do this? If the purpose is to learn vocabulary efficiently by ignoring words that seem similar, you run the risk of missing some key differences, even with words that look identical to English. Just flicking through the list quickly, I see the following:

inferior - means lower in a general sense, not only to do with status, as the English word implies.
carbón - carbon, but also coal and charcoal.
oculta - rarely pertains to the occult, but commonly used to mean hide, cover, mask, etc.

Re: Is there a way to remove all words which have English cognates from this Spanish list?

Postby Ser » Tue Mar 12, 2019 10:44 am

Right. This is also important.

While we're at it, I feel that a very good resource could be made out of this list if only someone who knows both Spanish and English could go through all entries (after merging the instances of multiple forms of the same word into single entries) and add usage notes and grammatical information (verb stems, noun gender, lexical prepositions).

Stop staring at me. It hurts.