Is there a way to remove all words which have English cognates from this Spanish list?

Doitsujin
Green Belt
Posts: 402
Joined: Sat Jul 18, 2015 6:21 pm
Languages: German (N)
x 801

Re: Is there a way to remove all words which have English cognates from this Spanish list?

Postby Doitsujin » Sat Mar 09, 2019 4:00 pm

Hashimi wrote:What is the best way to do this?
That depends on your technical skills and your budget. If I had to do this, I might try one of the following methods:

a) Use approximate string matching

1. Translate the whole list into English with DeepL and save the Spanish and English words in a spreadsheet file.
2. Use the Levenshtein distance or a proper fuzzy-matching algorithm to calculate the similarity between the two words. E.g., manera <=> manner; Levenshtein distance = 2; similarity = 1 - 2/6 ≈ 67%. (Words that start with the same letter and have a similarity of ≥ 65% are most likely cognates.)
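
Step 2 of method a) can be sketched roughly as follows. This is only a minimal illustration, assuming similarity is defined as 1 minus the distance divided by the length of the longer word (which reproduces the 67% figure for manera/manner). The function names and the accent stripping are my own additions, not part of any particular library.

```python
import unicodedata


def strip_accents(word: str) -> str:
    """Lowercase a word and drop combining marks (so 'carbón' -> 'carbon').

    Assumption: accents are ignored when comparing. Note that this also
    turns 'ñ' into 'n'.
    """
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 - distance / length of the longer word."""
    a, b = strip_accents(a), strip_accents(b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest


def looks_like_cognate(spanish: str, english: str, threshold: float = 0.65) -> bool:
    """Apply the rule of thumb: same first letter and similarity >= threshold."""
    s, e = strip_accents(spanish), strip_accents(english)
    if not s or not e:
        return False
    return s[0] == e[0] and similarity(s, e) >= threshold


print(levenshtein("manera", "manner"))         # 2
print(looks_like_cognate("manera", "manner"))  # True
```

The 65% threshold is just the rule of thumb from above; it would need tuning against a hand-checked sample.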

b) Use an existing cognate list

1. Several textbooks contain English-Spanish and/or Spanish-English cognate lists, for example, Resnick, Seymour - Essential Spanish Grammar.
There are also lots of cognate lists on the Internet, for example, this one.
2. It should be relatively easy to OCR them and to check the entries in the RAE list against them.
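
Once the cognate list is in plain text, the check in step 2 is essentially a set lookup. A minimal sketch, assuming both lists already hold Spanish word forms (the sample words and function names are made up for illustration):

```python
def normalise(word: str) -> str:
    """Trim whitespace and lowercase so that 'Inferior ' matches 'inferior'."""
    return word.strip().lower()


def drop_known_cognates(frequency_list, cognate_list):
    """Return the frequency-list entries that are NOT in the cognate list.

    The original order of the frequency list is preserved.
    """
    known = {normalise(w) for w in cognate_list}
    return [w for w in frequency_list if normalise(w) not in known]


# Hypothetical sample data, not taken from the RAE file or any real cognate list.
frequency_list = ["parte", "manera", "bolsa", "Inferior"]
cognate_list = ["manera", "inferior", "parte"]

remaining = drop_known_cognates(frequency_list, cognate_list)
print(remaining)  # ['bolsa']
```

This only catches exact matches after normalisation, so OCR errors in the cognate list would have to be cleaned up first.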

c) Outsource the task to low-cost websites, e.g. Amazon Mechanical Turk or fiverr.
Last edited by Doitsujin on Sat Mar 09, 2019 4:49 pm, edited 1 time in total.

Querneus
Blue Belt
Posts: 836
Joined: Thu Dec 01, 2016 5:28 am
Location: Vancouver, Canada
Languages: Speaks: Spanish (N), English
Studying: Latin, French, Mandarin
x 2269

Re: Is there a way to remove all words which have English cognates from this Spanish list?

Postby Querneus » Sat Mar 09, 2019 4:34 pm

I'd just like to mention that doing it programmatically using strategy b) is actually harder than you outline, Doitsujin, as the RAE's file lists the same word multiple times in different forms, e.g. the singular noun parte at #64 and its plural partes at #665, or the verb cubrir in various conjugated forms at #2930, #3618, #4566, #5323, #7415, #7492, #9427 and #9619. For strategy b), the program should therefore use a morphological analyzer for Spanish to recognize the citation form of each word, which could then be matched against the cognate database created from various sources.
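
To illustrate what merging the forms would look like, here is a rough sketch. The small lookup table is only a stand-in for a real morphological analyzer, and apart from parte (#64) and partes (#665) the ranks are placeholders I made up, not the actual positions in the RAE file.

```python
# Hypothetical stand-in for a real morphological analyzer: a hand-written
# table that maps inflected forms to their citation form (lemma).
LEMMA_TABLE = {
    "partes": "parte",
    "cubre": "cubrir",
    "cubrió": "cubrir",
}


def lemma_of(form: str) -> str:
    """Return the citation form, falling back to the form itself."""
    form = form.lower()
    return LEMMA_TABLE.get(form, form)


def merge_by_lemma(ranked_forms):
    """Group (rank, form) pairs by lemma, keeping the best rank per lemma."""
    merged = {}
    for rank, form in ranked_forms:
        entry = merged.setdefault(lemma_of(form), {"best_rank": rank, "forms": []})
        entry["best_rank"] = min(entry["best_rank"], rank)
        entry["forms"].append(form)
    return merged


# parte/partes ranks are the ones mentioned above; the cubrir ranks are placeholders.
sample = [(64, "parte"), (665, "partes"), (1001, "cubre"), (1002, "cubrió")]
merged = merge_by_lemma(sample)
print(merged["parte"])  # {'best_rank': 64, 'forms': ['parte', 'partes']}
```

After this step each lemma only needs to be checked against the cognate database once.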

It goes without saying that both strategy a) and strategy b) will also make plenty of mistakes. Strategy a) relies on DeepL choosing a cognate (it may not, even if there is one), and the Levenshtein similarity may fall just under 65% for words that are actually cognates. Strategy b) has the problem that the RAE's list will contain derived cognates that are not included in the database. The RAE's list also includes personal given names and family names such as Adolfo, Juanito, Ramírez and Menem, and things like Roman numerals (e.g. XVI at #3708).
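
The names and Roman numerals could be screened out before either strategy runs. A rough sketch, assuming the file keeps the original capitalisation (as XVI and Adolfo suggest); the heuristics and function names are mine and will certainly misfire on some entries:

```python
import re

# Standard pattern for Roman numerals up to 3999. It also matches the empty
# string, so callers must check for non-empty input separately.
ROMAN_NUMERAL = re.compile(
    r"^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"
)


def is_roman_numeral(token: str) -> bool:
    """True for non-empty, all-uppercase tokens that form a valid Roman numeral."""
    return bool(token) and token.isupper() and ROMAN_NUMERAL.match(token) is not None


def looks_like_proper_name(token: str) -> bool:
    """Heuristic: first letter uppercase, remaining letters lowercase."""
    return token[:1].isupper() and token[1:].islower()


def clean_entries(entries):
    """Drop entries that look like Roman numerals or proper names."""
    return [e for e in entries
            if not is_roman_numeral(e) and not looks_like_proper_name(e)]


print(clean_entries(["parte", "Adolfo", "XVI", "Ramírez", "bolsa", "Menem"]))
# ['parte', 'bolsa']
```

If the list were lowercased, neither check would work, and a list of known names would be needed instead.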

zenmonkey
Black Belt - 2nd Dan
Posts: 2528
Joined: Sun Jul 26, 2015 7:21 pm
Location: California, Germany and France
Languages: Spanish, English, French trilingual - German (B2/C1) on/off study: Persian, Hebrew, Tibetan, Setswana.
Some knowledge of Italian, Portuguese, Ladino, Yiddish ...
Want to tackle Tzotzil, Nahuatl
Language Log: viewtopic.php?f=15&t=859
x 7030

Re: Is there a way to remove all words which have English cognates from this Spanish list?

Postby zenmonkey » Sat Mar 09, 2019 11:42 pm

So this is intended to go and eliminate words like "bolsa", "avergonzado", "en" and "bolsillo"? Cognates in English are respectively "bourse", "verecund", "in" and a derivative of bolsa...

keeping snow and nieve? Both cognates of nix (nom.; acc. nivem).

But I want it the other way around, to highlight (or remove) the cognates in a Spanish text or wordlist.


If you already have a cognate list, pop it into Excel and use the VLOOKUP() function: flag each word as true if it is in the list, and then just keep the rows flagged false...
I am a leaf on the wind, watch how I soar

zenmonkey
Black Belt - 2nd Dan
Posts: 2528
Joined: Sun Jul 26, 2015 7:21 pm
Location: California, Germany and France
Languages: Spanish, English, French trilingual - German (B2/C1) on/off study: Persian, Hebrew, Tibetan, Setswana.
Some knowledge of Italian, Portuguese, Ladino, Yiddish ...
Want to tackle Tzotzil, Nahuatl
Language Log: viewtopic.php?f=15&t=859
x 7030

Re: Is there a way to remove all words which have English cognates from this Spanish list?

Postby zenmonkey » Sun Mar 10, 2019 6:39 pm

Hashimi wrote:
zenmonkey wrote:keeping snow and nieve? Both cognates of nix (nom.; acc. nivem).


No. I'm talking about cognates that come directly from Latin or French, and not found in Old English or Proto-Germanic.

zenmonkey wrote:If you have a cognate list already, pop that into Excel and use a vlookup() function, if it is in the list set to true, and then just keep the false matches...


The problem is that list2 is in English, so they are not identical in form:

http://www.cognates.org/h/ecl.js


You can get the translations using Google Sheets and the GOOGLETRANSLATE() function.
I am a leaf on the wind, watch how I soar

DaraghM
White Belt
Posts: 37
Joined: Wed Aug 17, 2016 7:57 am
x 91

Re: Is there a way to remove all words which have English cognates from this Spanish list?

Postby DaraghM » Tue Mar 12, 2019 10:36 am

Can I ask why you want to do this? If the purpose is to learn vocabulary efficiently by ignoring words that seem similar, you run the risk of missing some key differences, even with words that look identical to English. Just flicking through the list quickly, I see the following:

inferior - means lower in a general sense, not only to do with status, as the English word implies.
carbón - carbon, but also coal and charcoal.
oculta - rarely pertains to the occult, but commonly used to mean hide, cover, mask, etc.

Querneus
Blue Belt
Posts: 836
Joined: Thu Dec 01, 2016 5:28 am
Location: Vancouver, Canada
Languages: Speaks: Spanish (N), English
Studying: Latin, French, Mandarin
x 2269

Re: Is there a way to remove all words which have English cognates from this Spanish list?

Postby Querneus » Tue Mar 12, 2019 10:44 am

DaraghM wrote:Can I ask why you want to do this? If the purpose is to learn vocabulary efficiently by ignoring words that seem similar, you run the risk of missing some key differences, even with words that look identical to English. Just flicking through the list quickly, I see the following:

inferior - means lower in a general sense, not only to do with status, as the English word implies.
carbón - carbon, but also coal and charcoal.
oculta - rarely pertains to the occult, but commonly used to mean hide, cover, mask, etc.

Right. This is also important.

While we're at it, I feel that a very good resource could be made out of this list if only someone who knows both Spanish and English could go through all entries (after merging the instances of multiple forms of the same word into single entries) and add usage notes and grammatical information (verb stems, noun gender, lexical prepositions).

Stop staring at me. It hurts.

白田龍
Orange Belt
Posts: 242
Joined: Wed Mar 21, 2018 6:54 pm
Languages: English, Portuguese, Spanish, Catalan, French, Persian, Arabic, Mandarin, Japanese.
x 444

Re: Is there a way to remove all words which have English cognates from this Spanish list?

Postby 白田龍 » Tue Mar 12, 2019 6:28 pm

I'd parse the entries of an English and a Spanish monolingual dictionary to extract the Classical Latin source word from each etymology; then it would be easy to filter out all words that have the same origin.
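
Assuming the etymologies had already been parsed into word-to-Latin-source tables, the filtering step could look roughly like this. The two tables below are tiny hand-entered examples for illustration only, not parsed from any actual dictionary, and the function name is made up.

```python
# Toy etymology tables (hand-entered for illustration). None means the
# dictionary entry gives no Latin source for the word.
SPANISH_ETYMA = {"parte": "pars", "cubrir": "cooperire", "nieve": "nix"}
ENGLISH_ETYMA = {"part": "pars", "cover": "cooperire", "snow": None}


def split_by_shared_etymon(spanish_etyma, english_etyma):
    """Split Spanish words by whether their Latin source also appears
    as the source of some English word.

    Returns (shared, unshared): a dict of Spanish word -> Latin source for
    words with a shared origin, and a list of the remaining Spanish words.
    """
    english_sources = {src for src in english_etyma.values() if src}
    shared = {w: src for w, src in spanish_etyma.items() if src in english_sources}
    unshared = [w for w, src in spanish_etyma.items() if src not in english_sources]
    return shared, unshared


shared, unshared = split_by_shared_etymon(SPANISH_ETYMA, ENGLISH_ETYMA)
print(shared)    # {'parte': 'pars', 'cubrir': 'cooperire'}
print(unshared)  # ['nieve']
```

The hard part is of course the parsing itself, since dictionaries cite Latin sources in inconsistent forms (nominative vs. accusative, with or without macrons), so the source words would need normalising before comparison.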

Or just delete all words, 99.9% will be cognates if you go down to Proto-Indo-European 8-)

