
Re: Is there a way to remove all words which have English cognates from this Spanish list?

Posted: Sat Mar 09, 2019 4:00 pm
by Doitsujin
Hashimi wrote:What is the best way to do this?
That depends on your technical skills and your budget. If I had to do this, I might try one of the following methods:

a) Use approximate string matching

1. Translate the whole list with DeepL to English and save the Spanish and English words in a spreadsheet file.
2. Use the Levenshtein distance or a proper fuzzy matching algorithm to calculate the similarity between the two words. E.g., manera <=> manner: Levenshtein distance = 2; similarity ≈ 67% (1 − 2/6, since both words have six letters). (Words that start with the same letter and have a similarity of ≥ 65% are most likely cognates.)
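
Step 2 can be sketched in Python roughly as follows. This is only a minimal illustration, not a tested pipeline: the function names are made up for the example, and normalizing by the longer word's length is an assumption (the post does not say how the 67% figure is derived, although this normalization reproduces it for manera/manner).

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 minus distance divided by the longer word's length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def looks_like_cognate(spanish: str, english: str, threshold: float = 0.65) -> bool:
    """Heuristic from the post: same first letter and similarity >= threshold."""
    s, e = spanish.lower(), english.lower()
    return bool(s) and bool(e) and s[0] == e[0] and similarity(s, e) >= threshold


print(levenshtein("manera", "manner"))           # 2
print(round(similarity("manera", "manner"), 2))  # 0.67
print(looks_like_cognate("manera", "manner"))    # True
```

The 0.65 threshold is the one suggested above; it would probably need tuning against a hand-checked sample of the list.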

b) Use an existing cognate list

1. Several textbooks contain English-Spanish and/or Spanish-English cognate lists, for example, Resnick, Seymour - Essential Spanish Grammar.
There are also lots of cognate lists on the Internet, for example, this one.
2. It should be relatively easy to OCR them and to check the entries in the RAE list against them.

c) Outsource the task to a low-cost crowdsourcing or freelancing website, e.g. Amazon Mechanical Turk or Fiverr.

Re: Is there a way to remove all words which have English cognates from this Spanish list?

Posted: Sat Mar 09, 2019 4:34 pm
by Querneus
I'd just like to mention that doing it programmatically using strategy b) is actually harder than you outline, Doitsujin, as the RAE's file lists the same word multiple times in different shapes, e.g. the singular noun parte at #64 and its plural partes at #665, or the verb cubrir in various conjugation forms at #2930, #3618, #4566, #5323, #7415, #7492, #9427 and #9619. For strategy b), a morphological analyzer of Spanish should therefore be used within the program in order to recognize the citation form of words, which then could be matched against the cognate database created from various sources.
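
A minimal Python sketch of that lemmatize-then-match idea is below. It is an illustration under loud assumptions: `LEMMAS` is a tiny hand-written table standing in for a real Spanish morphological analyzer, `COGNATE_LEMMAS` is a made-up two-entry cognate database, and the inflected forms shown are just examples, not necessarily the ones found at the positions cited above.

```python
# Hypothetical form -> citation-form table; a real program would get this
# from a Spanish morphological analyzer instead of a hand-written dict.
LEMMAS = {
    "partes": "parte",
    "cubre": "cubrir",
    "cubierto": "cubrir",
    "cubrió": "cubrir",
}

# Hypothetical cognate database, keyed by Spanish citation form.
COGNATE_LEMMAS = {"parte", "cubrir"}


def citation_form(word: str) -> str:
    """Map an inflected form to its citation form; unknown words map to themselves."""
    word = word.lower()
    return LEMMAS.get(word, word)


def drop_cognates(words):
    """Keep only the words whose citation form is not in the cognate database."""
    return [w for w in words if citation_form(w) not in COGNATE_LEMMAS]


frequency_list = ["parte", "partes", "cubre", "cubierto", "bolsillo"]
print(drop_cognates(frequency_list))  # ['bolsillo']
```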

It goes without saying that both strategy a) and strategy b) will also have plenty of false positives: strategy a) relies on DeepL choosing a cognate (it may not, even if there is one), and the Levenshtein-based similarity may fall just under 65% for what are actually cognates, while strategy b) has the problem that the RAE's list will contain derived cognates not included in the database. The RAE's list also includes personal given names and family names such as Adolfo, Juanito, Ramírez and Menem, and things like Roman numerals (e.g. XVI at #3708).

Re: Is there a way to remove all words which have English cognates from this Spanish list?

Posted: Sat Mar 09, 2019 11:42 pm
by zenmonkey
So this is intended to go and eliminate words like "bolsa", "avergonzado", "en" and "bolsillo"? Cognates in English are respectively "bourse", "verecund", "in" and a lemma of bolsa...

keeping snow and nieve? Both cognates of nix (nom, akk. nivium).

But I want it the other way around, to highlight (or remove) the cognates in a Spanish text or wordlist.


If you have a cognate list already, pop that into Excel and use a vlookup() function, if it is in the list set to true, and then just keep the false matches...
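
For anyone who would rather script it than use a spreadsheet, the same lookup can be sketched in Python. This is only a rough equivalent of the VLOOKUP approach, assuming the cognate list is already in Spanish; the function names and the sample words are invented for the example.

```python
def flag_cognates(spanish_words, cognate_list):
    """Pair each word with True/False, like a lookup column in a spreadsheet."""
    known = {w.strip().lower() for w in cognate_list}
    return [(w, w.strip().lower() in known) for w in spanish_words]


def keep_non_cognates(spanish_words, cognate_list):
    """Keep only the rows flagged False, i.e. words not found in the cognate list."""
    return [w for w, is_cognate in flag_cognates(spanish_words, cognate_list)
            if not is_cognate]


# Made-up sample data for illustration only.
words = ["manera", "bolsillo", "inferior", "nieve"]
cognates = ["Manera", "inferior "]

print(flag_cognates(words, cognates))
print(keep_non_cognates(words, cognates))  # ['bolsillo', 'nieve']
```

Lower-casing and stripping whitespace before the comparison avoids the exact-match misses that stray spaces or capitals tend to cause in spreadsheet lookups.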

Re: Is there a way to remove all words which have English cognates from this Spanish list?

Posted: Sun Mar 10, 2019 6:39 pm
by zenmonkey
Hashimi wrote:
zenmonkey wrote:keeping snow and nieve? Both cognates of nix (nom, akk. nivium).


No. I'm talking about cognates that come directly from Latin or French, and not found in Old English or Proto-Germanic.

zenmonkey wrote:If you have a cognate list already, pop that into Excel and use a vlookup() function, if it is in the list set to true, and then just keep the false matches...


The problem is that list2 is in English, so they are not identical in form:

http://www.cognates.org/h/ecl.js


You can get the translations using Google Sheets and the GOOGLETRANSLATE() function.

Re: Is there a way to remove all words which have English cognates from this Spanish list?

Posted: Tue Mar 12, 2019 10:36 am
by DaraghM
Can I ask why you want to do this? If the purpose is to learn vocabulary in an efficient manner, by ignoring words that seem similar, you run the risk of missing some key differences even with words that look identical to English. Just flicking through the list quickly, I see the following:

inferior - means lower in a general sense, not only to do with status, as the English word implies.
carbón - carbon, but also coal and charcoal.
oculta - rarely pertains to the occult, but commonly used to mean hide, cover, mask, etc.

Re: Is there a way to remove all words which have English cognates from this Spanish list?

Posted: Tue Mar 12, 2019 10:44 am
by Querneus
DaraghM wrote:Can I ask why you want to do this? If the purpose is to learn vocabulary in an efficient manner, by ignoring words that seem similar, you run the risk of missing some key differences even with words that look identical to English. Just flicking through the list quickly, I see the following:

inferior - means lower in a general sense, not only to do with status, as the English word implies.
carbón - carbon, but also coal and charcoal.
oculta - rarely pertains to the occult, but commonly used to mean hide, cover, mask, etc.

Right. This is also important.

While we're at it, I feel that a very good resource could be made out of this list if only someone who knows both Spanish and English could go through all entries (after merging the instances of multiple forms of the same word into single entries) and add usage notes and grammatical information (verb stems, noun gender, lexical prepositions).

Stop staring at me. It hurts.

Re: Is there a way to remove all words which have English cognates from this Spanish list?

Posted: Tue Mar 12, 2019 6:28 pm
by 白田龍
I'd parse the entries from English and Spanish monolingual dictionaries to extract the Classical Latin etyma from the etymologies. Then it would be easy to filter out all words that have the same origin.
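
Once the etyma are extracted, the filtering step might look something like this in Python. It is a sketch only: the two dictionaries are hand-entered toy data standing in for parsed dictionary etymologies (the parsing itself is the hard part and is not shown), and the function name is made up.

```python
# Toy data: word -> Latin etymon, as if already parsed from each dictionary.
# None marks a word whose etymology gives no Latin source.
SPANISH_ETYMA = {"parte": "pars", "cubrir": "cooperire", "nieve": "nix"}
ENGLISH_ETYMA = {"part": "pars", "cover": "cooperire", "snow": None}


def shares_latin_origin(spanish_etyma, english_etyma):
    """Return the Spanish words whose Latin etymon is also the source of an English word."""
    english_sources = {etymon for etymon in english_etyma.values() if etymon}
    return {word for word, etymon in spanish_etyma.items()
            if etymon in english_sources}


cognates = shares_latin_origin(SPANISH_ETYMA, ENGLISH_ETYMA)
remaining = [word for word in SPANISH_ETYMA if word not in cognates]

print(sorted(cognates))  # ['cubrir', 'parte']
print(remaining)         # ['nieve']
```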

Or just delete all words, 99.9% will be cognates if you go down to Proto-Indo-European 8-)