Python: working with the Tatoeba database

Cainntear
Black Belt - 3rd Dan
Posts: 3877
Joined: Thu Jul 30, 2015 11:04 am
Location: Scotland
Languages: English(N)
Advanced: French, Spanish, Scottish Gaelic
Intermediate: Italian, Catalan, Corsican
Basic: Welsh
Dabbling: Polish, Russian etc
x 9474


Postby Cainntear » Sat Feb 04, 2017 7:47 pm

I've been writing scripts in Python to filter through the Tatoeba database.

This is pretty old-school -- no databases, every filter reads in the file, then writes out to a file.

I've used functions passed as arguments to make it easy to create new filters.

I've included the four basic filters I've been using, and I've added an example of a more complex filter at the bottom which hopefully demonstrates how to write most filters you might need -- but if there are any questions, just ask. The lambda function should take as its single argument a whole row from the detailed version of the database (a list of column values, as produced by the csv reader) and return a truth value -- true means the row will be included in the output file, false means it will be left out.

(I'm intending to integrate checking submitters' language skill at a later date, but this is what I've got at the moment.)

Code:


# Tatoeba sentence filter (Python 3)

import csv

def genericFilter (filterFunc, inputFile, outputFile) :
    # Copy every row for which filterFunc(row) is true from inputFile to outputFile.
    if outputFile is None :
        outputFile=inputFile+".2.csv"
    # newline="" lets the csv module handle line endings itself.
    with open (outputFile,"w",encoding="utf-8",newline="") as f_out :
        with open (inputFile,"r",encoding="utf-8",newline="") as f_in :
            csv_in = csv.reader (f_in,delimiter="\t")
            csv_out = csv.writer (f_out,delimiter="\t")
            for row in csv_in :
                if filterFunc(row):
                    csv_out.writerow(row)


# Columns used below: row[0] = sentence id, row[1] = language, row[3] = username.

def filterByLanguage (lang, inputFile="sentences_detailed.csv",outputFile=None) :
    genericFilter ( lambda row : row[1]==lang, inputFile, outputFile )

def filterByUser (user, inputFile="sentences_detailed.csv",outputFile=None) :
    genericFilter ( lambda row : row[3] == user, inputFile, outputFile )

def filterHasAudio (inputFile="sentences_detailed.csv",outputFile=None,audioFile="sentences_with_audio.csv") :
    # Collect the ids of all sentences that have audio, then keep only those rows.
    with open(audioFile,"r",encoding="utf-8",newline="") as f_audio:
        csv_audio = csv.reader (f_audio,delimiter="\t")
        audio = set([ row[0] for row in csv_audio ])
    genericFilter (lambda row : row[0] in audio, inputFile, outputFile)

def filterNoAudio (inputFile="sentences_detailed.csv",outputFile=None,audioFile="sentences_with_audio.csv") :
    # Same as filterHasAudio, but keep the rows whose id is NOT in the audio list.
    with open(audioFile,"r",encoding="utf-8",newline="") as f_audio:
        csv_audio = csv.reader (f_audio,delimiter="\t")
        audio = set([ row[0] for row in csv_audio ])
    genericFilter (lambda row : row[0] not in audio, inputFile, outputFile)

# Example of a more complex filter: two conditions combined in one lambda.
def filterByLanguageAndUser(lang, user, inputFile="sentences_detailed.csv",outputFile=None) :
    genericFilter ( lambda row : row[1]==lang and row[3]==user, inputFile, outputFile )
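
Since each filter is just a function from a row to a truth value, conditions can also be built separately and combined before being handed to genericFilter. The following is only a sketch of that idea, written for Python 3: the helper names (is_language, by_user, all_of, negate) and the sample rows are invented for illustration, and the column positions (1 = language, 3 = username) are taken from the filters above.

```python
# Sketch: small predicate builders that can be combined into one row filter.
# All helper names here are hypothetical; only the column positions
# (row[1] = language, row[3] = username) come from the filters above.

def is_language(lang):
    return lambda row: row[1] == lang

def by_user(user):
    return lambda row: row[3] == user

def all_of(*predicates):
    """Accept a row only if every predicate accepts it."""
    return lambda row: all(p(row) for p in predicates)

def negate(predicate):
    """Accept a row only if the given predicate rejects it."""
    return lambda row: not predicate(row)

# Invented sample rows laid out as id, language, text, username.
sample_rows = [
    ["1", "gla", "Tha mi sgìth.", "alice"],
    ["2", "gla", "Madainn mhath.", "bob"],
    ["3", "fra", "Bonjour.", "alice"],
]

# Gaelic sentences from anyone except "bob".
gaelic_not_bob = all_of(is_language("gla"), negate(by_user("bob")))
kept_ids = [row[0] for row in sample_rows if gaelic_not_bob(row)]
print(kept_ids)
```

A combined predicate such as gaelic_not_bob can be passed as the first argument to genericFilter in place of a hand-written lambda.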

Vedun
Orange Belt
Posts: 215
Joined: Tue Jun 21, 2016 1:36 pm
Languages: Bulgarian, English
German, Italian
Russian, Finnish
Language Log: viewtopic.php?f=15&t=3009
x 149

Re: Python: working with the Tatoeba database

Postby Vedun » Mon Feb 06, 2017 3:08 pm

I should learn Python just to get my hands on Tatoeba's DB.


Re: Python: working with the Tatoeba database

Postby Cainntear » Mon Feb 06, 2017 5:21 pm

Vedun wrote: I should learn Python just to get my hands on Tatoeba's DB.

You don't need Python -- you can download the files and load them into a proper database. My way's a bit of a hack.
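
For anyone who does want to go the database route from Python, here is a rough sketch using the standard-library sqlite3 module. It rests on several assumptions: that the first four columns of sentences_detailed.csv are id, language, text and username (positions 1 and 3 match the filters earlier in the thread; the text column at position 2 is assumed), and that the file is plain tab-separated text without CSV-style quoting. The function name, table name and index name are made up for the example.

```python
import csv
import sqlite3

def load_sentences(db_path, tsv_path="sentences_detailed.csv"):
    """Load a tab-separated sentence file into an SQLite table (sketch).

    Assumes the first four columns are id, language, text, username.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sentences ("
        "id INTEGER PRIMARY KEY, lang TEXT, text TEXT, username TEXT)"
    )
    with open(tsv_path, "r", encoding="utf-8", newline="") as f_in:
        # QUOTE_NONE: treat quote characters in sentences as ordinary text.
        reader = csv.reader(f_in, delimiter="\t", quoting=csv.QUOTE_NONE)
        # Skip malformed rows that have fewer than four columns.
        rows = ((int(r[0]), r[1], r[2], r[3]) for r in reader if len(r) >= 4)
        conn.executemany(
            "INSERT OR REPLACE INTO sentences VALUES (?, ?, ?, ?)", rows
        )
    # An index on (lang, username) speeds up the kind of lookups the filters do.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_lang_user ON sentences (lang, username)"
    )
    conn.commit()
    return conn

# Example use (file names are placeholders):
# conn = load_sentences("tatoeba.db")
# for row in conn.execute(
#         "SELECT id, text FROM sentences WHERE lang = ? AND username = ?",
#         ("gla", "some_user")):
#     print(row)
```

Once the data is loaded, each of the file-based filters becomes a single WHERE clause rather than a full pass over the file.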

mcthulhu
Orange Belt
Posts: 228
Joined: Sun Feb 26, 2017 4:01 pm
Languages: English (native); strong reading skills - Russian, Spanish, French, Italian, German, Serbo-Croatian, Macedonian, Bulgarian, Slovene, Farsi; fair reading skills - Polish, Czech, Dutch, Esperanto, Portuguese; beginner/rusty - Swedish, Norwegian, Danish
x 590

Re: Python: working with the Tatoeba database

Postby mcthulhu » Thu Jul 20, 2017 3:45 pm

I just query the Tatoeba Web site for sentences containing a given query term (whatever text is highlighted by the user), and then show example sentences in a popup window. The URL is simple enough to construct.

Code:

function searchTatoeba() {
   var lang = getISOLanguageCodeTrigraph(language);
   var term = getQueryTerm();
   // Encode the term so spaces, accents and "&" survive in the query string.
   var url = "https://tatoeba.org/eng/sentences/search?query=" + encodeURIComponent(term);
   url += "&from=" + lang + "&to=eng";
   openPopupWindow(url);
}
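
To match the Python used earlier in the thread, the same URL construction might look like the sketch below. The function name tatoeba_search_url is invented for the example; the base URL and the query, from and to parameters are the ones in the JavaScript snippet. It uses urllib.parse.urlencode so that the query term is percent-encoded, which matters for terms containing spaces, accented characters or "&".

```python
from urllib.parse import urlencode

def tatoeba_search_url(term, from_lang, to_lang="eng"):
    """Build a Tatoeba search URL with a safely encoded query term (sketch)."""
    base = "https://tatoeba.org/eng/sentences/search"
    params = {"query": term, "from": from_lang, "to": to_lang}
    return base + "?" + urlencode(params)

# Example: a French term containing a space and an accented character.
print(tatoeba_search_url("ça va", "fra"))
```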


You'd probably get better performance from a database. I have a local SQLite sentence database (translation memory), and could do a bulk import of whatever I retrieve from Tatoeba. I will probably add that later on, though I have some concerns about how selective I want to be with this data.

