I've been wanting to make a simple course/tool to help me in my Basque studies for a while now. I know some members have done similar things (Cainntear posted something similar not that long ago).
I've compiled a list of about 550 movie subtitles, a few short films, and another 200 or so pulled from TV shows. My idea is to write some simple scripts, preferably in Python (I might make something more robust in C later), to strip the subtitles of unnecessary data (time stamps, HTML markup, etc.) and organize them somehow, perhaps putting them into a database. I'd like to preserve the order of the sentences to make it easier to view the next sentence if I wanted (e.g. if the sentence was "How old are you?" I could also pull the next sentence, "I'm 33."). I also want to use the corpus to build a frequency dictionary; I could write a simple script myself, but there are also other more robust tools available.
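For illustration, the stripping step might look something like this in Python. This is a rough sketch (the .srt layout assumptions and the `clean_srt` helper are mine, not a tested pipeline):

```python
import re

def clean_srt(text):
    """Strip an .srt subtitle file down to its ordered dialogue lines."""
    lines = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.isdigit():                   # cue number
            continue
        if '-->' in line:                    # time stamp line
            continue
        line = re.sub(r'<[^>]+>', '', line)  # HTML-ish tags (<i>, <b>, ...)
        lines.append(line)
    return lines

sample = """1
00:00:01,000 --> 00:00:03,000
<i>Zenbat urte dituzu?</i>

2
00:00:03,500 --> 00:00:05,000
33 urte ditut."""

print(clean_srt(sample))  # ['Zenbat urte dituzu?', '33 urte ditut.']
```

Because the output list keeps the original order, the "next sentence" lookup is just the next list index (or the next row id, if the lines go into a database).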
So my end goal is this: a set of sentences with each word tagged with its frequency (within my small corpus, and excluding capitalized words to avoid proper nouns skewing things), which I can use to weight sentences based on the frequency of their words.
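As a sketch of that frequency-dictionary step, a count that skips capitalised tokens (a rough proper-noun filter, which also drops sentence-initial words) could be as simple as this; the sample sentences are invented:

```python
from collections import Counter

def build_freq(sentences):
    """Count word frequencies, ignoring capitalised tokens (rough proper-noun filter)."""
    freq = Counter()
    for sent in sentences:
        for word in sent.split():
            word = word.strip('.,!?"')
            if word and not word[0].isupper():
                freq[word] += 1
    return freq

corpus = ["Nora zoaz orain ?", "Mikel etxera doa .", "etxera noa orain ."]
freq = build_freq(corpus)
print(freq['etxera'])   # 2
print('Mikel' in freq)  # False: capitalised, treated as a proper noun
```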
I've set up a list of about 160 of the most common grammar topics that I want to cover, so I can search for terms related to each grammar topic, pull sentences with the most common vocabulary, and then make a course that gradually introduces new grammar and vocab.
The issue is I'm not sure what the best way to organize the data would be. Would MySQL be able to reasonably handle such a large dataset? Some searching online led me to Apache Spark, but I'm not sure if it's really worth the trouble.
[idea] Using large corpus to create language course
-
- Blue Belt
- Posts: 514
- Joined: Mon Nov 30, 2015 3:35 pm
- Languages: EN (N), ES, ZH
Maintain: EUS, YUE, JP, HAW
Study: TGL, SV
On Hold: RU
- x 953
- MorkTheFiddle
- Black Belt - 2nd Dan
- Posts: 2141
- Joined: Sat Jul 18, 2015 8:59 pm
- Location: North Texas USA
- Languages: English (N). Read (only) French and Spanish. Studying Ancient Greek. Studying a bit of Latin. Once studied Old Norse. Dabbled in Catalan, Provençal and Italian.
- Language Log: https://forum.language-learners.org/vie ... 11#p133911
- x 4883
Re: [idea] Using large corpus to create language course
You are planning a very ambitious and praiseworthy project. How large is your dataset, and what will your front end be? Probably I am telling you what you already know, but I would create a working mini-model of what I wanted in whatever database is convenient, making sure that I can do with the data whatever I want. Here is one discussion of the potential size and flexibility of a MySQL table: http://stackoverflow.com/questions/48633/maximum-table-size-for-a-mysql-database. Note that it's a rather old thread.
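For instance, a throwaway mini-model in SQLite (which ships with Python) would let you test the "next sentence" lookup before committing to MySQL. The table and column names here are just placeholders:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('''CREATE TABLE sentences(
                    id INTEGER PRIMARY KEY,   -- preserves corpus order
                    film TEXT,
                    text TEXT)''')
rows = [(1, 'demo', 'Zenbat urte dituzu?'),
        (2, 'demo', '33 urte ditut.')]
conn.executemany('INSERT INTO sentences VALUES (?, ?, ?)', rows)

# Fetch a sentence plus the one that follows it in the same film.
cur = conn.execute('''SELECT text FROM sentences
                      WHERE film = ? AND id >= ? ORDER BY id LIMIT 2''',
                   ('demo', 1))
print([r[0] for r in cur])  # ['Zenbat urte dituzu?', '33 urte ditut.']
```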
1 x
Many things which are false are transmitted from book to book, and gain credit in the world. -- attributed to Samuel Johnson
-
- Blue Belt
- Posts: 514
- Joined: Mon Nov 30, 2015 3:35 pm
- Languages: EN (N), ES, ZH
Maintain: EUS, YUE, JP, HAW
Study: TGL, SV
On Hold: RU
- x 953
Re: [idea] Using large corpus to create language course
I'm going to be diving into things today after work, but I'd say a conservative estimate is that the films average 1,000 lines each and the TV episodes around 500 lines. So that's 550*1,000 + 200*500 = approx. 650,000 lines. If the average line has 5 words, that's (a conservative) 3.25 million words.
Thanks for the ideas. I've read through that discussion and a few others on Stack Overflow, and it seems I'm right around that borderline. I think it's as you said: build something to work with a smaller portion of my data first and see how that goes.
0 x
- Seneca
- Green Belt
- Posts: 268
- Joined: Sat Jun 11, 2016 5:08 pm
- Location: Eurasia
- Languages: English (N); 日本語 (beginner)
- x 352
Re: [idea] Using large corpus to create language course
How will you be using this to learn Basque? Just watching the media and reading in dual subtitles?
0 x
-
- Blue Belt
- Posts: 514
- Joined: Mon Nov 30, 2015 3:35 pm
- Languages: EN (N), ES, ZH
Maintain: EUS, YUE, JP, HAW
Study: TGL, SV
On Hold: RU
- x 953
Re: [idea] Using large corpus to create language course
I'm working on it with a Basque friend, so my idea is that we will put together a list of sentences which introduce the most common 3,000-4,000 words (from the corpus, at least) and cover the different grammar points that I put together (the ones at the end are still unorganized):
Code: Select all
Ikasgai (Lesson)   Gramatika (Grammar)   Mota (Type)   Informazio (Information)   Laburdurak (Abbreviations)
1 Verb Synth: pres Izan synth synthetic
Pronouns Nor ni, hura, gu, zu, zuek, haiek sg singular
Case Nor article -a/-ak pl plural
2 Demonstratives Nor: sg hau, hori, hura pres present
3 Question Zer perf perfect
4 Question Yes/No obj object
5 Verb Synth: pres Ukan: sg obj (dut, du, dugu, duzu, duzue, dute) pr person
Case Nork
6 Demonstratives Nork: sg honek, horrek, hark
7 Numbers 1-10 2nd conditional potential hypothetical
Mugagabe
Question Zenbat
8 Numbers 11-20
9 Demonstratives Nor/Nork: pl hauek, horiek, haiek
10 Time 1:00, 2:00, 3:00, etc.
11 Verb Synth: pres Egon
12 Case Non (all forms: sg, pl, muga, prop. nouns)
13 Demonstratives Location hemen, hor, han
14 Verb Compare egon vs izan
15 Question zein
Pronouns Possessive nire, haren, gure, zure, zuen, haien
16 Verb Compound: pres perf (nor) etorri, joan
17 Verb Compound: pres perf (nork) ikusi, ikasi
18 Verb Synth: pres Ukan: pl obj (ditut, ditu, ditugu, dituzu, dituzue)
19 Case Noren
20 Suffix: Verb -ta, -a/-ak beteta, irekita, pagatuak
21 Suffix: Noun -rik ez dago ardorik
22 Case Norekin
23 Verb Compound: habitual participle afaltzen, ikasten, aurkezten
Verb Compound: future -ko/-go
24 Verb Synth: pres joan
Verb Synth: pres etorri
Case Nora/Nondik
25 Verb Nominalization: -tzea
Nominalization: -tzeko
Nominalization: -tzera
Nominalization: -tzean
26 Case Nongo
27 Verb Synth: pres ibili
28 Case Norentzat
29 Verb Synth: pres jakin
30 Case Zertaz
31 Prefix ba- badakit, banoa, etc.
32 Verb Nor-Nori (3rd pr) gustatu, iruditu
Case Nori
Pronouns Nori
Demonstratives Nori
Determiners Nori bat, batzu, asko, gutxi, zenbat, zer, zein
33 Verb Compound: continuous ari, egon, ibili
34 Determiners non, nora, nondik bat, batzu, asko, gutxi, zenbat, zer, zein
35 Demonstratives non, nora, nondik hau, hori, hura, hauek, horiek, haiek
36 Verb Modal nahi, behar
37 Determiners Nork bat, batzu, asko, gutxi, zenbat, zer, zein
Determiners Noren bat, batzu, asko, gutxi, zenbat, zer, zein
38 Phrase baino lehen, eta gero, aurretik, ondoren
39 Vocab Months
40 Verb Other uses of future (guessing, proposing idea)
41 Vocab Weather ari du, egin du
gose naiz vs goseak nago, hotz naiz/hotzak nago
Exclamation Ze, Hau ze hotza!, Hau beroa!
42 Verb Jakin/Ikasi + -tzen
Adverb Gabe With verbs and nouns
43 Verb Intransitive/Transitive Verbs with diff. meanings (altzatu du, altxatu da)
Passive Using izan with transitive verbs
44 Adverb Location: non atzean, artean, azpian, etc.
Case Noraino
45 Verb Imperative Ekarri! Esan!
46 Verb Compound sentences Barrura sartu eta ikusten dute. (first verb is dictionary form)
47 Adverb Location: nora, nondik atzean, artean, azpian, etc.
48 Verb Synth: past izan
Verb Cannot ezin, ez jakin
49 Suffix: Verb -(e)la/nik etorriko dela esan du, ez du esan etorriko denik
Verb Nor-Nori-Nork (3rd pr) diot, dio, diogu / diet, die, diegu ... etc.
50 Verb Nor-Nork Full (nau, nauzu, zaitut, zaitu, gaitu, gaituzu, etc.)
51 Suffix: Determiner -bait some... (zerbait, norbait, noizbait, nonbait)
Prefix: Determiner -i ezer, inor, inoiz, inon
52 Pronouns Reflexive bera, bere
Adverb baietz, ezetz
53 Suffix: Verb -(e)lako
54 Prefix: Verb Condition ba-
55 Suffix: Verb -(e)n "whether": Ez dakit non dagoen
56 Verb Nor-Nori-Nork: pres Full
57 Suffix: Verb -(e)nean
Suffix: Noun -txo
58 Verb Synth: past ukan: 3rd pr
59 Suffix: Verb Reported speech Review: -tzeko, -ela, -en
60 Verb Galdegaia egin: galdu egin gara
61 Numbers Ordinal lehen, bigarren, hirugarren, etc.
62 Verb Near Past vs Remote Past Ikusi dut vs. ikusi nuen
63 Verb Nor-Nori-Nork: past 3rd pr NORI (nion, zion, genion)
64 Suffix: Adjective -ago, -en, -rik + -en, -etatik/etako + -en, -egi Comparative, Superlative, baino
65 Verb habitual past joaten nintzen
66 Verb Conditional: Nor-Nork 3rd pr: nuke, luke, genuke, etc.
67 Verb Conditional: Nor nintzateke, litzateke, ginateke, etc.
68 Verb Hypothetic: Nor/Nor-Nork (3rd pr) banintz, balitz, bagina/ban(it)u, bal(it)u, bagen(it)u
69 Suffix: Noun -gatik because of
70 Verb Synth: past egon
71 Verb Synth: past etorri
72 Verb Synth: past joan
73 Verb Synth: past ibili
74 Verb Synth: past eduki
75 Verb Synth: past jakin
76 Suffix: Verb -nez Nire ama esaten duenez...
77 Article Inclusive -ok
78 Verb Future-past conjectures in the past: Non dago? Ez dakit, norbaitek hartuko zuen.
79 Verb Nor-Nori-Nork: past Full
80 Suffix: Verb -ez (gero) Hori esanez, Euskara ikasiz gero
Suffix: Verb -ela(rik) (circumstantial) As/When ...: Bere etxera zatorrela, basotik zihoala, etc.
81 Phrase Ez ..., ... baizik
82 Suffix: Verb -en (relative clauses) joan den gizona, ikusi duzun haurra, nahi duena egiten du
83 Verb Synth: pres potential (nor) naiteke, daiteke, gaitezke, etc.
84 Phrase nahiz eta, arren, ba... ere nahiz eta nekatuta egon, nekatuta dagoen arren, nekatuta badago ere
85 Verb Synth: pres potential (nor-nork) 3rd pr: dezaket, dezake, dezakegu
86 Case Zertarako (?) Lore hau etxerako erosi ditut
Case norengan, norengana, norengandik
87 Adverb Comparison: bezain, beste Ni zu bezain indartsua naiz, nik ez dut zu beste irabazten
Adverb hain, hainbeste Ez da hain polita
88 Verb Indirect commands: -t(z)ea Zuk hori esatea nahi dut
89 Verb Potential Ahal
90 Verb Synth: 2nd conditional (nor) ninteke, liteke, gintezke
Verb Synth: 2nd conditional (nor-nork) 3rd pr: nezake, lezake, genezake
91 Case Norantz Toward: iparralderantz
92 Particle al etorri al da?
Suffix: Verb -a (perfect) iritsia naiz, etorriak gara, ikusiak dituzu
93 Verb Imperative: Nor zaitez, zaitezte, gaitezen
Verb Imperative: Nor-Nork ezazu, ezazue, dezagun
Verb Synth: Imperative zatoz(te), zoaz(te), zaude(te), emazu(e), ekarzu(e), esazu(e)
94 Verb Synth: pres esan (diot, dio, diogu)
Verb Synth: past esan (nioen, zioen, genioen)
95 Suffix: Verb Object of nominalization etxea erostea/etxe
Verb Synth: pres irudi: dirudit, dirudi, dirudigu, etc
96 Word formation tasun, keria, keta, (tzai)le, zale, tar, tegi, dun, etc.
97 Particle ote
Particle omen
Particle ohi
98 Verb Participle: -a, -ta, -(r)ik irekia dago, irekita dago, irekirik dago
Suffix: Verb -tako, -(r)iko galdutako galtzeak
99 Ellipsis Leaving out the verb
100 Adjective/Adverb repetition polit-polita, bakar-bakarra
101 Phrase gero eta -ago, ahalik eta -en Gero eta handiagoa, ahalik eta handiena
Verb 2nd conditional (nor-nork) Full (nintzake, nintzakezu, gintzake, gintzakezu)
Verb 2nd conditional (nor-nori) Full (nenkizkizuke, lekidake, lekizkizuekete)
Verb 2nd conditional (nor-nori-nork) Full (niezaioke, niezazuke, niezazuekete)
Verb Potential: past same as 2nd conditional +(e)n?
Verb Conditional: Nor-Nork Full (ninduke, nindukezu, zintuzket, etc.)
Verb Conditional: Nor-Nori Full (nintzaioke, nintzaizkizuke, litzaioke, etc.)
Verb Conditional: Nor-Nori-Nork Full (nizuke, nizuekete, nioke, lidake, etc.)
Verb Subjunctive: Nor nadin/la, dadin/la, gaitezen/la
Verb Subjunctive: Nor-Nork dezadan/la, dezan/la
Phrase orain dela..., duela ... ... ago
Pronouns hi
Verb Imperative: Nor-Nork Full (including subjunctive forms)
Verb Imperative: Nor-Nori-Nork Full (iezadazu, iezaiozu, etc.)
Verb Imperative: Nor-Nori Used?? zakizkit, zakizkigute
Verb Synth: Imperative Eman: Nor-Nori-Nork
Suffix: Verb bait- since, as: etorri baita
Conjunction Edo vs Ala
Adverb ere Baita ... ere, ... ere bai, etc.
Pronouns neu, zeu, etc
Verb participle + izan da, izan zen
Time Partial hours 1:30, 2:15, 20 to 4, etc.
Verb Subjunctive: Past All forms
Then I'd like to find/hire someone to record all the sentences. I'm not sure where exactly to go from there. I've been thinking of putting together an Android app with detailed information on all the sentences that lets you group them together to make an audio Pimsleur sort of deal, but with the audio I've already got enough to use as a simple English->Basque audio course. There aren't really any audio courses for Basque that I know of, so I'm hoping I can help fill that gap and provide the course, and the tools I used to put it together, for free.
1 x
-
- Black Belt - 3rd Dan
- Posts: 3533
- Joined: Thu Jul 30, 2015 11:04 am
- Location: Scotland
- Languages: English(N)
Advanced: French,Spanish, Scottish Gaelic
Intermediate: Italian, Catalan, Corsican
Basic: Welsh
Dabbling: Polish, Russian etc
- x 8809
- Contact:
Re: [idea] Using large corpus to create language course
The big problem with a corpus-based programme is that you're working with someone else's data, and that's technically not allowed, so you could get shut down at any moment from any number of people involved in any of the DVDs you're working from. I'd be very wary about putting a lot of work into creating something, even something non-commercial, that breaches someone else's copyright.
One piece of advice I have is a tip I picked up from machine learning classes: using probabilities for scoring/rating difficulty.
Consider:
If 0 is "the learner has no chance of getting it right" and 1 is "the learner is certain to get it right", then it becomes easy to combine the ratings of components, by multiplying the probabilities -- multiplying any number of probabilities will never lead to a higher number than the lowest of them, so we are guaranteed to tag complex tasks as being more difficult than simpler ones.
The most naive algorithm would be to rate individual words, and have the final rating of the sentence as just the product of all the words.
e.g.
P(begi) = 0.75
P(politenak) = 0.5
P(dituzu) = 0.6
P(begi politenak) = P(begi) * P(politenak) = 0.375
P(begi politenak dituzu) = P(begi) * P(politenak) * P(dituzu) = 0.225
This is obviously just an approximation. First up, it doesn't need to be a true probability (it wouldn't be practical to count how many times they get it right/wrong, because we assume they learn it better each time they're exposed to the word), we just need the difference between hard and easy things to be relatively consistent.
Also, because I've used a naive algorithm, fragments aren't penalised compared to full sentences, even though full sentences are generally easier to process, but it's up to you to sort out your own scoring algorithm as best fits your data.
I would say, though, that it's a good idea to work modularly and start off with a fairly simple algorithm. A well engineered Python program will let you swap scoring algorithms later, so I recommend keeping it simple to start off with rather than getting bogged down in the best possible scoring -- once the rest of the software is established, you can start improving the algorithm if you need to (and if you design for modularity then you can even A-B test two or more different algorithms against each other).
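As a sketch, the product-of-probabilities scoring described above could look like this; the per-word ratings are the invented ones from the example, and the `default` fallback is my own addition:

```python
# Hypothetical per-word ratings: probability the learner gets each word right.
ratings = {'begi': 0.75, 'politenak': 0.5, 'dituzu': 0.6}

def sentence_rating(sentence, ratings, default=0.5):
    """Multiply per-word ratings; the product never exceeds the lowest factor,
    so more complex material always scores as harder."""
    score = 1.0
    for word in sentence.split():
        score *= ratings.get(word, default)
    return score

print(sentence_rating('begi politenak', ratings))         # 0.375
print(sentence_rating('begi politenak dituzu', ratings))  # ~0.225 (floating point)
```

Swapping in a different `sentence_rating` later is exactly the kind of modular replacement described above.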
3 x
-
- Blue Belt
- Posts: 514
- Joined: Mon Nov 30, 2015 3:35 pm
- Languages: EN (N), ES, ZH
Maintain: EUS, YUE, JP, HAW
Study: TGL, SV
On Hold: RU
- x 953
Re: [idea] Using large corpus to create language course
I'm not too concerned about the copyright stuff; at most I'd be pulling maybe 10 sentences from an entire movie, changing/removing names, and sometimes even changing the sentence around. I don't think you'd even know which movie it was from. And courses like Duolingo commonly use famous phrases from movies as well. Not to mention that these subtitles are basically all fansubs (for better or for worse) of movies that have no Basque audio.
My idea is similar to yours, except I was planning on using word frequency (within the comparatively small corpus) and taking the average frequency of all the words. Taking your example, "begi politenak" and "begi politenak dituzu" would be sorted according to the frequency of dituzu: if dituzu is more common than the average of "begi politenak", the longer sentence would have a lower (better) score; if it were less common, it would have a higher (worse) score. I'm not sure if it's the best way or not. I like your difficulty-rating idea, but ideally that would also do some parsing of the sentence to check for advanced grammar (mostly verb usage), which is perhaps a bit beyond the scope of this project. But I plan on hand-picking these sentences anyway, so it's not too big of a deal. My end goal is to get the sentences professionally recorded and hopefully released under a free license so others can use them how they like. Personally I want to build them into an audio course of sorts. I've got an idea for an app, but before getting carried away I just want to get the base built.
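As a rough sketch of that mean-frequency idea (the counts here are invented, not from the corpus; in this version a higher average frequency counts as easier):

```python
# Invented raw corpus counts; real per-word counts would go here.
freq = {'begi': 120, 'politenak': 15, 'dituzu': 900}

def mean_freq_score(sentence, freq):
    """Average corpus frequency of a sentence's words; higher = more common vocab."""
    words = sentence.split()
    return sum(freq.get(w, 0) for w in words) / len(words)

print(mean_freq_score('begi politenak', freq))          # 67.5
print(mean_freq_score('begi politenak dituzu', freq))   # 345.0: common dituzu raises the average
```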
Thanks a lot for your input!
0 x
-
- Orange Belt
- Posts: 214
- Joined: Sun Feb 14, 2016 5:10 am
- Languages: gibberish (N)
- x 291
Re: [idea] Using large corpus to create language course
I keep my corpora in a big TXT file that has been processed so that every sentence is separated by u'\n'. My index file now uses Python's sqlite3: each unique wordform is mapped to a list of tuples giving the byte offset of its sentence in the text file and its sequence within that sentence.
(Your data is very small, so you don't need to worry about databases and the like; you could just load everything into memory.)
This is my indexer. My operational version works with Arabic script; I quickly wrote some modifications so that it should work with a Latin-script language, but I have not tested them at all.
Code: Select all
#encoding:utf8
import unicodedata
import struct
import os
import sqlite3

# The file to be indexed needs to have its sentences separated by u'\n'.
tdict = {}  # character normalisation dictionary to be used with str.translate()

def indexSubs(filePath):
    global wordIndex
    print('making index...')
    counter = 0
    counter2 = 0
    with open(filePath, 'rb') as f:
        pos = 0
        for line in f:
            words = tokenise(line.decode('utf8', 'ignore'))
            for n, w in enumerate(words):
                if n > 255:  # sequence number is packed into one byte ('B')
                    continue
                if w not in wordIndex:
                    wordIndex[w] = []
                wordIndex[w].append(pos)
                wordIndex[w].append(n)
            pos += len(line)
            counter += 1
            counter2 += 1
            if counter >= 50000:  # progress report
                print(counter2)
                counter = 0

def deDiacritic(text):
    return u''.join([a for a in text if unicodedata.category(a) not in ('Mn', 'Lm')])

neuterCats = set(('Mn', 'Lm'))

def tokenise(verse):
    "returns a list of words from a sentence"
    # if this is changed, indexing should be redone
    verse = verse.translate(tdict)
    curword = []
    wordlist = []
    for chr in verse:
        if unicodedata.category(chr) in neuterCats:
            continue
        if unicodedata.category(chr).startswith('L'):
            curword.append(chr)
        else:
            if len(curword) >= 1:
                wordlist.append(u''.join(curword))
            curword = []
    if len(curword) > 0:
        wordlist.append(u''.join(curword))
    return wordlist

def packIdx(seq):
    # pairs of (4-byte sentence offset, 1-byte sequence number)
    return struct.pack('>' + 'IB' * (len(seq) // 2), *seq)

def unpackIdx(mydata):
    return struct.unpack('>' + 'IB' * (len(mydata) // 5), mydata)

for learnSubsFile in os.listdir(u'.'):
    if learnSubsFile.endswith('.sql'):  # skip index files made on earlier runs
        continue
    wordIndex = {}
    indexSubs(learnSubsFile)
    print('finished compiling index, attempting to save to database')
    db_file = learnSubsFile + '.sql'
    with open(db_file, 'wb') as f:
        pass  # truncate any existing index file
    conn = sqlite3.connect(db_file)
    c = conn.cursor()
    sql = '''create table WORDINDEX(
        WORD TEXT PRIMARY KEY,
        LOCATIONS BLOB);'''
    c.execute(sql)
    sql = '''INSERT INTO WORDINDEX(WORD, LOCATIONS)
        VALUES(?, ?);'''
    for n, word in enumerate(wordIndex):
        if n % 50000 == 0:  # progress report
            print(learnSubsFile, wordIndex[word])
        conn.execute(sql, [word, sqlite3.Binary(packIdx(wordIndex[word]))])
    conn.commit()
    conn.close()
Here's my fetcher. Again, I have quickly edited it to remove some language-specific code, and have not tested it at all.
Code: Select all
#encoding:utf8
import unicodedata
import struct
import sqlite3
from random import random

homeDir = u''  # directory containing the corpus files and their .sql indexes
tdict = {}  # language-specific spelling normaliser
neuterCats = set(('Mn', 'Lm'))

def tokenise(verse):
    "returns a list of words from a sentence"
    # if this is changed, indexing should be redone
    verse = verse.translate(tdict)
    curword = []
    wordlist = []
    for chr in verse:
        if unicodedata.category(chr) in neuterCats:
            continue
        if unicodedata.category(chr).startswith('L'):
            curword.append(chr)
        else:
            if len(curword) >= 1:
                wordlist.append(u''.join(curword))
            curword = []
    if len(curword) > 0:
        wordlist.append(u''.join(curword))
    return wordlist

def mixLists(listOfLists):
    l = []
    while sum([len(a) for a in listOfLists]) > 0:
        for a in listOfLists:
            if len(a) > 0:
                l.append(a.pop(0))
    return l

def packIdx(seq):
    return struct.pack('>' + 'IB' * (len(seq) // 2), *seq)

def unpackIdx(data):
    return struct.unpack('>' + 'IB' * (len(data) // 5), data)

def unpackBuffer(data):
    return struct.unpack_from('>' + 'IB' * (len(data) // 5), data)

class corpusSearch():
    def __init__(self, filename):
        self.learnSubsFile = homeDir + filename
        self.indexFile = self.learnSubsFile + '.sql'
        self.filename = filename
        self.conn = sqlite3.connect(self.indexFile)
        self.c = self.conn.cursor()
        self.corpusText = open(self.learnSubsFile, 'rb')

    def getOneWord(self, word):
        word = word.translate(tdict).strip()
        entry = list(self.c.execute(u"SELECT * FROM WORDINDEX WHERE WORD = ?", [word]))
        if len(entry) > 0:
            locations = unpackBuffer(entry[0][1])
            sent = locations[0:500:2]  # sentence offsets only, capped at 250
            return sent
        else:
            return []

    def getWordLoc(self, word):
        word = word.translate(tdict).strip()
        entry = list(self.c.execute(u"SELECT * FROM WORDINDEX WHERE WORD = ?", [word]))
        if len(entry) > 0:
            locations = unpackBuffer(entry[0][1])
            sent = locations[0:-1:2]
            seq = locations[1::2]
            loc = set(zip(sent, seq))
            return loc
        else:
            return set()

    def getExprLoc(self, expr):
        wordlist = tokenise(expr)
        if len(wordlist) == 1:
            return self.getOneWord(wordlist[0])
        wordLocs = [self.getWordLoc(a) for a in wordlist]
        wordLocs = sorted(enumerate(wordLocs), key=lambda a: len(a[1]))
        firstLoc = {(a[0], a[1] - wordLocs[0][0]) for a in wordLocs[0][1]}
        for nextLoc in wordLocs[1:]:
            firstLoc = {(a[0], a[1] + nextLoc[0]) for a in firstLoc}
            firstLoc = firstLoc.intersection(nextLoc[1])
            if nextLoc == wordLocs[-1]:
                break
            firstLoc = {(a[0], a[1] - nextLoc[0]) for a in firstLoc}
        return [a[0] for a in firstLoc]

    def getSentences(self, expr):
        result = []
        locations = sorted(self.getExprLoc(expr), key=lambda a: random())[:40]
        for a in sorted(locations):
            self.corpusText.seek(a)
            result.append(self.corpusText.readline().decode('utf8').strip()
                          + u'\n' + self.filename + u' ' + str(len(locations)))
        return sorted(result, key=lambda a: random())

coll = corpusSearch('colloquialNovels.u8')
form = corpusSearch('formalNovels.u8')
subs = corpusSearch('OpenSubtitles2016.en-fa.fa')

def getExpr(word):
    sents = []
    for corpus in [coll, form, subs]:
        sents += corpus.getSentences(word)
    return sorted([[a] for a in sents], key=lambda a: random())
I got my subs from http://opus.lingfil.uu.se/OpenSubtitles2016.php
On a similar project, I had a function that took a sentence and returned the frequency of the rarest word in it, then I sorted the corpus by this value. But I see I should have made arrangements to learn concrete words earlier: when I started picking sentences with words from the bottom of the frequency list, I ended up with a salad of pronouns, prepositions, conjunctions, auxiliary verbs... that was really baffling.
3 x
-
- Black Belt - 3rd Dan
- Posts: 3533
- Joined: Thu Jul 30, 2015 11:04 am
- Location: Scotland
- Languages: English(N)
Advanced: French,Spanish, Scottish Gaelic
Intermediate: Italian, Catalan, Corsican
Basic: Welsh
Dabbling: Polish, Russian etc
- x 8809
- Contact:
Re: [idea] Using large corpus to create language course
crush wrote:My idea is similar to yours, except I was planning on using word frequency (within the comparatively small corpus) and taking the average frequency of all words. Taking your example, "begi politenak" and "begi politenak dituzu" would be sorted according to the frequency of dituzu. If dituzu is more common than the average of "begi politenak" then it would have a lower (better) score; if it were less common, then it would have a higher (worse) score. I'm not sure if it's the best way or not. I like your difficulty rating idea, but ideally that would also do some parsing of the sentence to check for advanced grammar (mostly verb usage), which is perhaps a bit beyond the scope of this project.
Ah, but if you use a generic scoring, you can substitute for a different algorithm later -- start with a simple one, then improve.
Can I suggest a slight variation...? I was talking about probabilities, but if you're looking at frequencies, you could do fractions.
The score of a word/token will then be the number of times it occurs divided by the total number of words in the corpus, giving a number between 0 and 1, with higher numbers being more frequent. This means all your numbers are going to be in the lower ranges, and it might seem a bit silly, but there's a reason the experts do it this way.
(And actually, frequency is probability anyway: if "eta" is 1/10 of all words in your corpus, you've got a 1-in-10 chance of getting "eta" if you pick a word at random.)
I'm not convinced that averaging your scores is a good idea. Imagine you have a sentence like the following (in English, for clarity):
In my cupboard there are three big green postillions
Because there are so many common words in the sentence, the fact that there is one extremely rare word in there is completely obscured from the scoring.
Now I know that's an extreme example, and you're not going to include examples with such rare vocab, but I'm just trying to demonstrate the concept.
Also, if you're using averages of frequency, then the following sentence is actually ranked as easier/better than the shorter, simpler one above:
I think he said that there were three big green postillions in his cupboard.
...and that problem is a particular issue when going by frequency, because conjunctions are very common words, but only occur in complex sentences, and so you're going to inadvertently favour those complex sentences.
I suppose I should admit that I actually made the same mistake myself, and it was immediately obvious that I was just getting the longest sentences in my test set every time... and most often complex sentences with "chi" (that).
Anyhow, as a general rule, a task should be considered at least as difficult as the most difficult element in it.
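That rule of thumb is easy to encode as a sketch: score a sentence by its weakest element instead of the average. The per-word ratings below are invented for illustration:

```python
# Hypothetical ratings: probability the learner gets each word right.
ratings = {'in': 0.99, 'my': 0.98, 'cupboard': 0.7, 'there': 0.97,
           'are': 0.99, 'three': 0.9, 'big': 0.9, 'green': 0.85,
           'postillions': 0.05}

def min_rating(sentence, ratings, default=0.5):
    """A sentence is at most as easy as its hardest word."""
    return min(ratings.get(w, default) for w in sentence.lower().split())

print(min_rating('In my cupboard there are three big green postillions', ratings))  # 0.05
```

However many common words surround it, the rare word dominates the score, which is exactly the property averaging loses.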
2 x
-
- Blue Belt
- Posts: 514
- Joined: Mon Nov 30, 2015 3:35 pm
- Languages: EN (N), ES, ZH
Maintain: EUS, YUE, JP, HAW
Study: TGL, SV
On Hold: RU
- x 953
Re: [idea] Using large corpus to create language course
DangerDave2010 wrote:I keep my corpora on a big TXT file, that has been processed so that every sentence is separated by u'\n'. My index file now uses Python's sqlite3. Each unique wordform is mapped to a list of tuples indicating the offset of its sentence in the text file, and its sequence within that sentence.
(Your data is very small, you don't need to worry about databases and stuff, you may just load everything to memory)
Thanks for your input. I just got things mostly cleaned up (removed those "translated by" lines and other similar things), and it turns out that my original estimate wasn't too far off: I've got 641,345 lines of dialog and just under 3.2 million words.
Cainntear wrote:I'm not convinced that averaging your scores is a good idea. Imagine you have a sentence like the following (in English, for clarity):
In my cupboard there are three big green postillions
Because there are so many common words in the sentence, the fact that there is one extremely rare word in there is completely obscured from the scoring.
...
Anyhow, as a general rule, a task should be considered at least as difficult as the most difficult element in it.
That's true and definitely something I'll have to put some thought into. For now, though, I think I'll be hand-picking sentences anyway, so if there are words that are too difficult/rare, I can replace them with more common words myself. I'm mostly looking to build a corpus of grammatical sentences (though there are always typos and so on) that I can use to search for sentences with the grammar point I'm looking to cover. It can all be fine-tuned as I get further along, I suppose.
EDIT: My friend just showed me the Goenkale (a Basque TV show) Corpus, with 20 years' worth of episodes (2,994 episodes altogether). You can search it by morphology (case, tense, verb type, mood, even whether the hika form is male or female). The only thing is that it doesn't give frequency information, but this is the sort of tool I had in mind for this project. The Egungo Testuen Corpusa does have frequency data, though; it's drawn from post-2000 literature.
EDIT2: Just noticed the "maiztasunak" tab, so there is frequency data as well! Also, words are organized by lemma.
1 x