I've been wanting to make a simple course/tool to help me in my Basque studies for a while now. I know some members have done similar things (Cainntear posted something similar not that long ago).
I've compiled a list of about 550 movie subtitles, a few short films, and another 200 or so pulled from TV shows. My idea is to write some simple scripts, preferably in Python (I might make something more robust in C later), to strip the subtitles of unnecessary data (time stamps, HTML markup, etc.) and organize them somehow, perhaps putting them into a database. I'd like to preserve the order of the sentences to make it easier to view the next sentence if I wanted (e.g. if the sentence was "How old are you?" I could also pull the next sentence, "I'm 33."). I also want to use the corpus to build a frequency dictionary; I could write a simple script myself, but there are also other more robust tools available.
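For illustration, the stripping step might look something like this in Python. This is a rough sketch (the .srt layout assumptions and the `clean_srt` helper are mine, not a tested pipeline):

```python
import re

def clean_srt(text):
    """Strip an .srt subtitle file down to its ordered dialogue lines."""
    lines = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.isdigit():                   # cue number
            continue
        if '-->' in line:                    # time stamp line
            continue
        line = re.sub(r'<[^>]+>', '', line)  # HTML-ish tags (<i>, <b>, ...)
        lines.append(line)
    return lines

sample = """1
00:00:01,000 --> 00:00:03,000
<i>Zenbat urte dituzu?</i>

2
00:00:03,500 --> 00:00:05,000
33 urte ditut."""

print(clean_srt(sample))  # ['Zenbat urte dituzu?', '33 urte ditut.']
```

Because the output list keeps the original order, the "next sentence" lookup is just the next list index (or the next row id, if the lines go into a database).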
So my end goal is this: a set of sentences with each word tagged with its frequency (within my small corpus, and excluding capitalized words to avoid proper nouns skewing things), which I can use to weight sentences based on the frequency of their words.
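As a sketch of that frequency-dictionary step, a count that skips capitalised tokens (a rough proper-noun filter, which also drops sentence-initial words) could be as simple as this; the sample sentences are invented:

```python
from collections import Counter

def build_freq(sentences):
    """Count word frequencies, ignoring capitalised tokens (rough proper-noun filter)."""
    freq = Counter()
    for sent in sentences:
        for word in sent.split():
            word = word.strip('.,!?"')
            if word and not word[0].isupper():
                freq[word] += 1
    return freq

corpus = ["Nora zoaz orain ?", "Mikel etxera doa .", "etxera noa orain ."]
freq = build_freq(corpus)
print(freq['etxera'])   # 2
print('Mikel' in freq)  # False: capitalised, treated as a proper noun
```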
I've set up a list of about 160 of the most common grammar topics that I want to cover, so I can search for terms related to each grammar topic, pull sentences with the most common vocabulary, and then make a course that gradually introduces new grammar and vocab.
The issue is I'm not sure what the best way to organize the data would be. Would MySQL be able to reasonably handle such a large dataset? Some searching online led me to Apache Spark, but I'm not sure if it's really worth the trouble.
[idea] Using large corpus to create language course
-
- Blue Belt
- Posts: 514
- Joined: Mon Nov 30, 2015 3:35 pm
- Languages: EN (N), ES, ZH
Maintain: EUS, YUE, JP, HAW
Study: TGL, SV
On Hold: RU
- x 953
- MorkTheFiddle
- Black Belt - 2nd Dan
- Posts: 2141
- Joined: Sat Jul 18, 2015 8:59 pm
- Location: North Texas USA
- Languages: English (N). Read (only) French and Spanish. Studying Ancient Greek. Studying a bit of Latin. Once studied Old Norse. Dabbled in Catalan, Provençal and Italian.
- Language Log: https://forum.language-learners.org/vie ... 11#p133911
- x 4883
Re: [idea] Using large corpus to create language course
You are planning a very ambitious and praiseworthy project. How large is your dataset, and what will your front end be? Probably I am telling you what you already know, but I would create a working mini-model of what I wanted in whatever database is convenient, making sure that I can do with the data whatever I want. Here is one discussion of the potential size and flexibility of a MySQL table: http://stackoverflow.com/questions/48633/maximum-table-size-for-a-mysql-database. Note that it's a rather old thread.
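For instance, a throwaway mini-model in SQLite (which ships with Python) would let you test the "next sentence" lookup before committing to MySQL. The table and column names here are just placeholders:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('''CREATE TABLE sentences(
                    id INTEGER PRIMARY KEY,   -- preserves corpus order
                    film TEXT,
                    text TEXT)''')
rows = [(1, 'demo', 'Zenbat urte dituzu?'),
        (2, 'demo', '33 urte ditut.')]
conn.executemany('INSERT INTO sentences VALUES (?, ?, ?)', rows)

# Fetch a sentence plus the one that follows it in the same film.
cur = conn.execute('''SELECT text FROM sentences
                      WHERE film = ? AND id >= ? ORDER BY id LIMIT 2''',
                   ('demo', 1))
print([r[0] for r in cur])  # ['Zenbat urte dituzu?', '33 urte ditut.']
```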
1 x
Many things which are false are transmitted from book to book, and gain credit in the world. -- attributed to Samuel Johnson
-
- Blue Belt
- Posts: 514
- Joined: Mon Nov 30, 2015 3:35 pm
- Languages: EN (N), ES, ZH
Maintain: EUS, YUE, JP, HAW
Study: TGL, SV
On Hold: RU
- x 953
Re: [idea] Using large corpus to create language course
I'm going to be diving into things today after work, but I'd say a conservative estimate is that the films average 1,000 lines each and the TV episodes around 500 lines. So that's 550*1,000 + 200*500 = approx. 650,000 lines. If the average line has 5 words, that's (a conservative) 3.25 million words.
Thanks for the ideas. I've read through that discussion and a few others on Stack Overflow, and it seems I'm right around that borderline. I think it's as you said: build something to work with a smaller portion of my data first and see how that goes.
0 x
- Seneca
- Green Belt
- Posts: 268
- Joined: Sat Jun 11, 2016 5:08 pm
- Location: Eurasia
- Languages: English (N); 日本語 (beginner)
- x 352
Re: [idea] Using large corpus to create language course
How will you be using this to learn Basque? Just watching the media and reading in dual subtitles?
0 x
-
- Blue Belt
- Posts: 514
- Joined: Mon Nov 30, 2015 3:35 pm
- Languages: EN (N), ES, ZH
Maintain: EUS, YUE, JP, HAW
Study: TGL, SV
On Hold: RU
- x 953
Re: [idea] Using large corpus to create language course
I'm working on it with a Basque friend, so my idea is that we will put together a list of sentences which introduce the most common 3,000-4,000 words (from the corpus, at least) and cover the different grammar points that I put together (the ones at the end are still unorganized):
Code: Select all
Ikasgai (Lesson)   Gramatika (Grammar)   Mota (Type)   Informazio (Information)   Laburdurak (Abbreviations)
1 Verb Synth: pres Izan synth synthetic
Pronouns Nor ni, hura, gu, zu, zuek, haiek sg singular
Case Nor article -a/-ak pl plural
2 Demonstratives Nor: sg hau, hori, hura pres present
3 Question Zer perf perfect
4 Question Yes/No obj object
5 Verb Synth: pres Ukan: sg obj (dut, du, dugu, duzu, duzue, dute) pr person
Case Nork
6 Demonstratives Nork: sg honek, horrek, hark
7 Numbers 1-10 2nd conditional potential hypothetical
Mugagabe
Question Zenbat
8 Numbers 11-20
9 Demonstratives Nor/Nork: pl hauek, horiek, haiek
10 Time 1:00, 2:00, 3:00, etc.
11 Verb Synth: pres Egon
12 Case Non (all forms: sg, pl, muga, prop. nouns)
13 Demonstratives Location hemen, hor, han
14 Verb Compare egon vs izan
15 Question zein
Pronouns Possessive nire, haren, gure, zure, zuen, haien
16 Verb Compound: pres perf (nor) etorri, joan
17 Verb Compound: pres perf (nork) ikusi, ikasi
18 Verb Synth: pres Ukan: pl obj (ditut, ditu, ditugu, dituzu, dituzue)
19 Case Noren
20 Suffix: Verb -ta, -a/-ak beteta, irekita, pagatuak
21 Suffix: Noun -rik ez dago ardorik
22 Case Norekin
23 Verb Compound: habitual participle afaltzen, ikasten, aurkezten
Verb Compound: future -ko/-go
24 Verb Synth: pres joan
Verb Synth: pres etorri
Case Nora/Nondik
25 Verb Nominalization: -tzea
Nominalization: -tzeko
Nominalization: -tzera
Nominalization: -tzean
26 Case Nongo
27 Verb Synth: pres ibili
28 Case Norentzat
29 Verb Synth: pres jakin
30 Case Zertaz
31 Prefix ba- badakit, banoa, etc.
32 Verb Nor-Nori (3rd pr) gustatu, iruditu
Case Nori
Pronouns Nori
Demonstratives Nori
Determiners Nori bat, batzu, asko, gutxi, zenbat, zer, zein
33 Verb Compound: continuous ari, egon, ibili
34 Determiners non, nora, nondik bat, batzu, asko, gutxi, zenbat, zer, zein
35 Demonstratives non, nora, nondik hau, hori, hura, hauek, horiek, haiek
36 Verb Modal nahi, behar
37 Determiners Nork bat, batzu, asko, gutxi, zenbat, zer, zein
Determiners Noren bat, batzu, asko, gutxi, zenbat, zer, zein
38 Phrase baino lehen, eta gero, aurretik, ondoren
39 Vocab Months
40 Verb Other uses of future (guessing, proposing idea)
41 Vocab Weather ari du, egin du
gose naiz vs goseak nago, hotz naiz/hotzak nago
Exclamation Ze, Hau ze hotza!, Hau beroa!
42 Verb Jakin/Ikasi + -tzen
Adverb Gabe With verbs and nouns
43 Verb Intransitive/Transitive Verbs with diff. meanings (altzatu du, altxatu da)
Passive Using izan with transitive verbs
44 Adverb Location: non atzean, artean, azpian, etc.
Case Noraino
45 Verb Imperative Ekarri! Esan!
46 Verb Compound sentences Barrura sartu eta ikusten dute. (first verb is dictionary form)
47 Adverb Location: nora, nondik atzean, artean, azpian, etc.
48 Verb Synth: past izan
Verb Cannot ezin, ez jakin
49 Suffix: Verb -(e)la/nik etorriko dela esan du, ez du esan etorriko denik
Verb Nor-Nori-Nork (3rd pr) diot, dio, diogu / diet, die, diegu ... etc.
50 Verb Nor-Nork Full (nau, nauzu, zaitut, zaitu, gaitu, gaituzu, etc.)
51 Suffix: Determiner -bait some... (zerbait, norbait, noizbait, nonbait)
Prefix: Determiner -i ezer, inor, inoiz, inon
52 Pronouns Reflexive bera, bere
Adverb baietz, ezetz
53 Suffix: Verb -(e)lako
54 Prefix: Verb Condition ba-
55 Suffix: Verb -(e)n "whether": Ez dakit non dagoen
56 Verb Nor-Nori-Nork: pres Full
57 Suffix: Verb -(e)nean
Suffix: Noun -txo
58 Verb Synth: past ukan: 3rd pr
59 Suffix: Verb Reported speech Review: -tzeko, -ela, -en
60 Verb Galdegaia egin: galdu egin gara
61 Numbers Ordinal lehen, bigarren, hirugarren, etc.
62 Verb Near Past vs Remote Past Ikusi dut vs. ikusi nuen
63 Verb Nor-Nori-Nork: past 3rd pr NORI (nion, zion, genion)
64 Suffix: Adjective -ago, -en, -rik + -en, -etatik/etako + -en, -egi Comparative, Superlative, baino
65 Verb habitual past joaten nintzen
66 Verb Conditional: Nor-Nork 3rd pr: nuke, luke, genuke, etc.
67 Verb Conditional: Nor nintzateke, litzateke, ginateke, etc.
68 Verb Hypothetic: Nor/Nor-Nork (3rd pr) banintz, balitz, bagina/ban(it)u, bal(it)u, bagen(it)u
69 Suffix: Noun -gatik because of
70 Verb Synth: past egon
71 Verb Synth: past etorri
72 Verb Synth: past joan
73 Verb Synth: past ibili
74 Verb Synth: past eduki
75 Verb Synth: past jakin
76 Suffix: Verb -nez Nire ama esaten duenez...
77 Article Inclusive -ok
78 Verb Future-past conjectures in the past: Non dago? Ez dakit, norbaitek hartuko zuen.
79 Verb Nor-Nori-Nork: past Full
80 Suffix: Verb -ez (gero) Hori esanez, Euskara ikasiz gero
Suffix: Verb -ela(rik) (circumstantial) As/When ...: Bere etxera zatorrela, basotik zihoala, etc.
81 Phrase Ez ..., ... baizik
82 Suffix: Verb -en (relative clauses) joan den gizona, ikusi duzun haurra, nahi duena egiten du
83 Verb Synth: pres potential (nor) naiteke, daiteke, gaitezke, etc.
84 Phrase nahiz eta, arren, ba... ere nahiz eta nekatuta egon, nekatuta dagoen arren, nekatuta badago ere
85 Verb Synth: pres potential (nor-nork) 3rd pr: dezaket, dezake, dezakegu
86 Case Zertarako (?) Lore hau etxerako erosi ditut
Case norengan, norengana, norengandik
87 Adverb Comparison: bezain, beste Ni zu bezain indartsua naiz, nik ez dut zu beste irabazten
Adverb hain, hainbeste Ez da hain polita
88 Verb Indirect commands: -t(z)ea Zuk hori esatea nahi dut
89 Verb Potential Ahal
90 Verb Synth: 2nd conditional (nor) ninteke, liteke, gintezke
Verb Synth: 2nd conditional (nor-nork) 3rd pr: nezake, lezake, genezake
91 Case Norantz Toward: iparralderantz
92 Particle al etorri al da?
Suffix: Verb -a (perfect) iritsia naiz, etorriak gara, ikusiak dituzu
93 Verb Imperative: Nor zaitez, zaitezte, gaitezen
Verb Imperative: Nor-Nork ezazu, ezazue, dezagun
Verb Synth: Imperative zatoz(te), zoaz(te), zaude(te), emazu(e), ekarzu(e), esazu(e)
94 Verb Synth: pres esan (diot, dio, diogu)
Verb Synth: past esan (nioen, zioen, genioen)
95 Suffix: Verb Object of nominalization etxea erostea/etxe
Verb Synth: pres irudi: dirudit, dirudi, dirudigu, etc
96 Word formation tasun, keria, keta, (tzai)le, zale, tar, tegi, dun, etc.
97 Particle ote
Particle omen
Particle ohi
98 Verb Participle: -a, -ta, -(r)ik irekia dago, irekita dago, irekirik dago
Suffix: Verb -tako, -(r)iko galdutako galtzeak
99 Ellipsis Leaving out the verb
100 Adjective/Adverb repetition polit-polita, bakar-bakarra
101 Phrase gero eta -ago, ahalik eta -en Gero eta handiagoa, ahalik eta handiena
Verb 2nd conditional (nor-nork) Full (nintzake, nintzakezu, gintzake, gintzakezu)
Verb 2nd conditional (nor-nori) Full (nenkizkizuke, lekidake, lekizkizuekete)
Verb 2nd conditional (nor-nori-nork) Full (niezaioke, niezazuke, niezazuekete)
Verb Potential: past same as 2nd conditional +(e)n?
Verb Conditional: Nor-Nork Full (ninduke, nindukezu, zintuzket, etc.)
Verb Conditional: Nor-Nori Full (nintzaioke, nintzaizkizuke, litzaioke, etc.)
Verb Conditional: Nor-Nori-Nork Full (nizuke, nizuekete, nioke, lidake, etc.)
Verb Subjunctive: Nor nadin/la, dadin/la, gaitezen/la
Verb Subjunctive: Nor-Nork dezadan/la, dezan/la
Phrase orain dela..., duela ... ... ago
Pronouns hi
Verb Imperative: Nor-Nork Full (including subjunctive forms)
Verb Imperative: Nor-Nori-Nork Full (iezadazu, iezaiozu, etc.)
Verb Imperative: Nor-Nori Used?? zakizkit, zakizkigute
Verb Synth: Imperative Eman: Nor-Nori-Nork
Suffix: Verb bait- since, as: etorri baita
Conjunction Edo vs Ala
Adverb ere Baita ... ere, ... ere bai, etc.
Pronouns neu, zeu, etc
Verb participle + izan da, izan zen
Time Partial hours 1:30, 2:15, 20 to 4, etc.
Verb Subjunctive: Past All forms
Then I'd like to find/hire someone to record all the sentences. I'm not sure where exactly to go from there. I've been thinking of putting together an Android app with detailed information on all the sentences that lets you group them together to make an audio Pimsleur sort of deal, but with the audio I've already got enough to use as a simple English->Basque audio course. There aren't really any audio courses for Basque that I know of, so I'm hoping I can help fill that gap and provide the course, and the tools I used to put it together, for free.
1 x
-
- Black Belt - 3rd Dan
- Posts: 3533
- Joined: Thu Jul 30, 2015 11:04 am
- Location: Scotland
- Languages: English(N)
Advanced: French,Spanish, Scottish Gaelic
Intermediate: Italian, Catalan, Corsican
Basic: Welsh
Dabbling: Polish, Russian etc
- x 8809
- Contact:
Re: [idea] Using large corpus to create language course
The big problem with a corpus-based programme is that you're working with someone else's data, and that's technically not allowed, so you could get shut down at any moment from any number of people involved in any of the DVDs you're working from. I'd be very wary about putting a lot of work into creating something, even something non-commercial, that breaches someone else's copyright.
One piece of advice I have is a tip I picked up from machine learning classes: using probabilities for scoring/rating difficulty.
Consider:
If 0 is "the learner has no chance of getting it right" and 1 is "the learner is certain to get it right", then it becomes easy to combine the ratings of components, by multiplying the probabilities -- multiplying any number of probabilities will never lead to a higher number than the lowest of them, so we are guaranteed to tag complex tasks as being more difficult than simpler ones.
The most naive algorithm would be to rate individual words, and have the final rating of the sentence as just the product of all the words.
e.g.
P(begi) = 0.75
P(politenak) = 0.5
P(dituzu) = 0.6
P(begi politenak) = P(begi) * P(politenak) = 0.375
P(begi politenak dituzu) = P(begi) * P(politenak) * P(dituzu) = 0.225
This is obviously just an approximation. First up, it doesn't need to be a true probability (it wouldn't be practical to count how many times they get it right/wrong, because we assume they learn it better each time they're exposed to the word), we just need the difference between hard and easy things to be relatively consistent.
Also, because I've used a naive algorithm, fragments aren't penalised compared to full sentences, even though full sentences are generally easier to process, but it's up to you to sort out your own scoring algorithm as best fits your data.
I would say, though, that it's a good idea to work modularly and start off with a fairly simple algorithm. A well engineered Python program will let you swap scoring algorithms later, so I recommend keeping it simple to start off with rather than getting bogged down in the best possible scoring -- once the rest of the software is established, you can start improving the algorithm if you need to (and if you design for modularity then you can even A-B test two or more different algorithms against each other).
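As a sketch, the product-of-probabilities scoring described above could look like this; the per-word ratings are the invented ones from the example, and the `default` fallback is my own addition:

```python
# Hypothetical per-word ratings: probability the learner gets each word right.
ratings = {'begi': 0.75, 'politenak': 0.5, 'dituzu': 0.6}

def sentence_rating(sentence, ratings, default=0.5):
    """Multiply per-word ratings; the product never exceeds the lowest factor,
    so more complex material always scores as harder."""
    score = 1.0
    for word in sentence.split():
        score *= ratings.get(word, default)
    return score

print(sentence_rating('begi politenak', ratings))         # 0.375
print(sentence_rating('begi politenak dituzu', ratings))  # ~0.225 (floating point)
```

Swapping in a different `sentence_rating` later is exactly the kind of modular replacement described above.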
3 x
-
- Blue Belt
- Posts: 514
- Joined: Mon Nov 30, 2015 3:35 pm
- Languages: EN (N), ES, ZH
Maintain: EUS, YUE, JP, HAW
Study: TGL, SV
On Hold: RU
- x 953
Re: [idea] Using large corpus to create language course
I'm not too concerned about the copyright stuff; at most I'd be pulling maybe 10 sentences from an entire movie, changing/removing names, and sometimes even changing the sentence around. I don't think you'd even know which movie it was from. And courses like Duolingo commonly use famous phrases from movies as well. Not to mention that these subtitles are basically all fansubs (for better or for worse) of movies that have no Basque audio.
My idea is similar to yours, except I was planning on using word frequency (within the comparatively small corpus) and taking the average frequency of all the words. Taking your example, "begi politenak" and "begi politenak dituzu" would be sorted according to the frequency of dituzu: if dituzu is more common than the average of "begi politenak", the longer sentence would have a lower (better) score; if it were less common, it would have a higher (worse) score. I'm not sure if it's the best way or not. I like your difficulty-rating idea, but ideally that would also do some parsing of the sentence to check for advanced grammar (mostly verb usage), which is perhaps a bit beyond the scope of this project. But I plan on hand-picking these sentences anyway, so it's not too big of a deal. My end goal is to get the sentences professionally recorded and hopefully released under a free license so others can use them how they like. Personally I want to build them into an audio course of sorts. I've got an idea for an app, but before getting carried away I just want to get the base built.
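As a rough sketch of that mean-frequency idea (the counts here are invented, not from the corpus; in this version a higher average frequency counts as easier):

```python
# Invented raw corpus counts; real per-word counts would go here.
freq = {'begi': 120, 'politenak': 15, 'dituzu': 900}

def mean_freq_score(sentence, freq):
    """Average corpus frequency of a sentence's words; higher = more common vocab."""
    words = sentence.split()
    return sum(freq.get(w, 0) for w in words) / len(words)

print(mean_freq_score('begi politenak', freq))          # 67.5
print(mean_freq_score('begi politenak dituzu', freq))   # 345.0: common dituzu raises the average
```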
Thanks a lot for your input!
0 x
-
- Orange Belt
- Posts: 214
- Joined: Sun Feb 14, 2016 5:10 am
- Languages: gibberish (N)
- x 291
Re: [idea] Using large corpus to create language course
I keep my corpora in a big TXT file that has been processed so that every sentence is separated by u'\n'. My index file now uses Python's sqlite3: each unique wordform is mapped to a list of tuples giving the byte offset of its sentence in the text file and its sequence within that sentence.
(Your data is very small, so you don't need to worry about databases and the like; you could just load everything into memory.)
This is my indexer. My operational version works with Arabic script; I quickly wrote some modifications so that it should work with a Latin-script language, but I have not tested them at all.
Code: Select all
#encoding:utf8
import unicodedata
import struct
import os
import sqlite3

# The file to be indexed needs to have its sentences separated by u'\n'.
tdict = {}  # character normalisation dictionary to be used with str.translate()

def indexSubs(filePath):
    global wordIndex
    print('making index...')
    counter = 0
    counter2 = 0
    with open(filePath, 'rb') as f:
        pos = 0
        for line in f:
            words = tokenise(line.decode('utf8', 'ignore'))
            for n, w in enumerate(words):
                if n > 255:  # sequence number is packed into one byte ('B')
                    continue
                if w not in wordIndex:
                    wordIndex[w] = []
                wordIndex[w].append(pos)
                wordIndex[w].append(n)
            pos += len(line)
            counter += 1
            counter2 += 1
            if counter >= 50000:  # progress report
                print(counter2)
                counter = 0

def deDiacritic(text):
    return u''.join([a for a in text if unicodedata.category(a) not in ('Mn', 'Lm')])

neuterCats = set(('Mn', 'Lm'))

def tokenise(verse):
    "returns a list of words from a sentence"
    # if this is changed, indexing should be redone
    verse = verse.translate(tdict)
    curword = []
    wordlist = []
    for chr in verse:
        if unicodedata.category(chr) in neuterCats:
            continue
        if unicodedata.category(chr).startswith('L'):
            curword.append(chr)
        else:
            if len(curword) >= 1:
                wordlist.append(u''.join(curword))
            curword = []
    if len(curword) > 0:
        wordlist.append(u''.join(curword))
    return wordlist

def packIdx(seq):
    # pairs of (4-byte sentence offset, 1-byte sequence number)
    return struct.pack('>' + 'IB' * (len(seq) // 2), *seq)

def unpackIdx(mydata):
    return struct.unpack('>' + 'IB' * (len(mydata) // 5), mydata)

for learnSubsFile in os.listdir(u'.'):
    if learnSubsFile.endswith('.sql'):  # skip index files made on earlier runs
        continue
    wordIndex = {}
    indexSubs(learnSubsFile)
    print('finished compiling index, attempting to save to database')
    db_file = learnSubsFile + '.sql'
    with open(db_file, 'wb') as f:
        pass  # truncate any existing index file
    conn = sqlite3.connect(db_file)
    c = conn.cursor()
    sql = '''create table WORDINDEX(
        WORD TEXT PRIMARY KEY,
        LOCATIONS BLOB);'''
    c.execute(sql)
    sql = '''INSERT INTO WORDINDEX(WORD, LOCATIONS)
        VALUES(?, ?);'''
    for n, word in enumerate(wordIndex):
        if n % 50000 == 0:  # progress report
            print(learnSubsFile, wordIndex[word])
        conn.execute(sql, [word, sqlite3.Binary(packIdx(wordIndex[word]))])
    conn.commit()
    conn.close()
Here's my fetcher. Again, I have quickly edited it to remove some language-specific code, and have not tested it at all.
Code: Select all
#encoding:utf8
import unicodedata
import struct
import sqlite3
from random import random

homeDir = u''  # directory containing the corpus files and their .sql indexes
tdict = {}  # language-specific spelling normaliser
neuterCats = set(('Mn', 'Lm'))

def tokenise(verse):
    "returns a list of words from a sentence"
    # if this is changed, indexing should be redone
    verse = verse.translate(tdict)
    curword = []
    wordlist = []
    for chr in verse:
        if unicodedata.category(chr) in neuterCats:
            continue
        if unicodedata.category(chr).startswith('L'):
            curword.append(chr)
        else:
            if len(curword) >= 1:
                wordlist.append(u''.join(curword))
            curword = []
    if len(curword) > 0:
        wordlist.append(u''.join(curword))
    return wordlist

def mixLists(listOfLists):
    l = []
    while sum([len(a) for a in listOfLists]) > 0:
        for a in listOfLists:
            if len(a) > 0:
                l.append(a.pop(0))
    return l

def packIdx(seq):
    return struct.pack('>' + 'IB' * (len(seq) // 2), *seq)

def unpackIdx(data):
    return struct.unpack('>' + 'IB' * (len(data) // 5), data)

def unpackBuffer(data):
    return struct.unpack_from('>' + 'IB' * (len(data) // 5), data)

class corpusSearch():
    def __init__(self, filename):
        self.learnSubsFile = homeDir + filename
        self.indexFile = self.learnSubsFile + '.sql'
        self.filename = filename
        self.conn = sqlite3.connect(self.indexFile)
        self.c = self.conn.cursor()
        self.corpusText = open(self.learnSubsFile, 'rb')

    def getOneWord(self, word):
        word = word.translate(tdict).strip()
        entry = list(self.c.execute(u"SELECT * FROM WORDINDEX WHERE WORD = ?", [word]))
        if len(entry) > 0:
            locations = unpackBuffer(entry[0][1])
            sent = locations[0:500:2]  # sentence offsets only, capped at 250
            return sent
        else:
            return []

    def getWordLoc(self, word):
        word = word.translate(tdict).strip()
        entry = list(self.c.execute(u"SELECT * FROM WORDINDEX WHERE WORD = ?", [word]))
        if len(entry) > 0:
            locations = unpackBuffer(entry[0][1])
            sent = locations[0:-1:2]
            seq = locations[1::2]
            loc = set(zip(sent, seq))
            return loc
        else:
            return set()

    def getExprLoc(self, expr):
        wordlist = tokenise(expr)
        if len(wordlist) == 1:
            return self.getOneWord(wordlist[0])
        wordLocs = [self.getWordLoc(a) for a in wordlist]
        wordLocs = sorted(enumerate(wordLocs), key=lambda a: len(a[1]))
        firstLoc = {(a[0], a[1] - wordLocs[0][0]) for a in wordLocs[0][1]}
        for nextLoc in wordLocs[1:]:
            firstLoc = {(a[0], a[1] + nextLoc[0]) for a in firstLoc}
            firstLoc = firstLoc.intersection(nextLoc[1])
            if nextLoc == wordLocs[-1]:
                break
            firstLoc = {(a[0], a[1] - nextLoc[0]) for a in firstLoc}
        return [a[0] for a in firstLoc]

    def getSentences(self, expr):
        result = []
        locations = sorted(self.getExprLoc(expr), key=lambda a: random())[:40]
        for a in sorted(locations):
            self.corpusText.seek(a)
            result.append(self.corpusText.readline().decode('utf8').strip()
                          + u'\n' + self.filename + u' ' + str(len(locations)))
        return sorted(result, key=lambda a: random())

coll = corpusSearch('colloquialNovels.u8')
form = corpusSearch('formalNovels.u8')
subs = corpusSearch('OpenSubtitles2016.en-fa.fa')

def getExpr(word):
    sents = []
    for corpus in [coll, form, subs]:
        sents += corpus.getSentences(word)
    return sorted([[a] for a in sents], key=lambda a: random())
I got my subs from http://opus.lingfil.uu.se/OpenSubtitles2016.php
On a similar project, I had a function that took a sentence and returned the frequency of the rarest word in it, then I sorted the corpus by this value. But I see I should have made arrangements to learn concrete words earlier: when I started picking sentences with words from the bottom of the frequency list, I ended up with a salad of pronouns, prepositions, conjunctions, auxiliary verbs... that was really baffling.
3 x
-
- Black Belt - 3rd Dan
- Posts: 3533
- Joined: Thu Jul 30, 2015 11:04 am
- Location: Scotland
- Languages: English(N)
Advanced: French,Spanish, Scottish Gaelic
Intermediate: Italian, Catalan, Corsican
Basic: Welsh
Dabbling: Polish, Russian etc
- x 8809
- Contact:
Re: [idea] Using large corpus to create language course
crush wrote:My idea is similar to yours, except I was planning on using word frequency (within the comparatively small corpus) and taking the average frequency of all words. Taking your example, "begi politenak" and "begi politenak dituzu" would be sorted according to the frequency of dituzu. If dituzu is more common than the average of "begi politenak" then it would have a lower (better) score; if it were less common, then it would have a higher (worse) score. I'm not sure if it's the best way or not. I like your difficulty rating idea, but ideally that would also do some parsing of the sentence to check for advanced grammar (mostly verb usage), which is perhaps a bit beyond the scope of this project.
Ah, but if you use a generic scoring, you can substitute for a different algorithm later -- start with a simple one, then improve.
Can I suggest a slight variation...? I was talking about probabilities, but if you're looking at frequencies, you could do fractions.
The score of a word/token will then be the number of times it occurs divided by the total number of words in the corpus, giving a number between 0 and 1, with higher numbers being more frequent. This means all your numbers are going to be in the lower ranges, and it might seem a bit silly, but there's a reason the experts do it this way.
(And actually, frequency is probability anyway: if "eta" is 1/10 of all words in your corpus, you've got a 1-in-10 chance of getting "eta" if you pick a word at random.)
I'm not convinced that averaging your scores is a good idea. Imagine you have a sentence like the following (in English, for clarity):
In my cupboard there are three big green postillions
Because there are so many common words in the sentence, the fact that there is one extremely rare word in there is completely obscured from the scoring.
Now I know that's an extreme example, and you're not going to include examples with such rare vocab, but I'm just trying to demonstrate the concept.
Also, if you're using averages of frequency, then the following sentence is actually ranked as easier/better than the shorter, simpler one above:
I think he said that there were three big green postillions in his cupboard.
...and that problem is a particular issue when going by frequency, because conjunctions are very common words, but only occur in complex sentences, and so you're going to inadvertently favour those complex sentences.
I suppose I should admit that I actually made the same mistake myself, and it was immediately obvious that I was just getting the longest sentences in my test set every time... and most often complex sentences with "chi" (that).
Anyhow, as a general rule, a task should be considered at least as difficult as the most difficult element in it.
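That rule of thumb is easy to encode as a sketch: score a sentence by its weakest element instead of the average. The per-word ratings below are invented for illustration:

```python
# Hypothetical ratings: probability the learner gets each word right.
ratings = {'in': 0.99, 'my': 0.98, 'cupboard': 0.7, 'there': 0.97,
           'are': 0.99, 'three': 0.9, 'big': 0.9, 'green': 0.85,
           'postillions': 0.05}

def min_rating(sentence, ratings, default=0.5):
    """A sentence is at most as easy as its hardest word."""
    return min(ratings.get(w, default) for w in sentence.lower().split())

print(min_rating('In my cupboard there are three big green postillions', ratings))  # 0.05
```

However many common words surround it, the rare word dominates the score, which is exactly the property averaging loses.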
2 x
-
- Blue Belt
- Posts: 514
- Joined: Mon Nov 30, 2015 3:35 pm
- Languages: EN (N), ES, ZH
Maintain: EUS, YUE, JP, HAW
Study: TGL, SV
On Hold: RU
- x 953
Re: [idea] Using large corpus to create language course
DangerDave2010 wrote:I keep my corpora on a big TXT file, that has been processed so that every sentence is separated by u'\n'. My index file now uses Python's sqlite3. Each unique wordform is mapped to a list of tuples indicating the offset of its sentence in the text file, and its sequence within that sentence.
(Your data is very small, you don't need to worry about databases and stuff, you may just load everything to memory)
Thanks for your input. I just got things mostly cleaned up (removed those "translated by" lines and other similar things), and it turns out that my original estimate wasn't too far off: I've got 641,345 lines of dialog and just under 3.2 million words.
Cainntear wrote:I'm not convinced that averaging your scores is a good idea. Imagine you have a sentence like the following (in English, for clarity):
In my cupboard there are three big green postillions
Because there are so many common words in the sentence, the fact that there is one extremely rare word in there is completely obscured from the scoring.
...
Anyhow, as a general rule, a task should be considered at least as difficult as the most difficult element in it.
That's true and definitely something I'll have to put some thought into. For now, though, I think I'll be hand-picking sentences anyway, so if there are words that are too difficult/rare, I can replace them with more common words myself. I'm mostly looking to build a corpus of grammatical sentences (though there are always typos and so on) that I can use to search for sentences with the grammar point I'm looking to cover. It can all be fine-tuned as I get further along, I suppose.
EDIT: My friend just showed me the Goenkale (a Basque TV show) Corpus, with 20 years' worth of episodes (2,994 episodes altogether). You can search it by morphology (case, tense, verb type, mood, even whether the hika form is male or female). The only thing is that it doesn't give frequency information, but this is the sort of tool I had in mind for this project. The Egungo Testuen Corpusa does have frequency data, though; it's drawn from post-2000 literature.
EDIT2: Just noticed the "maiztasunak" tab, so there is frequency data as well! Also, words are organized by lemma.
1 x