Here are some examples for Catalan, which has relatively few subtitles and therefore builds much faster than English or French. The models include letter frequencies:
a,0.1086562448
e,0.1085309941
s,0.0702849359
r,0.0634981261
n,0.0555441503
i,0.0549148385
t,0.054184073
o,0.0510150599
l,0.0489069848
u,0.0417314429
.,0.0356342204
m,0.0303537007
d,0.0266670496
c,0.0262300524
p,0.022159369
v,0.0145261661
q,0.0129763105
",",0.0124501683
g,0.0121331999
b,0.0107803585
...
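For anyone who wants to reproduce numbers like these, here is a minimal sketch of how such a table could be computed. It is not the actual build script. The function names and the sample lines are hypothetical, and it assumes that text is lowercased and that whitespace is left out of the single-character counts (the table above has no entry for the space character near the top, which suggests something similar).

```python
import csv
import io
from collections import Counter

def char_frequencies(lines):
    """Return (character, relative frequency) pairs, most frequent first."""
    counts = Counter()
    for line in lines:
        # Assumption: lowercase everything and ignore whitespace.
        counts.update(ch for ch in line.lower() if not ch.isspace())
    total = sum(counts.values())
    return [(ch, n / total) for ch, n in counts.most_common()]

def to_csv(rows):
    """Format rows as CSV; the csv module quotes fields such as ','."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for key, freq in rows:
        writer.writerow([key, f"{freq:.10f}"])
    return buf.getvalue()

sample = ["Bon dia, com va?", "Molt bé."]  # hypothetical stand-in for subtitle lines
print(to_csv(char_frequencies(sample)))
```

The quoting done by the `csv` module is why a comma shows up as `","` in the listing.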
And pairs of characters (including spaces, punctuation, and line breaks):
a ,0.0260983659
".
",0.0226165649
s ,0.0208378551
es,0.0154932561
e ,0.0153976514
 e,0.0138557902
 a,0.013622293
 d,0.0133115931
en,0.0124231217
er,0.0123080793
ar,0.0112933537
 p,0.0111326721
qu,0.01055094
re,0.0101962461
", ",0.0099413003
n ,0.0093074701
r ,0.009261709
de,0.0092203352
t ,0.0090235809
 s,0.0087969083
ue,0.0084943125
 l,0.0081293209
 m,0.0081124423
la,0.0081000119
el,0.0080448061
l ,0.0079536496
o ,0.0077572609
 c,0.0076975461
an,0.007696693
ta,0.0076229635
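A rough sketch of how character pairs like these could be counted, again with a hypothetical function name and not necessarily how the real models are built. It assumes the subtitle lines are joined with newlines before counting, which would explain why a period followed by a line break appears as its own entry.

```python
from collections import Counter

def bigram_frequencies(lines):
    """Return (pair, relative frequency) for every adjacent character pair."""
    # Assumption: lines are joined with "\n", so pairs can span a line break.
    text = "\n".join(lines).lower() + "\n"
    counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total = sum(counts.values())
    return [(pair, n / total) for pair, n in counts.most_common()]

# Hypothetical sample lines; repr() makes spaces and newlines visible.
for pair, freq in bigram_frequencies(["No ho sé.", "Què és això?"])[:5]:
    print(repr(pair), round(freq, 4))
```

Unlike the single-character sketch, spaces are kept here, since pairs such as `a ` and `s ` mark word endings.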
And word frequencies:
que,0.0310486004
no,0.0289909297
de,0.0276044436
la,0.0270275549
a,0.0221585201
el,0.0221174341
i,0.017653198
és,0.0154119871
per,0.013375533
un,0.0129613047
en,0.0121513705
una,0.0100010878
ho,0.0088971525
què,0.0088132965
els,0.0082118236
amb,0.0077363029
va,0.0071600878
ha,0.0065266216
si,0.0061871564
com,0.0061635825
les,0.0060500906
això,0.0058507222
em,0.0055850099
però,0.0053802532
bé,0.0053068371
sí,0.0052960604
hi,0.0051206027
al,0.0049094473
del,0.0046039961
et,0.0041975135
es,0.0041352109
tu,0.0041042279
més,0.0040722347
jo,0.0039210245
fer,0.0038199932
aquí,0.003624666
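Word frequencies can be sketched the same way. The tokenizer below is an assumption on my part, not the rule used for the real data: it lowercases the text, normalizes it to NFC so accented letters are single code points, and treats runs of letters as words, keeping the Catalan middle dot (as in "col·legi") inside a word. How apostrophes and hyphens should be handled is left open.

```python
import re
import unicodedata
from collections import Counter

# Runs of letters, optionally joined by a middle dot (hypothetical rule).
WORD = re.compile(r"[^\W\d_]+(?:·[^\W\d_]+)*")

def word_frequencies(lines):
    """Return (word, relative frequency) pairs, most frequent first."""
    counts = Counter()
    for line in lines:
        normalized = unicodedata.normalize("NFC", line).lower()
        counts.update(WORD.findall(normalized))
    total = sum(counts.values())
    return [(word, n / total) for word, n in counts.most_common()]

sample = ["Què és això?", "No, no és res."]  # hypothetical stand-in lines
for word, freq in word_frequencies(sample):
    print(f"{word},{freq:.10f}")
```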
These were built from 519,120 lines of Catalan subtitles provided by OpenSubtitles and cleaned up by OPUS. I'm going to make these models publicly available as CSV files for at least 50 or 60 languages (with proper citations, etc.). But I'll need to rent a big, beefy server in the cloud to parse all the text, so that may have to wait until next week.
You may find the vocabulary frequencies useful if you prepare Anki decks using tools that try to show you the most frequent words first.
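As an illustration of that use, here is a small sketch that orders a list of candidate words by frequency. It assumes a two-column `word,frequency` CSV like the one above. The function names are hypothetical and nothing here is tied to any particular Anki tool.

```python
import csv
import io

def load_frequencies(csv_text):
    """Parse 'word,frequency' rows into a dict."""
    reader = csv.reader(io.StringIO(csv_text))
    return {word: float(freq) for word, freq in reader}

def sort_by_frequency(words, frequencies):
    """Most frequent words first; words missing from the table go last."""
    return sorted(words, key=lambda w: -frequencies.get(w.lower(), 0.0))

# A few rows copied from the word list above.
table = "que,0.0310486004\nno,0.0289909297\nfer,0.0038199932\naquí,0.003624666\n"
frequencies = load_frequencies(table)

# "gat" is not in the table, so it sorts to the end.
print(sort_by_frequency(["aquí", "gat", "que", "fer"], frequencies))
```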