Rust subtitle utilities project

Postby **emk** » Fri Mar 24, 2017 10:24 pm

I can now build simple language "models"! These should help me "decrypt" the image-based subtitles into text.

Here are some examples for Catalan (which has relatively few subtitles, so it doesn't take as long to build as English or French). The models include letter frequencies:

Code:

a,0.1086562448
e,0.1085309941
s,0.0702849359
r,0.0634981261
n,0.0555441503
i,0.0549148385
t,0.054184073
o,0.0510150599
l,0.0489069848
u,0.0417314429
.,0.0356342204
m,0.0303537007
d,0.0266670496
c,0.0262300524
p,0.022159369
v,0.0145261661
q,0.0129763105
",",0.0124501683
g,0.0121331999
b,0.0107803585
...

And pairs of letters:

Code:

a ,0.0260983659
".
",0.0226165649
s ,0.0208378551
es,0.0154932561
e ,0.0153976514
 e,0.0138557902
 a,0.013622293
 d,0.0133115931
en,0.0124231217
er,0.0123080793
ar,0.0112933537
 p,0.0111326721
qu,0.01055094
re,0.0101962461
", ",0.0099413003
n ,0.0093074701
r ,0.009261709
de,0.0092203352
t ,0.0090235809
 s,0.0087969083
ue,0.0084943125
 l,0.0081293209
 m,0.0081124423
la,0.0081000119
el,0.0080448061
l ,0.0079536496
o ,0.0077572609
 c,0.0076975461
an,0.007696693
ta,0.0076229635

And word frequencies:

Code:

que,0.0310486004
no,0.0289909297
de,0.0276044436
la,0.0270275549
a,0.0221585201
el,0.0221174341
i,0.017653198
és,0.0154119871
per,0.013375533
un,0.0129613047
en,0.0121513705
una,0.0100010878
ho,0.0088971525
què,0.0088132965
els,0.0082118236
amb,0.0077363029
va,0.0071600878
ha,0.0065266216
si,0.0061871564
com,0.0061635825
les,0.0060500906
això,0.0058507222
em,0.0055850099
però,0.0053802532
bé,0.0053068371
sí,0.0052960604
hi,0.0051206027
al,0.0049094473
del,0.0046039961
et,0.0041975135
es,0.0041352109
tu,0.0041042279
més,0.0040722347
jo,0.0039210245
fer,0.0038199932
aquí,0.003624666

These were built from 519,120 lines of Catalan subtitles provided by OpenSubtitles and cleaned up by OPUS. I'm going to make these models publicly available as CSV files for at least 50 or 60 languages (with proper citations, etc.). But I'll need to rent a big, beefy server in the cloud to parse all the text, which may need to wait until next week.
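For readers who want to try this on their own text, here is a rough sketch of how such frequency tables could be computed. It is not the actual Rust code behind these models. The function name `build_model` is made up for illustration, and it makes simplifying assumptions: it counts Unicode code points rather than grapheme clusters, skips whitespace in the single-letter table, and treats any run of word characters as a word.

```python
import re
from collections import Counter


def build_model(lines):
    """Return (letters, pairs, words) as dicts of relative frequencies.

    Assumption: each element of `lines` is one subtitle line without its
    trailing newline. Code points stand in for grapheme clusters here.
    """
    letters, pairs, words = Counter(), Counter(), Counter()
    for line in lines:
        text = line + "\n"  # keep the line break so pairs like ".\n" are seen
        letters.update(ch for ch in text if not ch.isspace())
        pairs.update(a + b for a, b in zip(text, text[1:]))
        words.update(re.findall(r"\w+", line.lower()))

    def normalize(counter):
        # Convert raw counts to fractions of the total, most common first.
        total = sum(counter.values())
        return {key: count / total for key, count in counter.most_common()}

    return normalize(letters), normalize(pairs), normalize(words)
```

Calling `build_model(["la casa", "la"])` gives three dictionaries that can be written out as two-column CSV rows in the same `item,frequency` shape as the samples above.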

You may find the vocabulary frequencies useful if you prepare Anki decks using tools that try to show you the most popular words first.

Postby **emk** » Sat Mar 25, 2017 11:54 am

emk wrote:I'm going to make these models publicly available as CSV files for at least 50 or 60 languages (with proper citations, etc.). But I'll need to rent a big, beefy server in the cloud to parse all the text, which may need to wait until next week.

Well, I went ahead and rented a 36-core server for two hours, and I analyzed (I think) almost every single subtitle file in every language supported by the OpenSubtitles community and the OPUS research project. Here's a picture of me maxing out all those processors (my apologies for the huge size; I tried shrinking it but it came out blurrier and with a larger file size!):

You'll notice that the CPU column sums to considerably more than 100%! This ended up costing about $3 because I shut the server down very quickly once I was done. :-)

You can download the individual models here. These contain letter frequency, letter pair frequency and word frequency data, though I suspect the data is pretty much garbage for Chinese and other languages that don't use spaces much. Oh, and I'm pretty sure I've messed up the dotless i in Turkish. I'll probably build another version later which fixes these issues and reduces file size a bit. (I think I may want to use the negative base-2 log probability (-log2(prob)) instead of raw probability—that will lose some accuracy, but the files will be smaller.) But the word frequency data may be interesting to intermediate language learners who are interested in spoken languages.
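To illustrate the -log2(prob) idea, here is a minimal sketch in Python. The helper names `to_bits` and `from_bits` and the choice of two decimal places are assumptions for the example, not the format of any released file.

```python
import math


def to_bits(prob, digits=2):
    """Encode a probability as a rounded negative base-2 logarithm."""
    return round(-math.log2(prob), digits)


def from_bits(bits):
    """Recover an approximate probability from its -log2 encoding."""
    return 2.0 ** -bits


# Round-trip the frequency of "que" from the Catalan word list.
# The recovered value is close to the original but not identical,
# which is the accuracy loss mentioned in the post.
encoded = to_bits(0.0310486004)
approx = from_bits(encoded)
```

A rounded value such as this takes fewer characters per row than a ten-digit probability, which is where the file-size saving would come from.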

To decompress the files, rename them from "en.submodel" to "en.tar.gz" and open them with your favorite archiving tool. Each file contains CSV data which you should be able to load into a spreadsheet of your choice.

Anyway, this is a key step in subtitle OCR—I need to know what letters, letter pairs and words are most common in each language, so that I can test my OCR output to see if I've matched up all the letters correctly.
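One way such a check could work, sketched in Python rather than taken from the actual OCR engine: score each candidate reading by the average log probability of its letter pairs, and prefer the reading with the higher score. The function name `pair_score` and the `floor` value for unseen pairs are assumptions, and the pair frequencies are the Catalan values quoted earlier in the thread.

```python
import math


def pair_score(text, pair_probs, floor=1e-7):
    """Average log2 probability of the adjacent character pairs in `text`.

    Pairs missing from `pair_probs` get the small `floor` probability,
    so unlikely letter combinations pull the score down sharply.
    """
    pairs = [a + b for a, b in zip(text, text[1:])]
    if not pairs:
        return float("-inf")
    return sum(math.log2(pair_probs.get(p, floor)) for p in pairs) / len(pairs)


# A few Catalan pair frequencies from the sample listing.
catalan_pairs = {"qu": 0.01055094, "ue": 0.0084943125, "es": 0.0154932561}

# If the OCR confuses "u" with "v", the misread candidate scores lower.
good = pair_score("que", catalan_pairs)
bad = pair_score("qve", catalan_pairs)
```

This is a crude heuristic: it treats each pair independently and ignores the word list, so a real engine would presumably combine several signals.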

Postby **Cainntear** » Sat Mar 25, 2017 1:49 pm

Stop tempting me with your lovely, lovely data! I have assignments to work on!

Postby **tommus** » Sat Mar 25, 2017 2:25 pm

emk wrote:To decompress the files, rename them from "en.submodel" to "en.tar.gz" and open them with your favorite archiving tool.

Shouldn't that be "en.csv.gz"?

My 7-Zip extracts the .gz to a .csv file, not a .tar? Maybe it is doing both in the background. But when I use "en.tar.gz", 7-Zip extracts that to a file with a .tar extension but the file is actually a .csv file.

Postby **Cainntear** » Sat Mar 25, 2017 5:11 pm

tommus wrote:
emk wrote:To decompress the files, rename them from "en.submodel" to "en.tar.gz" and open them with your favorite archiving tool.

Shouldn't that be "en.csv.gz"?

My 7-Zip extracts the .gz to a .csv file, not a .tar? Maybe it is doing both in the background. But when I use "en.tar.gz", 7-Zip extracts that to a file with a .tar extension but the file is actually a .csv file.

I suspect that his compression just skipped the .tar step as tar is pointless for a single file.

Postby **emk** » Sat Mar 25, 2017 7:16 pm

tommus wrote:Shouldn't that be "en.csv.gz"?

My 7-Zip extracts the .gz to a .csv file, not a .tar? Maybe it is doing both in the background. But when I use "en.tar.gz", 7-Zip extracts that to a file with a .tar extension but the file is actually a .csv file.

The models are a gzip file containing a tar archive containing three CSV files (with messed up permissions--I'll fix them in the next version). I'm not sure what your 7-Zip program is up to. :-/

I'm going to write another program which takes these submodel files as input, so most users will never need to decompress them. But I wanted to make sure it was possible to open them up and look if you really wanted.

The three CSV files for each language are:

graphemes.csv: Frequencies of Unicode "grapheme clusters". (If you don't know what this means, just pretend that "grapheme cluster" is a fancy way to say "letter". Except some writing systems are weird and it's more complicated.)
pairs.csv: Frequencies of pairs of grapheme clusters. The OCR engine will need this to evaluate the quality of its guesses.
words.csv: Frequencies of words (all in lowercase).
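For anyone who would rather read a model from a script than from an archiving tool, the layout described above (gzip, then tar, then CSV) can be opened with Python's standard library. This is a sketch under assumptions: the CSV files are UTF-8 with two columns and no header row, as the samples in this thread suggest, and `read_submodel` is a made-up helper name. Whether Python's `tarfile` accepts these particular archives has not been checked here.

```python
import csv
import io
import tarfile


def read_submodel(fileobj):
    """Load every CSV member of a gzip-compressed tar archive.

    `fileobj` is a binary file object. Returns a dict that maps each
    member name to a list of (item, probability) tuples.
    """
    tables = {}
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            raw = archive.extractfile(member)
            # newline="" lets the csv module handle quoted line breaks,
            # such as the period-plus-newline pair.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            tables[member.name] = [
                (item, float(prob)) for item, prob in csv.reader(text)
            ]
    return tables


# Example use (the file name is the one mentioned in this thread):
#     with open("en.submodel", "rb") as f:
#         tables = read_submodel(f)
#     print(tables["words.csv"][:10])
```

Because the archive is read directly, no renaming to "en.tar.gz" is needed with this approach.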

Postby **tommus** » Sat Mar 25, 2017 11:18 pm

I was using 7-Zip version 15.14 for Windows 7/64. So I updated it to the current version 16.04. But the same problem.

I start with nl.tar.gz.

7-Zip extracts it to nl.tar.

Then when I use 7-Zip on nl.tar, it says "Can not open the file as [tar] archive. Is not archive."

So I tried Linux (Ubuntu 16.04). It works perfectly in one step and produces the three files.

Strange. I don't understand why 7-Zip would not be working correctly, and both versions behave the same way. Is anyone else able to extract these three files using 7-Zip on Windows?

Postby **Cainntear** » Sun Mar 26, 2017 10:49 am

Tommus,
What are the contents of the CSV file you're getting? Is it one of the three files? If so, which one? Or does it run all three together?

Postby **tommus** » Sun Mar 26, 2017 2:36 pm

When I right-click en.tar.gz and extract with 7-Zip, I get a single file en.tar, which will not open in 7-Zip.

If I change the name to en.csv, it will open in Excel with two columns. The second column contains the occurrence stats.

The first item in the first column is: graphemes.csv0000500000000155710004656 e
Then the letters, etc.

Then pairs.csv0000500000005022200004011 en
and then pairs, or single letters with a single space.

Then words.csv0000500000006244370004050 ik
and the words.

These three are the three file names, some numbers and the first item in that file (e, en, ik)

In other words, the three files essentially merged in the two columns.

In a plain text editor, the en.tar file contains these entries (plus the data):

graphemes.csv 0000500 00000015571 0004656 e,0.1702378572

pairs.csv 0000500 00000502220 0004011 en,0.0303551223

words.csv 0000500 00000624437 0004050 ik,0.0380278386

Probably this is the info that .tar needs. But somehow, 7-Zip in Windows doesn't see it that way. But Linux is OK with it.

Postby **emk** » Sun Mar 26, 2017 3:09 pm

I generated these files with a tar library, not the command line tool, and so it's definitely possible that some fields are zero when 7-Zip isn't expecting that. Thank you for finding this; I'll try to take a look later.
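One way to see which header fields are empty is to read the first 512-byte block of the archive and print each field. The sketch below uses the standard tar header offsets; the helper names `describe_header` and `first_header` are made up for the example. In the text dump quoted earlier, only a name, a mode, a size and a checksum are visible, which would be consistent with the uid, gid and mtime fields being blank, though that is a guess from the dump and not a confirmed diagnosis.

```python
import gzip

# (field, offset, length) within the 512-byte header block of a tar member
FIELDS = [
    ("name", 0, 100), ("mode", 100, 8), ("uid", 108, 8), ("gid", 116, 8),
    ("size", 124, 12), ("mtime", 136, 12), ("chksum", 148, 8),
    ("typeflag", 156, 1), ("magic", 257, 6),
]


def describe_header(block):
    """Map each header field to its text; blank fields come back as ''."""
    return {
        name: block[off:off + length].strip(b"\0 ").decode("ascii", "replace")
        for name, off, length in FIELDS
    }


def first_header(path_or_file):
    """Decompress a gzip stream and describe its first tar header block."""
    with gzip.open(path_or_file, "rb") as stream:
        return describe_header(stream.read(512))


# Example use (file name as in this thread):
#     for field, value in first_header("nl.tar.gz").items():
#         print(f"{field:8} {value!r}")
```

Fields that print as an empty string are the ones a stricter tool might reject.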
