Rust subtitle utilities project

emk
Black Belt - 1st Dan
Posts: 1620
Joined: Sat Jul 18, 2015 12:07 pm
Location: Vermont, USA
Languages: English (N), French (B2+)
Badly neglected "just for fun" languages: Middle Egyptian, Spanish.
Language Log: viewtopic.php?f=15&t=723
x 6330

Rust subtitle utilities project

Postby emk » Mon Feb 13, 2017 1:48 pm

(This thread is split out of the substudy thread.)

A while back, I wrote a utility called substudy, which is a portable command-line version of the excellent Subs2SRS utility. It provides several tools, including support for converting subtitles into Anki audio cards, and for generating MP3 playlists containing only the dialog from a video.

But as part of a future version of substudy, I've been working on lower-level subtitle libraries and tools. These are all in the public domain and they're written in Rust. So far, I have:

  • vobsub: A Rust library for parsing sub/idx files.
  • vobsub2png: A command-line utility for converting sub/idx files to PNG images with JSON metadata.
For example, if you have two files "movie.idx" and "movie.sub", then running:

Code: Select all

vobsub2png movie.idx

…will create a directory "movie_subtitles" containing "index.json", "0001.png", "0002.png", etc. If you know how to parse JSON, you could turn this into Anki cards or a web page displaying all the subtitles in a film. These subtitles look like:

[attached image: sample subtitle rendered as a PNG]
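
If you'd rather explore that JSON from code, here's a minimal sketch using the serde_json crate. It parses index.json into a generic Value and dumps the top-level structure, so it doesn't assume anything about the exact schema:

Code: Select all

use std::fs::File;
use std::io::BufReader;

use serde_json::Value;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Parse into a generic Value, so we don't need to know the schema.
    let file = File::open("movie_subtitles/index.json")?;
    let index: Value = serde_json::from_reader(BufReader::new(file))?;

    // Dump the top-level structure to see what's available.
    match &index {
        Value::Object(map) => {
            for (key, value) in map {
                println!("{}: {}", key, value);
            }
        }
        other => println!("{}", other),
    }
    Ok(())
}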

These tools—and hopefully others soon—are available in my Rust Subtitle Utilities project on GitHub. I expect work on this project will proceed slowly, but the next item on my agenda is working on improved subtitle OCR.
2 x

kunsttyv
Orange Belt
Posts: 103
Joined: Mon Aug 03, 2015 11:24 am
Location: Trondheim
Languages: Norwegian (native)
Spanish (learning)
x 212

Re: Rust subtitle utilities project

Postby kunsttyv » Mon Feb 13, 2017 2:20 pm

Hi emk, thanks for taking the initiative to create such useful tools for language learners. I have some limited experience with subs2srs, and while I could see its potential and usefulness, I didn't really use it that much, since I usually don't have access to a Windows installation, and I also didn't like the clunky, uncustomizable nature of the program.

How much effort do you think it would take someone with a decent amount of experience in imperative languages like Python, C, etc. to become functional in Rust? Is it just a matter of adjusting to the syntax, or would it be necessary to learn totally new concepts or even coding paradigms? I've heard that Rust comes with a pretty comprehensive type system for handling type-safety issues.

And one more thing: will this utility project only be for "subtitle study", or do you plan to incorporate other SRS functionality as well? I have written some code to generate Anki cards from Kindle dictionary look-ups. It's working nicely for my specific setup, but I would like to rewrite it and make it more robust.
0 x

emk
Black Belt - 1st Dan
Posts: 1620
Joined: Sat Jul 18, 2015 12:07 pm
Location: Vermont, USA
Languages: English (N), French (B2+)
Badly neglected "just for fun" languages: Middle Egyptian, Spanish.
Language Log: viewtopic.php?f=15&t=723
x 6330

Re: Rust subtitle utilities project

Postby emk » Mon Feb 13, 2017 4:09 pm

kunsttyv wrote:Hi emk, thanks for taking the initiative to create such useful tools for language learners. I have some limited experience with subs2srs, and while I could see its potential and usefulness, I didn't really use it that much, since I usually don't have access to a Windows installation, and I also didn't like the clunky, uncustomizable nature of the program.

You might want to try out my substudy command-line tool. It's trivial to install on Mac or Linux systems, and despite not having a GUI, it actually has a much simpler "UI" than Subs2SRS, because it doesn't insist on asking you dozens of questions when it can figure things out on its own.

kunsttyv wrote:How much effort do you think it would take someone with a decent amount of experience in imperative languages like Python, C, etc. to become functional in Rust? Is it just a matter of adjusting to the syntax, or would it be necessary to learn totally new concepts or even coding paradigms? I've heard that Rust comes with a pretty comprehensive type system for handling type-safety issues.

Rust combines several old and new ideas. If you already know most of these, the learning curve shouldn't be too bad (maybe a week or two to become truly comfortable). But if none of this is familiar, it might take a while:

  • Memory. Like C and C++, Rust exposes the difference between the stack and the heap, and between passing things as values and passing them as references. People who've only worked in garbage-collected languages might find this confusing.
  • Anonymous functions. Like JavaScript, Ruby, or a functional language, Rust tends to rely more heavily on functions which take anonymous functions as arguments. Python or C programmers might find this surprising, but users of any other reasonably modern programming language are totally used to this.
  • Generic types. Like C++, C#, Java and TypeScript, Rust uses generic types. So instead of using a type like "Vec" to represent a simple array, you would write "Vec<u8>" or "Vec<String>", where "u8" and "String" are compile-time type parameters. This may be challenging for people who've only ever used C and scripting languages. Generic types do add some complexity to the language, but they make it easier to write "zero-overhead" abstractions: code which is both fast and high-level. So it's a tradeoff.
  • The "borrow checker". This is the only genuinely new thing in Rust. Basically, all values must be "owned" by some piece of code, and the owner can choose to hand out either a single mutable reference at a time, or multiple read-only references. This allows Rust to have automatic, correct memory management without needing a garbage collector. Some people already code in a style where "who owns what" is mostly obvious, and they'll adjust relatively rapidly. Others will take longer and may need to change how they think about programs. (This also means that implementing doubly-linked lists in Rust is an advanced topic.) There's a tiny example of this rule just below the list.
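
Here's the whole ownership rule in a few lines (a toy sketch, nothing project-specific):

Code: Select all

fn main() {
    let mut words = vec!["bonjour".to_string()];

    // Any number of read-only borrows may coexist...
    let first = &words[0];
    let also_first = &words[0];
    println!("{} {}", first, also_first);

    // ...but mutating `words` requires that no other borrows are still
    // alive. This compiles only because `first` and `also_first` are
    // never used again below this point.
    words.push("monde".to_string());

    // Uncommenting the next line would move a use of `first` past the
    // mutation, and the compiler would reject the whole program:
    // println!("{}", first); // error[E0502]
    println!("{:?}", words);
}
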
Good Rust introductions include the official book, The Rust Programming Language, which you can read for free online.

For the kind of work I'm doing in this project, I would say the advantages of Rust are:

  • If a Rust program compiles at all, it typically works correctly. In most languages, it's relatively easy to get your code to compile, but you often need to spend a fair bit of time running it and debugging it. Rust has a different workflow: You may spend more time trying to figure out compiler error messages, but once your program actually compiles, it will very often work correctly on the first try. Some people love this; other people get depressed "fighting" with the compiler and just want to see their program run, even if it's not correct yet. :-) (If you get stuck, go to the Mozilla IRC server and ask for help on the #rust channel.)
  • Rust code tends to be fast. This is especially useful for things like parsing MPEG-2 format subtitles, or for doing OCR. Rust gives you more-or-less the same speed as C, but with many convenient things you'd expect from higher-level languages.
  • Rust's "cargo" is an excellent library manager and build tool. Cargo provides support for downloading third-party libraries and compiling your code. It's very advanced and well-thought-out, and you can easily use any library on crates.io.
  • It's easy to make Rust programs work on Mac, Windows and Linux. This is partly thanks to Cargo, and partly thanks to the fact that common Rust libraries handle cross-platform issues for you. My vobsub2png tool worked on MacOS and Windows on the very first try!
  • It's easy to make standalone, statically-linked binaries using Rust. This makes it trivial to just download and unzip a binary without a lot of fuss, even on Linux. (See the example commands just after this list.)
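
For instance, producing a fully static Linux binary is just two commands, assuming you installed Rust via rustup:

Code: Select all

rustup target add x86_64-unknown-linux-musl
cargo build --release --target x86_64-unknown-linux-musl
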
The disadvantages are that Rust may be overkill, especially for throwaway scripts, and especially if you haven't learned Rust yet!

kunsttyv wrote:And one more thing: will this utility project only be for "subtitle study", or do you plan to incorporate other SRS functionality as well? I have written some code to generate Anki cards from Kindle dictionary look-ups. It's working nicely for my specific setup, but I would like to rewrite it and make it more robust.

Nice! I'm always happy to see people working on language-learning tools!

Personally, I actually have two projects here: substudy, which is meant for learning languages using subtitles, and the "Rust subtitle utilities" that I'm discussing in this thread, which are a bit lower-level but which may be useful to language learners building their own tools, and which may someday be part of a newer version of substudy.

For both projects, I'm very likely to focus strictly on learning using subtitles and video. It's already a huge project as is! For working with text, there are already tools like readlang.com, Learning with Texts, and my own (permanently beta, and not really maintained) SRS Collector, which can import Kindle highlights and turn them into Anki cards. (There's an associated Anki plugin and Chrome plugin that I used to provide to beta testers.)

But like leosmith, I increasingly feel like "Listening is Everything", especially in the beginning. I still feel like tools like readlang, LWT and SRS Collector are useful for rapidly improving vocabulary while doing extensive reading. But I feel like video is much more useful early on, and that's where I want to focus my efforts as both a student and a programmer. :-)
4 x

kunsttyv
Orange Belt
Posts: 103
Joined: Mon Aug 03, 2015 11:24 am
Location: Trondheim
Languages: Norwegian (native)
Spanish (learning)
x 212

Re: Rust subtitle utilities project

Postby kunsttyv » Wed Feb 15, 2017 12:52 pm

Thank you so much for that very detailed answer! It's actually kind of cool that you're developing a utility library solely devoted to the quite narrow activity of studying subtitled media with spaced-repetition software. It has to be one of the perks of living in our time that people like you make odd creations to improve aspects of their lives, and readily share them with people across the globe. I will set aside some time this weekend to get your substudy app up and running on my Linux machine. Maybe I won't have time to delve into the details of the Rust code, but it would be nice to have some functioning flashcards up and running. Cheers!
0 x

emk
Black Belt - 1st Dan
Posts: 1620
Joined: Sat Jul 18, 2015 12:07 pm
Location: Vermont, USA
Languages: English (N), French (B2+)
Badly neglected "just for fun" languages: Middle Egyptian, Spanish.
Language Log: viewtopic.php?f=15&t=723
x 6330

Re: Rust subtitle utilities project

Postby emk » Sat Feb 18, 2017 6:35 am

The first three steps are working!

Step 1: Extract the subtitle images from the MPEG-2 Program Stream and decompress them

This gives us the typical sub/idx image-based subtitles that we're all used to:

[attached image: a typical sub/idx subtitle]
These tend to have a transparent background, plus a shadow/outline color that connects multiple letters together.

Step 2: Remove the shadows and convert to black and white

The first order of business is to remove those annoying shadows, which we do by measuring some properties of adjacent pixels:

[attachment: binarize.png]
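
The real code decides what to keep by looking at each pixel's neighbors; just to make the idea concrete, here's the crudest possible stand-in, a sketch using the image crate that keeps opaque, bright pixels as "ink" and treats everything else as background:

Code: Select all

use image::{GrayImage, Luma, Rgba, RgbaImage};

// Crude stand-in for the real shadow removal: keep pixels that are
// both opaque and bright, on the theory that the letter fill is light
// and the outline/shadow is dark. Ink becomes black, the rest white.
fn binarize(input: &RgbaImage) -> GrayImage {
    GrayImage::from_fn(input.width(), input.height(), |x, y| {
        let Rgba([r, g, b, a]) = *input.get_pixel(x, y);
        let brightness = (r as u32 + g as u32 + b as u32) / 3;
        if a > 128 && brightness > 128 {
            Luma([0u8]) // ink
        } else {
            Luma([255u8]) // background
        }
    })
}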

Step 3: Break the image up into contiguous segments

At this point, we can treat this as a standard "flood fill", and find out which dots are connected to which other dots:

[attachment: segment.png]
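
The core of this step is an explicit-stack flood fill that assigns a label to every connected group of ink pixels. A rough sketch of the idea (not my actual code):

Code: Select all

// Label 4-connected components of "ink" pixels in a binary image.
// labels[y][x] == 0 means background; 1, 2, 3... are segment IDs.
fn label_components(ink: &[Vec<bool>]) -> Vec<Vec<usize>> {
    let (h, w) = (ink.len(), ink[0].len());
    let mut labels = vec![vec![0; w]; h];
    let mut next_label = 1;
    for y in 0..h {
        for x in 0..w {
            if !ink[y][x] || labels[y][x] != 0 {
                continue;
            }
            // Flood fill from (y, x), using a stack instead of recursion.
            let mut stack = vec![(y, x)];
            while let Some((cy, cx)) = stack.pop() {
                if !ink[cy][cx] || labels[cy][cx] != 0 {
                    continue;
                }
                labels[cy][cx] = next_label;
                if cy > 0 { stack.push((cy - 1, cx)); }
                if cy + 1 < h { stack.push((cy + 1, cx)); }
                if cx > 0 { stack.push((cy, cx - 1)); }
                if cx + 1 < w { stack.push((cy, cx + 1)); }
            }
            next_label += 1;
        }
    }
    labels
}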

Next steps

There are still a lot of steps to go! Here are the next few:

  • Reattach the dots to the "i"s, and reassemble other "multi-part" letters.
  • Break the subtitle into lines.
  • Generate a statistical language model of all the languages we want to support.
  • Then the fun part! I need to map images to letters with as little human intervention as possible. I have a couple of tricks up my sleeve. :-)
0 x

Arnaud
Blue Belt
Posts: 984
Joined: Sat Jul 18, 2015 11:57 am
Location: Paris, France
Languages: Native: French
Intermediate: English, Russian, Italian
Tourist : Breton, Greek, Chinese, Japanese, German, Spanish, Latin
Language Log: viewtopic.php?t=1524
x 2172

Re: Rust subtitle utilities project

Postby Arnaud » Sat Feb 18, 2017 7:37 am

Hi emk.
I'm following this thread out of curiosity, and because I myself extracted hard-coded subtitles from several Russian series. I did it all "by hand", using several programs and cleaning the result by reading. The most problematic thing was the letter ы, which was often OCRed as ь or ь|, so during the OCR I replaced ы with ьi, and then searched the OCRed text to replace ьi with ы. I had all kinds of problems with that letter; often the OCR program didn't recognise it when attached to other letters, so I had to create a little "library" with all the combinations of two letters: ры, вы, ый, etc. I also had my share of problems with й: the "tilde" over и was often misinterpreted.
Edit: another problem was double consonants, for example тт (tt in the Latin alphabet). As the OCR program didn't recognise тт, I had to add it to the "dictionary" as a single letter, тт. Problems appeared when two words like "вот так" were very close together: the program OCRed them as воттак. So I had to search for тт and other problematic combinations in the OCRed texts.

The second big problem was how to keep sentences from being cut in the middle: when a sentence is very long, it's often split across several subtitles, and you have to glue the parts back together to get a whole sentence in your transcript. Also, the OCR produced big chunks of text, and the only way to get "one dialog, one line" in the transcript was to watch the series and insert the line breaks by hand.
Another problem, specific to Russian, is that the dash is used both at the beginning of sentences and inside sentences to replace copula verbs (there is no "to be" verb in the present tense in Russian), so it's impossible a priori to tell whether a dash starts a new sentence or replaces a copula verb: you have to listen to the audio to decide (though often, a capitalized letter after the dash indicates a new sentence).

I still have complete videos with Russian subtitles (1 MB per minute, about 45 minutes per episode) if you want to test your program on Russian Cyrillic. The sound is not synchronised with the subtitles, due to the slowness of my computer (another problem when you're correcting OCR mistakes).
3 x

emk
Black Belt - 1st Dan
Posts: 1620
Joined: Sat Jul 18, 2015 12:07 pm
Location: Vermont, USA
Languages: English (N), French (B2+)
Badly neglected "just for fun" languages: Middle Egyptian, Spanish.
Language Log: viewtopic.php?f=15&t=723
x 6330

Re: Rust subtitle utilities project

Postby emk » Sat Feb 18, 2017 1:22 pm

Arnaud wrote:The most problematic thing was the letter ы, which was often OCRed as ь or ь|, so during the OCR I replaced ы with ьi, and then searched the OCRed text to replace ьi with ы. I had all kinds of problems with that letter; often the OCR program didn't recognise it when attached to other letters, so I had to create a little "library" with all the combinations of two letters: ры, вы, ый, etc. I also had my share of problems with й: the "tilde" over и was often misinterpreted.
Edit: another problem was double consonants, for example тт (tt in the Latin alphabet). As the OCR program didn't recognise тт, I had to add it to the "dictionary" as a single letter, тт. Problems appeared when two words like "вот так" were very close together: the program OCRed them as воттак. So I had to search for тт and other problematic combinations in the OCRed texts.

This is super-useful information. Thank you very much!

Arnaud wrote:The second big problem was how to keep sentences from being cut in the middle: when a sentence is very long, it's often split across several subtitles, and you have to glue the parts back together to get a whole sentence in your transcript.

Fortunately, I don't have to deal with this problem quite yet, because I'm planning on converting the subtitles to SRT text format, which preserves the timing information and subtitle line breaks. I agree that it would be very handy to have a tool which converted SRT files into transcripts, or which extracted the sentences. Maybe someday!
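
For anyone who hasn't looked inside one, an SRT file is just a series of numbered cues with start and end times, so nothing is lost except the image data:

Code: Select all

42
00:03:17,440 --> 00:03:20,980
- Bonjour !
- Bonjour, ça va ?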

Arnaud wrote:I still have complete videos with Russian subtitles (1 MB per minute, about 45 minutes per episode) if you want to test your program on Russian Cyrillic. The sound is not synchronised with the subtitles, due to the slowness of my computer (another problem when you're correcting OCR mistakes).

If you have any good example subtitles (and it sounds like you do), please feel free to upload them to my dropbox here. Here's what I'm looking for, in order of preference:

  1. Subtitles in *.sub/*.idx format extracted from real-world videos. I'm ultimately trying to collect thousands of these. For languages with tricky characters, like those written in Cyrillic, I'd love to also have "clean" text versions, either in SRT format or as transcripts. Basically anything you have. :-)
  2. Videos with toggleable subtitles that haven't been OCRed yet. These are videos where you can turn the subtitles on and off. I can extract the vob/sub files from these myself.
  3. Not useful: Videos with "burned-in" subtitles that can't be turned on or off because the subtitle images have been merged into the raw video data.
Seriously, I'm looking for as much raw sub/idx data as I can get. This diagram by Banko & Brill shows the effects of adding more training data to a statistical learning problem:

[attachment: banko-brill-more-data-better-algorithms.png]

Here we see that with ~200,000 words of training data, several different algorithms perform with 75% to 83% accuracy. But with a billion words of training data, all the algorithms are much better, showing 93% to 97% accuracy.

Actually, there's a lesson for language learners here! If you want to really internalize a language, there's a dramatic benefit to getting a ridiculously large amount of comprehensible input. This is why I think the Super Challenge worked so well for me: By getting 2 to 3 million words of comprehensible French input, my brain was able to vastly improve my understanding of French. Humans still beat the machines here: we need millions of words, and the machines need billions, and our resulting comprehension is better than the machines'.
5 x

emk
Black Belt - 1st Dan
Posts: 1620
Joined: Sat Jul 18, 2015 12:07 pm
Location: Vermont, USA
Languages: English (N), French (B2+)
Badly neglected "just for fun" languages: Middle Egyptian, Spanish.
Language Log: viewtopic.php?f=15&t=723
x 6330

Re: Rust subtitle utilities project

Postby emk » Sat Feb 18, 2017 1:57 pm

(I'm moving this portion of my reply from the substudy thread to this one because it's more of a low-level subtitle utility issue than a substudy issue.)

kaegi wrote:This sounds very interesting (both the .sub and .mkv part). I will see when I get to implement that. I also had the idea to use ffmpeg to automatically extract the subtitles but decided against such a heavyweight dependency (the binary is supposed to be the smallest usable wrapper around the library). But if there ever is a small standalone Rust library that does exactly that, it would be very helpful!

For more details on the various MPEG-2-related formats, see my vobsub documentation, which links to several helpful background pages. A '*.sub' file is essentially an MPEG-2 Program Stream file containing a single media channel: the subtitle image data. As I understand it, in a real MPEG-2 VOB file, that data would be intermixed with various other audio, video and subtitle channels. So all we'd need to change would be to ignore the packets for the channels we don't want (and deal with the fact that there are multiple subtitle channels).
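
To make that concrete: once you've found the payload of a "private stream 1" (0xBD) PES packet, picking out a single subtitle track comes down to one byte, because the first payload byte is a substream ID, and DVD subtitles use 0x20 plus the track number. A sketch:

Code: Select all

// Does this "private stream 1" PES payload belong to the subtitle
// track we want? DVD subpicture substream IDs run from 0x20 to 0x3F.
// (A sketch of the idea, not vobsub's actual code.)
fn is_wanted_subtitle(payload: &[u8], track: u8) -> bool {
    match payload.first() {
        Some(&id) if (0x20..=0x3F).contains(&id) => id - 0x20 == track,
        _ => false,
    }
}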

For Matroska, the situation is a bit more complicated. It looks like they store the *.idx data directly in the main video file, and the images are rewrapped into Matroska packets. But I'd have to actually try dumping a Matroska stream to be sure. I'll probably get around to this at some point, because Matroska is the container format with the most flexible subtitle support.

Also, I appreciate your decision to relicense aligner under the GPL! I've added a link from my Rust subtitles project to your repository. Once I get ready to overhaul substudy, I'll definitely take a good, detailed look at how to best integrate it. Thank you once again for putting so much work into building subtitle tools and for sharing your code with the world!
1 x

emk
Black Belt - 1st Dan
Posts: 1620
Joined: Sat Jul 18, 2015 12:07 pm
Location: Vermont, USA
Languages: English (N), French (B2+)
Badly neglected "just for fun" languages: Middle Egyptian, Spanish.
Language Log: viewtopic.php?f=15&t=723
x 6330

Fun with Latin characters

Postby emk » Sun Feb 19, 2017 12:56 pm

This post is for typography and script geeks! The Latin alphabet is more complicated than it looks.

Here's a fun subtitle with a lot going on:

[attachment: two_line_subtitle_italic_colon.png]

The first thing to observe here is a mix of capital roman letters in the word "FRAU" with oblique letters in the rest of the line. Oblique letters are similar to italic letters, but instead of being written in a more cursive-like style, they're merely slanted a bit to the right.

The other interesting detail is easier to see if we color in blocks of adjacent pixels:

[attachment: two_line_subtitle_italic_colon_segmented.png]

Note how the "rt" in "Worte" is all one color. This looks like a ligature. This often happens when a font designer thinks that two letter shapes would be more visually pleasing if they were drawn together as a single shape.

There's something similar going on with the "RA" in "FRAU", but this is probably just accidental pixel overlap caused by kerning, where the distance between two letters is manually adjusted to make them more visually pleasing.

The kerning in this font, by the way, is fairly ugly. The letter spacing is irregular and not very pleasing. And so I'll leave you with this parting xkcd:

[attached image: xkcd comic]
1 x

emk
Black Belt - 1st Dan
Posts: 1620
Joined: Sat Jul 18, 2015 12:07 pm
Location: Vermont, USA
Languages: English (N), French (B2+)
Badly neglected "just for fun" languages: Middle Egyptian, Spanish.
Language Log: viewtopic.php?f=15&t=723
x 6330

Finding data for language models

Postby emk » Mon Mar 20, 2017 12:17 am

In order to OCR subtitles, I'll need to know what each language is supposed to look like! This means I need to find a huge amount of text in each language. Ideally, that text should be casual, conversational text.

The OPUS project is an open-source corpus for machine-translation research, and they provide a 75 GB (compressed) database of subtitles from the OpenSubtitles project. This is stored in a custom XML format, gzipped, packed into a tar archive, and (just for fun) gzipped again.

So I went ahead and wrote a tool that parses this custom format and emits the subtitles for each language as a single, giant text file with one subtitle per line. You can find my tool here. Here are some sample subtitles from Catalan:

Code: Select all

 Quan redacti la història, no dubti que el citaré.
 Em sap greu, Sydney...
 Pran, no n'hi ha prou...
 Amb "no" no n'hi ha prou, és una gran història... ho entens?
 Hi hem d'anar.
 - Ja ho sé...
 - No vull sentir "no".
 Vull que em diguis que hi arribarem, ja hi hauríem de ser.
 T'espero aquí.
 Bona nit, Sydney.
 - Missatges?
 - No, senyor.
 No, senyor.
 Ha estat un dia molt dur, he de tornar...

There's a truly massive amount of data here, covering over 60 languages:

Code: Select all

Extracted 28435155 sentences from 30502 files.
Extracted 23611196 sentences from 24060 files.
Extracted 60840749 sentences from 67371 files.
Extracted 26782811 sentences from 27605 files.
Extracted 80140630 sentences from 90319 files.
Extracted 79320 sentences from 89 files.
Extracted 112360292 sentences from 124815 files.
Extracted 22917237 sentences from 23492 files.
Extracted 229583 sentences from 188 files.
Extracted 7335505 sentences from 6438 files.
Extracted 38677592 sentences from 44584 files.
Extracted 101502145 sentences from 114150 files.
Extracted 245104 sentences from 370 files.
Extracted 81728 sentences from 57 files.
Extracted 71564854 sentences from 79609 files.
Extracted 90560244 sentences from 104714 files.
Extracted 1113 sentences from 1 files.
Extracted 12364402 sentences from 10997 files.
Extracted 1749857 sentences from 1282 files.
Extracted 88640241 sentences from 96735 files.
Extracted 2280708 sentences from 2606 files.
Extracted 262003 sentences from 271 files.
Extracted 2522 sentences from 2 files.
Extracted 718564 sentences from 657 files.
Extracted 1751246 sentences from 1544 files.
Extracted 498971 sentences from 392 files.
Extracted 5824206 sentences from 5626 files.
Extracted 310479 sentences from 251 files.
Extracted 1291617 sentences from 1029 files.
Extracted 80940652 sentences from 99187 files.
Extracted 8823761 sentences from 8915 files.
Extracted 76876586 sentences from 96359 files.
Extracted 337845355 sentences from 323905 files.
Extracted 84696889 sentences from 98031 files.
Extracted 32340274 sentences from 38511 files.
Extracted 606881 sentences from 511 files.
Extracted 142933545 sentences from 160986 files.
Extracted 91705823 sentences from 96254 files.

...and so on. You could spend a long time on Anki reviews! :lol:

Now that I have this, my next goal is to extract a probabilistic model for each of the 65 languages from this data. To begin with, I'll want:

  1. The frequency of each individual letter.
  2. The frequency of each pair of letters.
  3. The frequency of the 5,000 most common words.
This should make it easy for me to compare potential OCR output against what the language should look like.
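
The counting itself is straightforward; a sketch of the sort of tallying I have in mind:

Code: Select all

use std::collections::HashMap;

// Tally letter and letter-pair frequencies for one line of subtitle
// text. Run over millions of lines, these counts become a crude model
// of what a language "should look like".
fn count_ngrams(
    line: &str,
    letters: &mut HashMap<char, u64>,
    pairs: &mut HashMap<(char, char), u64>,
) {
    for word in line.to_lowercase().split_whitespace() {
        let chars: Vec<char> = word.chars().filter(|c| c.is_alphabetic()).collect();
        for &c in &chars {
            *letters.entry(c).or_insert(0) += 1;
        }
        for pair in chars.windows(2) {
            *pairs.entry((pair[0], pair[1])).or_insert(0) += 1;
        }
    }
}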

Also, approaching the problem from the opposite direction, I've written some code that attempts to attach dots to i's, and other diacritics to the appropriate letter. I haven't tested it thoroughly yet, though. There are a lot of little pieces to implement here!
1 x

