substudy: Make Anki cards and other resources from video & bilingual subtitles (command-line)

language2015
Yellow Belt
Posts: 91
Joined: Sun Oct 11, 2015 4:08 am
Languages: English (N) French (A0) Spanish (A0)
x 106

Re: substudy: A tool for making bilingual subtitles (MacOS X or Linux, command-line)

Postby language2015 » Tue Nov 24, 2015 4:53 am

arthaey wrote:
language2015 wrote:All of this programmer stuff you talk about...

Where did you learn it? Self taught? College?

I would love to know. I want to learn all of this programming magic but have no idea how to begin learning.

I myself learned in college. However, I've worked with many, many professionals who were self-taught. And outside of professionals, there are even more self-taught folks who put together useful software as a hobby.

If you're serious about wanting to learn programming, there are tons of resources out there. My advice would be to have some personal project, some software that you want to exist, and focus your learning energies on what's needed for that goal. Also, once you have a more specific idea than "programming", you can get more specific advice. ;)


I just want to understand what emk was typing about...

Honestly, I don't know if I really want to be a programmer but I feel that if I understood enough information I would be able to make what I needed.
Why did I decide to become a polyglot when I knew I am a super lazy student?!?

Anki or Die

: 700 / 10000 Mine 10,000+ Spanish Sentences
: 700 / 10000 Mine 10,000+ French Sentences
: 80 / 10000 Anki All Of It


Postby language2015 » Tue Nov 24, 2015 4:57 am

emk wrote:
Anyway, here's one more bit of progress, a "review mode". This will eventually look much different, and include some English subs, too:

lang-substudy-review-mode-first-draft.png

Obviously, I want to add the English subs, too. The idea is that I could use this page to skim through an episode, looking up anything I'm curious about, and replaying specific lines. Now that we have the ability to align subtitles and extract small media clips, there are a lot of interesting possibilities!


Yeah that looks super baller.

Let's talk business, man. How many hours would it take for you to build this into an app or Anki add-on?

Then let's talk money.

This app has potential, I can feel it. This is funny because I remember some guy's post about learning more efficiently. I didn't think he would be able to reduce the hours needed to learn a language by any meaningful percentage, but now I think I was horribly wrong. Combining fun native content with SRS in this manner seems deadly. Deadly and revolutionary.

emk
Black Belt - 1st Dan
Posts: 1622
Joined: Sat Jul 18, 2015 12:07 pm
Location: Vermont, USA
Languages: English (N), French (B2+)
Badly neglected "just for fun" languages: Middle Egyptian, Spanish.
Language Log: viewtopic.php?f=15&t=723
x 6337

Postby emk » Tue Nov 24, 2015 9:29 pm

Thank you, everybody, for the kind words! I had a chance to do a little more work before bed and after lunch:

lang-substudy-review-bilingual.png

Clicking on the "Play" button plays the audio clip, not a video clip. I tried video clips with subs2srs and they just took up a lot of extra space without actually helping any.

lang-substudy-review-bilingual-2.png

Each of the image + dialog sets would form the heart of a flash card. The front of the card would have:

  • The image.
  • The audio (with enough padding to give context and work around alignment errors).

The back would have:

  • The Spanish subtitle (plus one line before & after for context).
  • The English subtitle (plus one line before & after for context).
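
The card layout described above could be modelled roughly like this in Rust. This is only a sketch: the type names, field names and the `context_at` helper are assumptions made for illustration, not taken from substudy's source.

```rust
/// One subtitle line together with its neighbours, for context.
#[derive(Debug, Clone, PartialEq)]
pub struct Context {
    pub prev: Option<String>,
    pub current: String,
    pub next: Option<String>,
}

/// Hypothetical flash card: image and audio on the front,
/// both subtitle tracks on the back.
#[derive(Debug, Clone, PartialEq)]
pub struct Card {
    pub image_file: String,
    pub audio_file: String,
    pub foreign: Context,
    pub native: Context,
}

/// Build the context for line `i` of a subtitle track, taking one
/// line before and one line after when they exist.
/// Returns `None` if `i` is out of range.
pub fn context_at(lines: &[&str], i: usize) -> Option<Context> {
    let current = lines.get(i)?.to_string();
    let prev = i
        .checked_sub(1)
        .and_then(|p| lines.get(p))
        .map(|s| s.to_string());
    let next = lines.get(i + 1).map(|s| s.to_string());
    Some(Context { prev, current, next })
}
```

The first and last lines of an episode simply get `None` for the missing neighbour, so the card back can show whatever context is available.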
The big challenge at the moment is figuring out how to use ffmpeg to perform mass subtitle image and audio extraction efficiently. Certain versions of ffmpeg seem to be very ill-behaved: they rapidly gobble up 10 GB of memory and then kill my machine. The newest version seems to work better, but it's still too slow. ffmpeg has various options for quickly moving to the right part of a file, which might be worth investigating later.
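
One of ffmpeg's seeking options is to put `-ss` before `-i`, which makes ffmpeg seek in the input instead of decoding and discarding everything up to the start time. The sketch below only builds argument lists for that style of call. The function names are made up for illustration, and nothing in the thread says this is the approach substudy ended up using.

```rust
/// Arguments for extracting one audio clip.
/// `-ss` before `-i` requests input seeking; `-t` is the clip
/// duration in seconds; `-vn` drops the video stream.
pub fn audio_clip_args(input: &str, start: f64, duration: f64, output: &str) -> Vec<String> {
    vec![
        "-ss".to_string(),
        format!("{:.3}", start),
        "-i".to_string(),
        input.to_string(),
        "-t".to_string(),
        format!("{:.3}", duration),
        "-vn".to_string(),
        output.to_string(),
    ]
}

/// Arguments for grabbing a single still frame at time `at`.
/// `-frames:v 1` stops after one video frame has been written.
pub fn still_image_args(input: &str, at: f64, output: &str) -> Vec<String> {
    vec![
        "-ss".to_string(),
        format!("{:.3}", at),
        "-i".to_string(),
        input.to_string(),
        "-frames:v".to_string(),
        "1".to_string(),
        output.to_string(),
    ]
}
```

The resulting vector can be handed to `std::process::Command::new("ffmpeg").args(...)`. Running one ffmpeg process per clip is the simplest arrangement, though not necessarily the fastest.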
Serpent
Black Belt - 3rd Dan
Posts: 3657
Joined: Sat Jul 18, 2015 10:54 am
Location: Moskova
Languages: heritage
Russian (native); Belarusian, Polish

fluent or close: Finnish (certified C1), English; Portuguese, Spanish, German, Italian
learning: Croatian+, Ukrainian; Romanian, Galician; Danish, Swedish; Estonian
exploring: Latin, Karelian, Catalan, Dutch, Czech, Latvian
x 5179

Postby Serpent » Tue Nov 24, 2015 11:53 pm

emk, arthaey et al: have you thought of offering a service where you simply run these tools (like subs2srs) for someone and they get an anki deck or whatever the actual output is?
LyricsTraining now has Finnish and Polish :)
Corrections welcome

peterbeischmidt
Yellow Belt
Posts: 62
Joined: Tue Oct 06, 2015 8:25 am
x 105


Postby peterbeischmidt » Wed Nov 25, 2015 1:38 am

Serpent wrote:emk, arthaey et al: have you thought of offering a service where you simply run these tools (like subs2srs) for someone and they get an anki deck or whatever the actual output is?


Would you upload a 2 GB video file to a website in order to get a bunch of flash cards? ;)

Jokes aside, this seems very impractical. You would probably also be breaking copyright laws with every video you process.

But given what has been done in the browser, maybe some JavaScript ninja can compile ffmpeg to JavaScript and simply do the entire job in the browser without having to transfer any files?
arthaey
Brown Belt
Posts: 1080
Joined: Sat Jul 18, 2015 9:11 pm
Location: Seattle, WA, USA
Languages: :
EN (native);
ES (adv receptive, int productive);
FR (false beginner);
DE (lapsed beg);
ASL (lapsed beg);
HU (tourist)
Language Log: viewtopic.php?f=15&t=3864&view=unread#unread
x 1675

Postby arthaey » Wed Nov 25, 2015 1:55 am

peterbeischmidt wrote:compile ffmpeg to JavaScript

:shudder: NOT IT! ;)
Posts in: FrenchGermanHungarianSpanish
NaNoWriMo: 10,000 words
Corrections welcome in any language; I prefer an informal register.


Postby emk » Wed Nov 25, 2015 3:05 am

Serpent wrote:emk, arthaey et al: have you thought of offering a service where you simply run these tools (like subs2srs) for someone and they get an anki deck or whatever the actual output is?

peterbeischmidt wrote:Would you upload a 2 GB video file to a website in order to get a bunch of flash cards? ;)

Yeah, the two big problems with a web-based solution would be upload bandwidth and copyright laws. I really wish there were a way to prepackage this stuff. But I'm making all my code available, without restrictions, so if anybody can figure out a way to make this simpler, they're welcome to try.

Some more progress, this time with El laberinto del fauno:

lang-substudy-review-pan.png

Since my last post, I've figured out ffmpeg. Before, it took 2 or 3 hours to extract images and audio clips from an episode of Avatar, and ffmpeg was gobbling up over 10 GB of RAM. Now I can process an episode in 30 to 45 seconds, and El laberinto took a bit under two and a half minutes.

This "review" format is actually a really nice way to go through an episode or movie and refresh my memory of my old Anki cards. It's sort of halfway between Anki and watching with bilingual subs.

I need to polish off a few more rough edges, but I'll release a new version soon. (Now that I think about it, it would also be possible to create a local, browser-based video player that allowed me to watch a video straight through, and to mark individual subtitles for export to Anki as I watched. But that would take more work.)

Postby emk » Wed Nov 25, 2015 12:39 pm

Version 0.2.0 is released, with support for the "review" format you see above!

substudy export review avatar_01_01.mkv avatar_01_01.es.srt avatar_01_01.en.srt

This will create a directory named "avatar_01_01_review" containing an "index.html" file you can open in your browser. To get this to work, you'll need to install a recent version of ffmpeg—older versions of ffmpeg, as well as any versions of the libav fork, may or may not fail horribly and gobble up all your memory.

So at this point, substudy does two useful things: (a) it generates bilingual srt files for use with various media players, and (b) it creates "review" pages where you can skim through the subs and listen to individual clips. It would be pretty easy to add an "anki" export mode that wrote out an entire Anki deck. CSV format would be easiest, but it would be very user-friendly if we could reverse-engineer the Anki deck format ".apkg" and just generate an importable deck directly.
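
A minimal sketch of what one row of such a text export might look like. It uses tab-separated fields instead of comma-separated ones to sidestep CSV quoting, and relies on Anki's `[sound:...]` and `<img src="...">` media references. The function names and the four-field layout are assumptions for illustration, not substudy's actual exporter.

```rust
/// Make a subtitle safe for a tab-separated row: tabs become
/// spaces and line breaks become HTML `<br>` tags.
fn clean(field: &str) -> String {
    field
        .replace('\t', " ")
        .replace("\r\n", "\n")
        .replace('\n', "<br>")
}

/// Build one importable row: image, audio, foreign-language
/// subtitle, native-language subtitle.
pub fn anki_row(image_file: &str, audio_file: &str, foreign: &str, native: &str) -> String {
    let fields = [
        format!("<img src=\"{}\">", image_file),
        format!("[sound:{}]", audio_file),
        clean(foreign),
        clean(native),
    ];
    fields.join("\t")
}
```

For the media to show up, the image and audio files would also have to be copied into the Anki collection's media folder, and HTML in fields would need to be allowed at import time.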

Oh, and I'm looking for a Mac-based developer (or comfortable terminal user) who actually wants to try and follow the installation instructions, to see if we can get a Mac version working. I'll also try to set up Travis CI to automatically compile each release on the Mac, to make sure I don't break anything once we get it working.

EDIT: It compiles on the Mac! I'm using Travis CI to automatically verify that the Mac build is working. You'll still need to install brew and cmake to build it, and ffmpeg to use the media features. But if anybody figures it out, we can write up a step-by-step guide for Mac users.

Postby arthaey » Wed Nov 25, 2015 8:40 pm

emk wrote:Version 0.2.0 is released, with support for the "review" format you see above!

Woohoo! :)

emk wrote:

substudy export review avatar_01_01.mkv avatar_01_01.es.srt avatar_01_01.en.srt

Yay for simple usage!

emk wrote:you'll need to install a recent version of ffmpeg—older versions of ffmpeg, as well as any versions of the libav fork, may or may not fail horribly and gobble up all your memory.

What versions work for you? What versions caused the horribleness? It'd be nice to know beforehand if what I already have installed is sufficient.

emk wrote:It would be pretty easy to add an "anki" export mode that wrote out an entire Anki deck. CSV format would be easiest, but it would be very user-friendly if we could reverse-engineer the Anki deck format ".apkg" and just generate an importable deck directly.

You seem to be on a roll here, so I don't want to duplicate your dev efforts. But if you're not actually planning on doing the CSV export today, I may take a crack at it later.

I disagree on the .apkg format being better, though. I almost always want to import data into an existing deck, not create a new one. (Having both options would be best, of course.)

emk wrote:if anybody figures it out, we can write up a step-by-step guide for Mac users.

Now that substudy generates these review pages, I'm sufficiently interested to install it. Probably later today. I'll take notes as I go!

Postby emk » Wed Nov 25, 2015 10:44 pm

arthaey wrote:
emk wrote:

substudy export review avatar_01_01.mkv avatar_01_01.es.srt avatar_01_01.en.srt

Yay for simple usage!

Thanks! My theory is that this shouldn't have a zillion command-line flags. It's willing to go to some pretty ridiculous lengths to figure things out for you. If your data isn't in UTF-8 format, for example, it will run an encoding detector and convert automatically if possible. Or if your video file contains multiple audio tracks tagged with different languages, it's soon going to run Google's human language detector on the subtitles (which is extremely accurate given that much text) and try to pick the right track by default. I'm working on that right now. Why have these computer things if they're just going to ask a hundred incomprehensible questions? :-)
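
The track-picking idea can be sketched as follows, assuming the language detector has already produced a language code for the subtitles and that the audio tracks are tagged using the same code scheme. `AudioTrack` and `pick_track` are hypothetical names, not substudy's API.

```rust
/// Hypothetical description of one audio track in a video file.
#[derive(Debug, Clone, PartialEq)]
pub struct AudioTrack {
    pub index: usize,
    /// Language tag from the container, if any (e.g. "es").
    pub language: Option<String>,
}

/// Prefer the first track whose tag matches the detected subtitle
/// language; otherwise fall back to the first track in the file.
/// Returns `None` only when there are no audio tracks at all.
pub fn pick_track<'a>(tracks: &'a [AudioTrack], detected: &str) -> Option<&'a AudioTrack> {
    tracks
        .iter()
        .find(|t| t.language.as_deref() == Some(detected))
        .or_else(|| tracks.first())
}
```

Falling back to the first track keeps the tool usable on files with missing or wrong language tags, which fits the goal of not asking the user questions.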

As for the ffmpeg version: avoid libav in favor of real ffmpeg. Version 2.8.x should be pretty safe. You can try whatever you've got, but keep a process monitor open and kill ffmpeg if it tries to eat 2 GB of RAM.

arthaey wrote:You seem to be on a roll here, so I don't want to duplicate your dev efforts. But if you're not actually planning on doing the CSV export today, I may take a crack at it later.

The final version of the CSV export code will involve a somewhat messy refactoring to support multiple exporters. I'm not sure that I want to inflict that on somebody else. :-)

But if you want to hack together a simple CSV export that piggybacks on the "review" exporter, it would be a great warmup exercise. To do this, you would want to look up BurntSushi's awesome Rust csv library, add an entry for it to the Cargo.toml file, and then add an 'extern crate csv;' line in src/lib.rs. Then go to src/export/mod.rs, add a 'use csv;' near the top, and just hack the CSV output directly into the big 'export' function without worrying about the design.

If you want to import into Anki, you'll probably also want to change 'grow(0.5, 0.5)' to 'grow(1.5, 1.5)' to get a bit more context around each clip. You may also want to replace the 'index' in the audio and video file names with a timestamp based on the start time of the audio/video. This will prevent name clashes in Anki by making the file names unique.
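
To make those two tweaks concrete, here is a standalone sketch. The `Period` type and its `grow` method are a guess at the shape of the real code, based only on the 'grow(0.5, 0.5)' call mentioned above, and `clip_file_name` is a hypothetical helper.

```rust
/// A time span in seconds (hypothetical stand-in for whatever
/// type substudy's 'grow' is defined on).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Period {
    pub begin: f64,
    pub end: f64,
}

impl Period {
    /// Extend the span by `before` seconds at the start and `after`
    /// seconds at the end, never starting earlier than 0.
    pub fn grow(&self, before: f64, after: f64) -> Period {
        Period {
            begin: (self.begin - before).max(0.0),
            end: self.end + after,
        }
    }
}

/// Name a clip after its start time instead of its index, so that
/// clips from different exports of the same episode do not clash.
pub fn clip_file_name(stem: &str, start: f64, ext: &str) -> String {
    let millis = (start * 1000.0).round() as u64;
    let h = millis / 3_600_000;
    let m = millis / 60_000 % 60;
    let s = millis / 1_000 % 60;
    let ms = millis % 1_000;
    format!("{}_{:02}h{:02}m{:02}s{:03}.{}", stem, h, m, s, ms, ext)
}
```

Including the episode's file stem in the name matters as much as the timestamp, since two different episodes will often have a subtitle starting at the same moment.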

If you get stuck, the main tutorial / manual provided on the Rust site is excellent. And I'm always happy to answer questions, though I reserve the right to do so in French sometimes in your case. :-) I'm also happy to do remote pair programming to help explain the code and show off Rust tricks.

I'm honestly not sure if this is an ideal project to merge--too much cleanup will need to happen elsewhere--but it's a great warmup exercise. It's up to you if you want to take it on.

Also, this will be much easier if you have some prior knowledge of a language like C or C++ (at least heap versus stack, and pointers).

Oh, yeah: The Rust borrow checker will hurt (so good) for the first week or two, but once you make friends with it and stop picking fights, it mostly fades into the background like a badass guardian angel who watches over your code for vile, vile memory ownership bugs, while still letting you get all the speed of a true low-level language. Just don't pick fights with it if you can avoid doing so. :-)