substudy: Make Anki cards and other resources from video & bilingual subtitles (command-line)

emk
Black Belt - 1st Dan
Posts: 1622
Joined: Sat Jul 18, 2015 12:07 pm
Location: Vermont, USA
Languages: English (N), French (B2+)
Badly neglected "just for fun" languages: Middle Egyptian, Spanish.
Language Log: viewtopic.php?f=15&t=723

Re: substudy: A tool for making bilingual subtitles (MacOS X or Linux, command-line)

Postby emk » Thu Nov 26, 2015 8:27 am

Because I can sometimes be a total show-off, especially when it comes to code:

Code: Select all

substudy export review avatar_01_01.mkv avatar_01_01.es.srt

…will produce a monolingual review page with Spanish audio and Spanish text.

Code: Select all

substudy export review avatar_01_01.mkv avatar_01_01.es.srt avatar_01_01.en.srt

…will produce a page with Spanish audio and bilingual subtitles.

Code: Select all

substudy export review avatar_01_01.mkv avatar_01_01.en.srt avatar_01_01.es.srt

…will produce a page for Spanish speakers learning English, with English audio and bilingual subtitles. All you have to do is flip the order of the subtitle files.

The trick here is that (1) the "mkv" video file contains both a Spanish and an English audio track, and (2) the program uses Google's compact language detector to figure out what language the subtitles are in. So assuming that you're studying a reasonably major language, and that your audio tracks have accurate language tags (which happens surprisingly often), you don't have to fiddle around with audio tracks: everything just works. And as I mentioned above, if your subtitle files are in non-standard character set encodings (which they often are), it will try to figure out the encoding automatically and convert to UTF-8.
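To make the encoding step concrete, here is a minimal Python sketch of the "guess the encoding, then convert to UTF-8" idea. It is not substudy's actual code: the function names and the candidate list (`utf-8-sig`, then `cp1252`) are assumptions chosen for illustration, and a real detector would look at the text statistically rather than just trying encodings in order.

```python
from typing import Iterable, Tuple


def decode_subtitles(
    raw: bytes, candidates: Iterable[str] = ("utf-8-sig", "cp1252")
) -> Tuple[str, str]:
    """Return (text, encoding) for the first candidate that decodes cleanly.

    The candidate list is a hypothetical default, not what substudy uses.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 maps every byte to a character, so it never fails,
    # although the result may be wrong for non-Western text.
    return raw.decode("latin-1"), "latin-1"


def to_utf8(raw: bytes) -> bytes:
    """Re-encode subtitle bytes of unknown encoding as UTF-8."""
    text, _ = decode_subtitles(raw)
    return text.encode("utf-8")
```

Trying strict UTF-8 first is a reasonable design choice because text in legacy encodings with accented characters is very rarely valid UTF-8 by accident.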

We're rapidly building up a library for working with subtitles, time periods, and media files. This opens up all kinds of possibilities for people to explore. For example, somebody could implement a mode that extracted the audio from a TV show and removed all the parts that didn't contain dialog. Or they could extend it to generate an audio file that played a section of dialog in Spanish, then replayed that section in English, then switched back to Spanish one final time, before moving on to the next section of dialog. I'm unlikely to explore all these possibilities myself, but if somebody else wants to implement them, I'd be interested in reading your language log! :-)
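As a rough illustration of the Spanish/English/Spanish idea, here is a small Python sketch that only builds the play order. Everything in it is hypothetical (the `Segment` type, the function name, the language tags as defaults), and it says nothing about how the audio would actually be cut or joined.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Segment(NamedTuple):
    track: str    # language tag of the audio track to play (assumed naming)
    start: float  # start time in seconds
    end: float    # end time in seconds


def sandwich_playlist(
    sections: Iterable[Tuple[float, float]],
    foreign: str = "es",
    native: str = "en",
) -> List[Segment]:
    """Play each dialog section as foreign, native, foreign, then move on."""
    playlist = []
    for start, end in sections:
        for track in (foreign, native, foreign):
            playlist.append(Segment(track, start, end))
    return playlist
```

Each section of dialog simply appears three times in a row, with only the audio track changing.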

arthaey
Brown Belt
Posts: 1080
Joined: Sat Jul 18, 2015 9:11 pm
Location: Seattle, WA, USA
Languages: EN (native); ES (adv receptive, int productive); FR (false beginner); DE (lapsed beg); ASL (lapsed beg); HU (tourist)
Language Log: viewtopic.php?f=15&t=3864&view=unread#unread

Postby arthaey » Thu Nov 26, 2015 12:13 pm

Just wanted to pop in for a quick note to let you know that I didn't get a chance to look at the code today/yesterday.

Instead, I finished splicing together all the YouTube video snippets of "La Légende du Roi Arthur" into a single 2-hour video. When I wake up, I'll be doing subtitles for it. Oh, and Thanksgiving. ;)

So I relinquish CSV export back into your capable hands, emk! Amazing work so far. :)

Postby emk » Fri Nov 27, 2015 3:25 pm

arthaey wrote:Instead, I finished splicing together all the YouTube video snippets of "La Légende du Roi Arthur" into a single 2-hour video. When I wake up, I'll be doing subtitles for it. Oh, and Thanksgiving. ;)

Sounds like you'll have some great video and subtitles soon!

I am pleased to announce that substudy 0.3.0 has been released, adding support for CSV export! Here's what it looks like in Anki:

lang-substudy-csv-anki-import.png

And here are the currently available commands:

Code: Select all

Subtitle processing tools for students of foreign languages

Usage: substudy clean <subs>
       substudy combine <foreign-subs> <native-subs>
       substudy export csv <video> <foreign-subs> [<native-subs>]
       substudy export review <video> <foreign-subs> [<native-subs>]
       substudy list tracks <video>
       substudy --help

For now, all subtitles must be in *.srt format. Many common encodings
will be automatically detected, but try converting to UTF-8 if you
have problems.

For example, I can run:

Code: Select all

substudy export csv avatar_01_01.mkv avatar_01_01.es.srt avatar_01_01.en.srt

…and this will create a directory containing a "cards.csv" file and a bunch of media files, which you can import into Anki following the old subs2srs import instructions, except that you don't need to reverse the "Sound" and "Source" fields. At some point, it would be really nice to add support for decks in "apkg" format, which I think you can just double-click and import directly into your existing Anki collection without messing with card templates or CSV import. But the "apkg" format is trickier. Or perhaps somebody wants to write an Anki plugin that calls substudy for the user, and sets up all the templates, etc., automatically?
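For anyone curious what a CSV export of this kind involves, here is a minimal Python sketch. The column layout (sound, foreign text, native text) and the dictionary keys are assumptions for illustration only and are not the real "cards.csv" layout. The `[sound:...]` tag is Anki's own syntax for referencing an audio file in a field.

```python
import csv
import io
from typing import Dict, Iterable, TextIO


def write_cards(rows: Iterable[Dict[str, str]], out: TextIO) -> None:
    """Write one CSV row per card: sound tag, foreign text, native text.

    The three-column layout is hypothetical, not substudy's actual format.
    """
    writer = csv.writer(out)
    for row in rows:
        writer.writerow([f"[sound:{row['audio']}]", row["foreign"], row["native"]])


# Usage: write to an in-memory buffer instead of a real cards.csv file.
buffer = io.StringIO()
write_cards(
    [{"audio": "a.mp3", "foreign": "Hola, amigo", "native": "Hello friend"}],
    buffer,
)
```

Using the `csv` module rather than joining strings by hand matters here because subtitle text regularly contains commas, quotes and line breaks.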

You may notice, however, that my command-line interface is far simpler than the subs2srs GUI, mostly because I just go ahead and choose good defaults, based on doing many thousands of subs2srs reps. But I also remove some of the "cleanup" options, which I feel are much better done using a real subtitle editor like SubtitleEdit.

It's now a lot easier to write exporters, by the way:

I'm really tempted to add an exporter that chops an episode up into a bunch of MP3 files, with maybe 1 file per conversation, for playing in the car or when walking. A more ambitious developer could create different formats: Mixed L1 and L2 audio, or L2 audio with spaces in between, to make it easier to echo back what you're hearing.

Postby emk » Sun Nov 29, 2015 5:00 am

Does anybody remember how Khatzumoto, back in his early AJATT days, said that he liked to rip the soundtrack of movies and TV shows to MP3 files, and then listen to interesting material on a loop? I've written a few scripts to do this over the years, but none of them were entirely satisfactory, because there's often a ton of "dead" space in between dialog, with car chases, explosions, etc. I've always wished I could just extract the parts with actual dialog.

Check out these audio files generated by the latest prototype of substudy:

lang-substudy-export-as-tracks.png

This was generated using:

Code: Select all

substudy export tracks avatar_01_06.mkv avatar_01_06.es.srt

Features:

  • Audio is broken into chunks of roughly 30 seconds or less, so you can use the "previous track" and "next track" buttons on your MP3 player.
  • Adjacent tracks are output as seamless-transition MP3s, so you don't hear any break or overlap in the middle of dialog.
  • Tracks are named after the first snippet of dialog they contain, and the "lyrics" field of the tracks is filled with the complete text (for those players that support it).

I'm going to test this for a few days before I release it, just to make sure I've chosen good parameters.
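The grouping described in the feature list can be sketched roughly as follows. This is a guess at the general idea in Python, not the prototype's algorithm: the 30-second limit as a hard cutoff, the tuple layout and the function names are all assumptions.

```python
from typing import List, Sequence, Tuple

Subtitle = Tuple[float, float, str]  # (start seconds, end seconds, text)


def group_into_tracks(
    subs: Sequence[Subtitle], max_len: float = 30.0
) -> List[List[Subtitle]]:
    """Group subtitles (sorted by start time) into tracks of at most max_len seconds.

    A single subtitle longer than max_len still gets a track of its own.
    """
    tracks: List[List[Subtitle]] = []
    current: List[Subtitle] = []
    for sub in subs:
        _, end, _ = sub
        # Start a new track if adding this subtitle would exceed the limit.
        if current and end - current[0][0] > max_len:
            tracks.append(current)
            current = []
        current.append(sub)
    if current:
        tracks.append(current)
    return tracks


def track_title(track: Sequence[Subtitle]) -> str:
    """Name a track after the first snippet of dialog it contains."""
    return track[0][2]


def track_lyrics(track: Sequence[Subtitle]) -> str:
    """Complete text of a track, one subtitle per line."""
    return "\n".join(text for _, _, text in track)
```

Because the cut points fall between subtitles, the dialog-free stretches between tracks can be dropped without splitting a line of dialog.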

rdearman
Site Admin
Posts: 7231
Joined: Thu May 14, 2015 4:18 pm
Location: United Kingdom
Languages: English (N)
Language Log: viewtopic.php?f=15&t=1836

Postby rdearman » Sun Nov 29, 2015 7:37 pm

Works very well. I've done a couple of small episodes from my Avengers cartoon DVD. I've done EN-FR, EN-IT, and even FR-IT. The most time-consuming part was using SubEdit to convert the subtitles into SRT format.

Postby emk » Sun Nov 29, 2015 11:13 pm

rdearman wrote:Works very well. I've done a couple of small episodes from my Avengers cartoon DVD. I've done EN-FR, EN-IT, and even FR-IT. The most time-consuming part was using SubEdit to convert the subtitles into SRT format.

Excellent! Somebody other than me actually used substudy and it worked? :-) Once you get reasonably efficient with Subtitle Edit, you can get through the entire process in under an hour, in exchange for 250 good Anki cards. Or if you have to spend more time fixing the subtitles, it's more-or-less actual language-learning time as you go through and match up the two languages against each other and the audio.

I'm going to be mostly away from the forum for the coming week (too much work, plus my Spanish is really repaying the effort I put in, and I have one big substudy idea I want to try, and...), but I'll try to drop by periodically and answer any questions.

The big limitation of the current version of substudy is that it focuses heavily on comprehension. But I know it could do more for activation and grammar, if it could support more MCD-like card formats. Maybe someday, if enough people are interested. :-)

Postby emk » Sun Dec 06, 2015 11:19 pm

Hmm, what could I do if I could convert subtitles into browser-friendly JSON data?

lang-substudy-json.png
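For readers who can't see the screenshot, here is a minimal Python sketch of what converting SRT subtitles into JSON could look like. The JSON field names (`index`, `start`, `end`, `text`) are assumptions and not substudy's actual output format. The sketch expects text that is already decoded and free of a byte-order mark, and it raises an error on malformed timestamps.

```python
import json
import re

# SRT timestamps look like 00:01:02,250 (hours:minutes:seconds,milliseconds).
TIME = re.compile(r"(\d+):(\d+):(\d+)[,.](\d{3})")


def to_seconds(stamp: str) -> float:
    """Convert an SRT timestamp to seconds."""
    hours, minutes, seconds, millis = TIME.fullmatch(stamp.strip()).groups()
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def srt_to_json(srt_text: str) -> str:
    """Parse SRT text and return a JSON array of subtitle objects."""
    subs = []
    # Subtitle blocks are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) < 3 or "-->" not in lines[1]:
            continue  # skip anything that is not a complete subtitle block
        start, end = lines[1].split("-->")
        subs.append(
            {
                "index": int(lines[0]),
                "start": to_seconds(start),
                "end": to_seconds(end),
                "text": "\n".join(lines[2:]),
            }
        )
    return json.dumps(subs, ensure_ascii=False)
```

Storing times as plain seconds makes them easy to compare against a video player's current playback position in the browser.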

That hurt a lot more than it should have, because my web server library didn't support "Content-Range:" headers, which are needed to serve up video files.
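Some background on why those headers matter: browsers fetch video with a `Range:` request header so they can seek without downloading the whole file, and the server answers with status 206 (Partial Content) and a matching `Content-Range:` header. The Python sketch below shows only the byte arithmetic for a single range. It is not the web server library mentioned in this post, and the function names are made up for illustration.

```python
import re
from typing import Optional, Tuple

# Matches a single byte range such as "bytes=0-99", "bytes=500-" or "bytes=-200".
RANGE = re.compile(r"bytes=(\d*)-(\d*)")


def resolve_range(header: str, total: int) -> Optional[Tuple[int, int]]:
    """Return inclusive (first, last) byte offsets, or None if unsatisfiable.

    Multi-range requests ("bytes=0-9,20-29") are not handled by this sketch.
    """
    match = RANGE.fullmatch(header.strip())
    if not match or total == 0:
        return None
    first, last = match.groups()
    if first == "" and last == "":
        return None
    if first == "":
        # Suffix form: the last N bytes of the file.
        length = int(last)
        if length == 0:
            return None
        return max(total - length, 0), total - 1
    start = int(first)
    if start >= total:
        return None
    # An open-ended or oversized range is clamped to the end of the file.
    end = min(int(last), total - 1) if last else total - 1
    if end < start:
        return None
    return start, end


def content_range(first: int, last: int, total: int) -> str:
    """Value for the Content-Range header of a 206 response."""
    return f"bytes {first}-{last}/{total}"
```

When `resolve_range` returns `None` for a syntactically valid header, the usual response is status 416 (Range Not Satisfiable).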

Let's see where I can take this. :-)

Postby rdearman » Sun Dec 06, 2015 11:40 pm

emk wrote:Hmm, what could I do if I could convert subtitles into browser-friendly JSON data?

lang-substudy-json.png

That hurt a lot more than it should have, because my web server library didn't support "Content-Range:" headers, which are needed to serve up video files.

Let's see where I can take this. :-)

For what purpose would this be useful?

Postby emk » Mon Dec 07, 2015 12:01 am

rdearman wrote:For what purpose would this be useful?

Because it's the first step to building a web-based UI for working with video and subtitles:

lang-substudy-stub-ui.png

The next challenge is capturing the video's current playback time, and highlighting the corresponding subtitle. After that, things get interesting, at least for me... :-)
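The lookup half of that challenge (given a playback time, which subtitle is on screen?) can be sketched with a binary search. This is an illustration in Python rather than the browser-side code, and it assumes the subtitles are sorted by start time and do not overlap. The function name and tuple layout are hypothetical.

```python
from bisect import bisect_right
from typing import Optional, Sequence, Tuple

Subtitle = Tuple[float, float, str]  # (start seconds, end seconds, text)


def active_subtitle(subs: Sequence[Subtitle], t: float) -> Optional[int]:
    """Return the index of the subtitle showing at time t, or None in a gap.

    Assumes subs is sorted by start time and subtitles do not overlap.
    """
    starts = [start for start, _, _ in subs]
    # Last subtitle whose start time is at or before t.
    i = bisect_right(starts, t) - 1
    if i >= 0 and t < subs[i][1]:
        return i
    return None
```

In a real player the list of start times would be built once rather than on every call, since the lookup runs each time the playback position changes.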

Postby emk » Mon Dec 07, 2015 11:47 am

emk wrote:The next challenge is capturing the video's current playback time, and highlighting the corresponding subtitle. After that, things get interesting, at least for me... :-)

And there we go:

lang-substudy-multiple-subs-small.png

Now, what if I could add a checkbox next to each subtitle, and then an "Export" button at the end of the movie?

I actually want a version of this tool for my French, too. I'd love to hide the subtitles until I missed a bit of dialog, then hit a "Replay" button to rewind a bit, turn on subs, and allow me to mark a subtitle or two for export. It would help with Le Trône de fer and Kaamelott, certainly—I still have listening gaps on that kind of "edgy" television, and the subs2srs card format is the single most effective (and efficient) tool I know about for intensive listening.

Oh, and the "Tracks" format that I mentioned earlier, which takes the dialog of a movie or episode and turns it into ~30-second MP3 files? It's proving really awesome, and I can just leave it looping in the background. Somehow, stripping out the long stretches of dialog-free audio actually makes it less distracting. I made it through Matando Cabos several times yesterday afternoon while working on other stuff, and it didn't bother me at all. On the second or third listen, I started picking out some small chunks of dialog, which isn't too bad for a movie I've never watched in a language that I've just started.

Some personal hypotheses driving this project

  1. Language learning doesn't start for real until you can watch TV for fun. (I'm sort of kidding, but not really. :-) )
  2. Most learners would benefit enormously from working with comprehensible native audio very early in the process.
  3. That weird effect where people memorize a top 40 song (including the singer's intonation) without even trying to do so should be used much more often in language learning.
  4. subs2srs-style cards are the single best tool I know for intensive listening. The only thing that comes close is trying to transcribe unknown audio, but I've done experiments with fast MC Solaar songs, and I can see a clear difference: The songs I transcribed manually, I understand fairly well. The songs I tackled with subs2srs-style cards, however, I often know down to the last syllable.

Anyway, it's fun playing around with this stuff, and trying out various experiments.