substudy: Make Anki cards and other resources from video & bilingual subtitles (command-line)

Post by **emk** » Sat Dec 17, 2016 8:51 pm

rdearman wrote: Better yet would be for substudy to generate a set of files using the Pimsleur algorithm which could be burned to disk and played in the car.

I had a look at the Gradint page. It looks cool, but there are so many buttons!

The design goal for substudy is to make intelligent decisions for you, with as few customization knobs as I can manage (without being obnoxiously inflexible). So I'd lean towards adding a new "substudy export <format>" command. Probably "substudy export tracks" is the closest existing command for your purposes.

As for the actual ordering of audio on the tape, I'm not convinced that SRS-style spacing would add all that much, because TV episodes tend to be short (once the non-dialog audio has been stripped), and you should probably just make one "album" per TV episode and then review entire episodes at gradually increasing intervals. But if you've seen significant wins with another format, I'd be interested in hearing about your experiences. And you may get enough "SRS"-style repetition just from the natural repetition between episodes.
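To make "review entire episodes at gradually increasing intervals" concrete, here is a minimal sketch that prints replay dates for one episode album. The function name, the one-day first gap, and the doubling factor are all illustrative assumptions; none of them come from substudy.

```python
from datetime import date, timedelta


def review_dates(first_listen, reviews=5, first_gap_days=1, factor=2.0):
    """Return the dates on which to replay one episode "album".

    Each gap is `factor` times the previous one. All defaults are
    illustrative assumptions, not values used by substudy.
    """
    dates = []
    gap = first_gap_days
    day = first_listen
    for _ in range(reviews):
        day = day + timedelta(days=round(gap))
        dates.append(day)
        gap *= factor
    return dates


if __name__ == "__main__":
    for d in review_dates(date(2016, 12, 17)):
        print(d.isoformat())
```

With the defaults, the gaps are 1, 2, 4, 8 and 16 days, so a whole season stays manageable: each episode only comes back a handful of times.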

I've also wanted to try an audio-only format that plays each clip as "foreign / native / foreign again" and then moves to the next clip.
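The "foreign / native / foreign again" ordering can be sketched as a plain playlist builder. This assumes you already have one foreign-language audio file and one native-language audio file per clip; the file names and the function name are hypothetical, and substudy does not currently produce this format.

```python
def sandwich_playlist(clips):
    """Build a play order of foreign, native, foreign for each clip.

    `clips` is a list of (foreign_audio, native_audio) file names in
    episode order. Both files per clip are assumed to exist already.
    """
    playlist = []
    for foreign, native in clips:
        playlist.extend([foreign, native, foreign])
    return playlist


if __name__ == "__main__":
    clips = [("clip001_fr.mp3", "clip001_en.mp3"),
             ("clip002_fr.mp3", "clip002_en.mp3")]
    for name in sandwich_playlist(clips):
        print(name)
```

The resulting list could be written out as an .m3u file or fed to any player that accepts a file list.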

Post by **rdearman** » Mon Dec 19, 2016 4:24 pm

emk wrote: I've also wanted to try an audio-only format that plays each clip as "foreign / native / foreign again" and then moves to the next clip.

I think you've described much more succinctly what I actually wanted.

Post by **masone** » Thu Jan 12, 2017 1:28 am

For Windows people, you can do something similar with Sony Vegas. This was the end result.

I haven't done this in over a year, so excuse me if my description is off. It's based off a faded memory.

I had an mp4 file in Vegas. Then I believe I used a Vegas plugin called Vegasaur to import the srt file of that mp4, which automatically created text events with the subtitle text at the timestamps in the srt file. Then I imported the srt file a second time, I believe using another part of Vegasaur (or a component built into Vegas), which then split the video file at the timestamps in the srt file. Then I exported only the parts of the video that had text events, which gave me only the dialogue. So I had a folder full of 1-3 second long videos. Then I'd study them however I saw fit. It actually worked quite well, and I'm going to start doing it again now that I'm learning again.
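For anyone without Vegas, the "split the video at the timestamps in the srt file" step starts with reading the cue times. Here is a minimal sketch that pulls (start, end) pairs in seconds out of .srt text. It only looks at the timing lines and assumes the usual `HH:MM:SS,mmm --> HH:MM:SS,mmm` layout; the function name is made up for illustration.

```python
import re

# One timing line looks like: 00:00:01,500 --> 00:00:03,250
_TIME = r"(\d+):(\d\d):(\d\d),(\d\d\d)"
_CUE = re.compile(_TIME + r"\s*-->\s*" + _TIME)


def srt_ranges(srt_text):
    """Return a list of (start_seconds, end_seconds), one per cue."""
    ranges = []
    for line in srt_text.splitlines():
        m = _CUE.search(line)
        if m is None:
            continue  # cue numbers, subtitle text and blank lines
        h1, m1, s1, ms1, h2, m2, s2, ms2 = (int(g) for g in m.groups())
        start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
        end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
        ranges.append((start, end))
    return ranges


if __name__ == "__main__":
    sample = "1\n00:00:01,500 --> 00:00:03,250\nBonjour !\n"
    print(srt_ranges(sample))
```

Each pair can then be handed to whatever tool does the actual cutting.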

You can easily add L1 above the L2 in its own track in Vegas using this method. It'd look like the image from the OP.

Not quite as comprehensive as emk's program, but maybe someone more adept can expand on this. Take up the baton, so to speak. I'll be doing this later on so if I misremembered anything, I'll be sure to correct it.

Post by **emk** » Fri Jan 27, 2017 8:26 pm

I mentioned this in my log, but I realized I should mention it again in the substudy thread.

emk wrote: A substudy note

At the moment, I'm looking for a largish archive of subtitles in VobSub format. These are usually two files, one named "*.sub" and one named "*.idx", and they're an image-based subtitle format, not a text-based format like the *.srt files you normally use with Subs2SRS or substudy. I have some ideas on how these could be more efficiently converted into *.srt format, but I would need lots of examples to test. Does anybody have a bunch of these lying around, or know where I could find them?
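For readers who have never opened one: the "*.idx" half of a VobSub pair is a small text file, and the timing information sits on lines of the form `timestamp: 00:00:01:500, filepos: 000000000`. The sketch below reads just those lines. It rests on two assumptions: that layout, and that `filepos` is a hexadecimal byte offset into the "*.sub" file. It does not decode the subtitle images, and the function name is hypothetical.

```python
import re

# Assumed layout of a timing line in an .idx file:
#   timestamp: HH:MM:SS:mmm, filepos: <hex offset into the .sub file>
_IDX_LINE = re.compile(
    r"^timestamp:\s*(\d+):(\d+):(\d+):(\d+),\s*filepos:\s*([0-9a-fA-F]+)"
)


def idx_entries(idx_text):
    """Return (start_seconds, byte_offset) for each timestamp line."""
    entries = []
    for line in idx_text.splitlines():
        m = _IDX_LINE.match(line.strip())
        if m is None:
            continue  # comments, palette, language headers, etc.
        h, mi, s, ms = (int(m.group(i)) for i in range(1, 5))
        seconds = h * 3600 + mi * 60 + s + ms / 1000
        entries.append((seconds, int(m.group(5), 16)))
    return entries


if __name__ == "__main__":
    sample = "id: fr, index: 0\ntimestamp: 00:00:01:500, filepos: 000000000\n"
    print(idx_entries(sample))
```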

I have a vision for making substudy easy to use and more effective for both beginners and advanced students—some day!—but there are a lot of tricky little pieces that will need training data. Any help you can provide would be appreciated!

Also a periodic reminder for Windows users: Subtitle Edit is free and open source, and it does a brilliant job of turning dodgy subtitles into good ones.

Post by **Stefan** » Sat Jan 28, 2017 9:37 am

I believe I've got .sub/.idx for 10 movies if it would be of interest?

Post by **emk** » Sat Jan 28, 2017 1:47 pm

Stefan wrote: I believe I've got .sub/.idx for 10 movies if it would be of interest?

That would be great!

Anybody who has some sub/idx files to share with me is welcome to upload them here. Please let me know in advance if you're going to upload more than a gigabyte!

There's a problem with the Subs2SRS/substudy approach that tuckamore describes well:

tuckamore wrote: I had used Subs2srs in the past with huge success. Now, I'm toeing the line from intermediate to advanced (or at least I'd like to think I am) and, if I were to use audio+subs+Anki, I would have the specific goal of mining native audio for specific vocabulary that I want to hear in different contexts. So, as you say, I would "need to prepare far more video for the same results", and this is where I'm questioning the practicality of my proposal.

When you're just starting out, you can usually find TV series or films where there's something new and interesting in almost every line of dialog. So spending a couple hours to rip a few DVDs, OCR the subtitles, and load everything into Anki makes sense. But when you reach a higher level, only a small fraction of the dialog is interesting. (I would maybe get 10 cards out of an episode of Buffy at this point. More if I went with Engrenages or Le Trône de fer, of course!) Even if I could somehow prepare a deck of cards from a film with a single button press, it would still be too much work just to go through them and delete 95%.

So I'm making no promises. But I think the general idea of audio cards made from native media is still very useful all the way to C1 and beyond. It's just that the process of making cards needs to be vastly simplified for it to be worth the effort. In a perfect world, I'd love to have a tool that does for video what readlang does for text, except with a spaced repetition system as the central feature. Watch your videos normally on your computer, and when you miss a bit of dialog, hit a button and rewind, then make a card with a single click.
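The "hit a button and rewind" step boils down to finding the subtitle cue that was playing, or had just finished, at the current playback position. A minimal sketch, assuming the cues are already loaded as (start, end, text) tuples sorted by start time; the function name and data layout are illustrative only.

```python
from bisect import bisect_right


def cue_for_position(cues, position):
    """Return the latest cue that started at or before `position`.

    `cues` is a list of (start_seconds, end_seconds, text) sorted by
    start time. Returns None if playback has not reached the first cue.
    """
    starts = [cue[0] for cue in cues]
    i = bisect_right(starts, position)
    return cues[i - 1] if i > 0 else None


if __name__ == "__main__":
    cues = [(1.0, 2.5, "Bonjour."), (4.0, 6.0, "Ca va ?")]
    print(cue_for_position(cues, 3.0))  # the line that was just missed
```

A player could seek back to the returned cue's start time and use its start and end as the clip boundaries for the new card.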

However, I do not currently believe there is much money in this (for example, readlang was an enormous amount of work and I understand it wasn't very profitable in relation to the effort invested), so I don't plan to build the whole thing any time soon. Here are the different things I imagine would need to happen:

1. Video and subtitle ripping. Handbrake does an excellent job of this already, except that you need to fiddle around to get the subtitle tracks ripped alongside the video. You may also need a region-free DVD drive for your computer; I'm not sure.
2. Subtitle OCR. You can do this with Subtitle Edit (and a dozen other programs), but all require some manual fiddling, editing and cleanup. This step could be significantly better, and there are some nice juicy technical challenges here. I would really love to write an open source Rust library that advanced the state of the art in automatic subtitle OCR.
3. A video player GUI. I've made several partial sketches of this idea, including here (using tools that ultimately proved too eccentric and annoying) and here (which is an actual native, cross-platform app, solving a ton of design issues!).
4. Spaced repetition support. Anki is still the gold standard for this, but readlang had a tiny, built-in SRS tool that made it easy for people to get started. And indeed, I spend so much time explaining how to use Anki (delete! delete! automated card creation! don't fail more than a tiny handful of cards! delete leeches!) that it might be useful to have a simpler tool for beginners with some of this built in.
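As a rough illustration of how small a beginner-oriented SRS with the "delete leeches" advice built in could be, here is a sketch. It is not Anki's or readlang's scheduler: the doubling interval, the reset on failure, and the three-lapse leech limit are all assumptions chosen for brevity.

```python
from dataclasses import dataclass


@dataclass
class Card:
    text: str
    interval_days: int = 1
    lapses: int = 0


def review(card, remembered, leech_limit=3):
    """Update `card` after one review.

    Returns True to keep the card, or False when it has failed
    `leech_limit` times and should simply be deleted.
    """
    if remembered:
        card.interval_days *= 2  # see it again after twice as long
        return True
    card.lapses += 1
    card.interval_days = 1       # start over tomorrow
    return card.lapses < leech_limit


if __name__ == "__main__":
    card = Card("Je n'en reviens pas.")
    print(review(card, remembered=True), card.interval_days)
```

The point of returning False is that deletion becomes the default outcome for a troublesome card, rather than something the learner has to be talked into.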

Right now, I'm interested in the technical challenges of (2), and in the libraries and techniques needed to do a good job of (3). I'm not committing to solving the entire problem! But any sub/idx files you upload here will help motivate me to work on (2).

Post by **kaegi** » Sun Jan 29, 2017 12:55 pm

emk wrote: So I'm making no promises. But I think the general idea of audio cards made from native media is still very useful all the way to C1 and beyond. It's just that the process of making cards needs to be vastly simplified for it to be worth the effort.

I'm sorry for the self-advertisement, but I invested some time to create tools to make that process easier. First there is SubtitleMemorize, which is more or less subs2srs with more batch processing/integrated subtitle correction capabilities. Its most notable feature is automatic subtitle-to-subtitle correction, which figures out the best offsets of the subtitle and removes/introduces breaks to get the best alignment possible. I already announced it on this forum.

I don't think the cost-gain relationship is very good for movies, so I only ever Ankify series. I often found myself with correct English subtitles for the series and was able to find (in my case) Japanese translations for them online, but the timings were wrong. With SubtitleMemorize I can now download the whole batch of subtitles, load them into SubtitleMemorize at once, select "Align Sub2 to Sub1", press "Go", and get about 10000 perfect Anki flashcards for the series.

In Anki I use MorphMan, which sorts the cards by the number of unknown words in each and skips over cards with 0 unknown words. In my case, about 5% of my submem cards are mature, and MorphMan skips over 55% of the cards (this means I understand half of the dialog in that series....). This reordering makes studying _very_ effective. Example: take a 26-episode series with its 10000 cards. Even if we (automatically) discard 95% of the cards, we still have 500 cards left worth studying. The "cost" is finding the subtitles online, less than 10 minutes of manual work in submem, and the press of a button in MorphMan.
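The sort-and-skip idea can be sketched in a few lines. This is not MorphMan's implementation (MorphMan works on morphemes and tracks what you know from your Anki history); it is a simplified stand-in that assumes each card is already split into words and that you have a set of known words. All names are hypothetical.

```python
def study_queue(cards, known_words):
    """Order cards by how many unknown words they contain.

    `cards` is a list of (card_id, words). Cards with no unknown words
    are skipped; the rest are sorted with the easiest first, and ties
    are broken by card_id.
    """
    scored = []
    for card_id, words in cards:
        unknown = {w.lower() for w in words} - known_words
        if unknown:
            scored.append((len(unknown), card_id))
    scored.sort()
    return [card_id for _, card_id in scored]


if __name__ == "__main__":
    known = {"je", "suis"}
    cards = [
        ("a", ["Je", "suis"]),
        ("b", ["Je", "suis", "fatigué", "aujourd'hui"]),
        ("c", ["Je", "dors"]),
    ]
    print(study_queue(cards, known))
```

On the numbers above, keeping 5% of 10000 cards is 10000 × 0.05 = 500 cards, which is where the "500 cards left worth studying" figure comes from.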

Back to the subtitle-correction feature: SubtitleMemorize only handles .srt and .ass files and, even though the algorithm works wonderfully, it is bad from an engineering standpoint (no theoretical runtime limit, no underlying mathematical model). I'm currently developing a standalone Rust library/CLI binary that corrects these mistakes. The program is mostly done and I'm in the polishing phase, so I will release it soon-ish.

Possible workflow if that program is finished:

1. extract the video and .idx/.sub files from the DVD
2. download the [target language] and [native language] subtitles in .srt or .ass format from the internet (e.g. OpenSubtitles.org)
3. use "subalign" to automatically align the .srt/.ass subtitles to the .idx
4. use the video and corrected .ass/.srt files in SubtitleMemorize/substudy/subs2srs
5. sort/filter cards with MorphMan

None of these steps require per-episode work (if we call "subalign" in a bash-for-loop), so the manual work is about the same no matter the length of the series. The longer the series, the better the cost/gain relation.
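To give a feel for what aligning downloaded subtitles to the DVD timings involves, here is a deliberately naive sketch that only finds one constant shift. It is not the "subalign" algorithm, which also has to remove and introduce breaks; the function name, the 0.2 second tolerance, and the brute-force search are assumptions for illustration, and the search is far too slow for anything but short inputs.

```python
def best_offset(reference_starts, subtitle_starts, tolerance=0.2):
    """Return the constant shift (seconds) that lines up the most cues.

    Every difference between a reference start and a subtitle start is
    tried as a candidate shift. The winner is the shift that puts the
    most subtitle starts within `tolerance` of some reference start.
    """
    candidates = {
        round(r - s, 3) for r in reference_starts for s in subtitle_starts
    }
    best_matches, best_shift = 0, 0.0
    for shift in sorted(candidates):
        matches = sum(
            1
            for s in subtitle_starts
            if any(abs(s + shift - r) <= tolerance for r in reference_starts)
        )
        if matches > best_matches:
            best_matches, best_shift = matches, shift
    return best_shift


if __name__ == "__main__":
    dvd = [10.0, 15.0, 21.5, 30.0]        # e.g. start times read from the .idx
    downloaded = [7.5, 12.5, 19.0, 27.5]  # start times from the .srt
    print(best_offset(dvd, downloaded))
```

A single shift like this cannot cope with subtitles cut for a different edit of the episode, which is exactly the case where per-break correction matters.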

"subalign" is partly developed with an easy substudy integration in mind. If you don't mind, I can post a note in this thread when I publish it.

Post by **emk** » Sun Jan 29, 2017 2:34 pm

kaegi wrote: "subalign" is partly developed with an easy substudy integration in mind. If you don't mind, I can post a note in this thread when I publish it.

This is a very promising idea to fix alignment, and it should massively reduce the manual effort required! Please do let us know when it's released. I'd like to link to it from the substudy page.

kaegi wrote: Possible workflow if that program is finished:

1. extract the .idx/.sub files from the DVD
2. download the [target language] and [native language] subtitles in .srt or .ass format from the internet (e.g. OpenSubtitles.org)
3. use "subalign" to automatically align the .srt/.ass subtitles to the .idx
4. use the corrected .srt files in SubtitleMemorize/substudy/subs2srs

My long-term ideal would be to replace all these steps with:

1. Insert DVD with subtitles.
2. Push "Play".

This is to support my watching of series like Le Trône de fer, where I mostly want to watch TV normally, but I'm still missing some chunk of the faster dialog. But to make this really seamless would require much better support for automatic idx/sub conversion. Anyway, it will give me something to work on during cold winter nights. :lol:

Post by **kaegi** » Sun Jan 29, 2017 8:15 pm

emk wrote: But to make this really seamless would require much better support for automatic idx/sub conversion. Anyway, it will give me something to work on during cold winter nights.

Do you plan to write a general OCR program (also for Chinese/Japanese/other scripts), or just for Latin script? To make that program work really seamlessly, it would have to distinguish 攤 (broaden) from 灘 (open sea), 懽 (rejoice) from 權 (authority) and hundreds of other complicated but visually similar characters even at a low resolution.

I think most readers throw context information into the mix when trying to recognize these characters in small text. I think this is where even state-of-the-art OCR programs possibly fail, though I didn't test it.

Post by **kaegi** » Sun Jan 29, 2017 8:29 pm

Ahh, I forgot to mention: On http://kitsunekko.net/ you can find thousands (tens of thousands?) of .srt/.ass subtitles in Japanese and English for (mostly) anime. On the main page you can download all the subtitles (2.4 GB), or you can mirror parts of the website with a simple wget query! With the right tools (?) you can probably automatically create a .sub version of them, apply your OCR program, and check how many words are still the same. This should yield a broad testing ground.
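The "check how many words are still the same" step could be scored roughly like this, using Python's standard `difflib`. This is only a sketch: the function name is made up, and splitting on whitespace is an assumption that suits space-separated languages. For Japanese you would want to compare character by character instead.

```python
from difflib import SequenceMatcher


def word_accuracy(original, ocr_output):
    """Fraction of the original's words that the OCR output kept in order."""
    a = original.split()
    b = ocr_output.split()
    if not a:
        return 1.0 if not b else 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(a)


if __name__ == "__main__":
    truth = "I am very tired today"
    ocr = "I arn very tired today"  # a typical m -> rn confusion
    print(word_accuracy(truth, ocr))
```

Averaging this score over a large archive would give a single number to track while tuning the OCR.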
