substudy: Make Anki cards and other resources from video & bilingual subtitles (command-line)

emk
Black Belt - 1st Dan
Posts: 1620
Joined: Sat Jul 18, 2015 12:07 pm
Location: Vermont, USA
Languages: English (N), French (B2+)
Badly neglected "just for fun" languages: Middle Egyptian, Spanish.
Language Log: viewtopic.php?f=15&t=723
x 6323

Post by emk » Thu Nov 19, 2015 4:52 pm


Homepage

substudy is a command-line tool for working with subtitles and video, similar to subs2srs. You can use it to make:

  • Anki cards with audio, bilingual text and images.
  • Bilingual subtitles.
  • MP3 audio tracks with only the dialog from a video.
  • "Review" pages showing the text of all the subtitles.
For an introduction and installation instructions, see the substudy page.

Original Post

Do you use MacOS X or Linux? Are you familiar with the command line? Would you like to watch TV with bilingual subtitles? If so, check this out.

Last year, I wrote a command-line tool called substudy, which should make it a lot easier to generate high-quality bilingual subtitle files. Available options include:

Code:

Subtitle processing tools for students of foreign languages

Usage: substudy clean <subtitles>
       substudy combine <foreign-subtitles> <native-subtitles>
       substudy --help

For now, all subtitles must be in *.srt format. Many common encodings
will be automatically detected, but try converting to UTF-8 if you
have problems.
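
The help text above says that many common encodings are detected automatically. As a rough illustration of the idea (not substudy's actual detection logic), here is a minimal Python sketch that tries strict UTF-8 first and then falls back to a short list of legacy encodings. The function name `decode_subtitles` and the fallback list are assumptions made for this example.

```python
def decode_subtitles(raw, fallbacks=("cp1252", "latin-1")):
    """Return (text, encoding_used) for raw subtitle bytes.

    Hypothetical helper: a simple fallback chain, not a real detector.
    """
    # A UTF-8 byte-order mark is unambiguous, so honour it first.
    if raw.startswith(b"\xef\xbb\xbf"):
        return raw[3:].decode("utf-8"), "utf-8-sig"
    # Strict UTF-8 rejects most legacy-encoded text, so a clean decode
    # is a strong hint that the file really is UTF-8.
    for encoding in ("utf-8",) + tuple(fallbacks):
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("could not decode subtitles with any candidate encoding")


if __name__ == "__main__":
    text, used = decode_subtitles("Mi abuela solía contarme historias".encode("cp1252"))
    print(used, text)
```

A fallback chain like this only guesses among the encodings you list, so converting a troublesome file to UTF-8 by hand remains the reliable fix.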

This has a lot of useful features, including:

  1. It automatically detects the encoding of the subtitle files, so you don't have to think (too much) about mixing and matching encodings.
  2. It tries to align and combine parallel subtitles, so that the two languages stay more-or-less in sync.
  3. It adjusts subtitle timing, so that subtitles appear earlier and stick around longer, giving you as much time to read and listen as possible.
  4. It cleans up crufty subtitles, removing sound effects, speaker names, and other common clutter.
  5. It's free and open source.
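
To make item 4 in the list above more concrete, here is a small Python sketch of subtitle cleanup. The two patterns (a line consisting only of an upper-case speaker label, and bracketed or parenthesised sound effects) are assumptions chosen for illustration; substudy's real rules may differ.

```python
import re

# Assumed patterns, for illustration only:
#  - a line that is nothing but an upper-case speaker label, e.g. " KATARA:"
#  - sound effects in square brackets or parentheses, e.g. "[door slams]"
SPEAKER_LINE = re.compile(r"^\s*[A-Z][A-Z .'-]*:\s*$")
SOUND_EFFECT = re.compile(r"\[[^\]]*\]|\([^)]*\)")


def clean_subtitle_text(text):
    """Drop speaker-only lines and sound effects, keeping the spoken text."""
    kept = []
    for line in text.splitlines():
        if SPEAKER_LINE.match(line):
            continue  # the whole line is a speaker label
        line = SOUND_EFFECT.sub("", line).strip()
        if line:  # skip lines that held nothing but a sound effect
            kept.append(line)
    return "\n".join(kept)


if __name__ == "__main__":
    print(clean_subtitle_text(" KATARA:\n<i>Water.</i>"))
```

A heuristic like this will occasionally remove real dialogue (a shouted one-word line ending in a colon, for instance), so it is worth spot-checking the cleaned file.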
For example, if you have the following English subtitles in *.srt format:

Code:

1
00:00:01,968 --> 00:00:04,837
 KATARA:
<i>Water.</i>

2
00:00:04,838 --> 00:00:07,240
<i>Earth.</i>

3
00:00:07,240 --> 00:00:09,208
<i>Fire.</i>

4
00:00:09,209 --> 00:00:12,178
<i>Air.</i>

5
00:00:12,178 --> 00:00:14,713
<i>My grandmother used</i>
<i>to tell me stories</i>

And you have matching Spanish subtitles:

Code:

1
00:00:03,100 --> 00:00:05,091
Agua

2
00:00:05,561 --> 00:00:06,557
Tierra

3
00:00:08,230 --> 00:00:08,841
Fuego

4
00:00:10,326 --> 00:00:11,819
Aire

5
00:00:13,684 --> 00:00:16,583
Mi abuela solia contarme historias de
tiempos antiguos.

You can run:

Code:

substudy combine avatar_01_01.es.srt avatar_01_01.en.srt > avatar_01_01.bilingual.srt

...which will output:

Code:

1
00:00:01,100 --> 00:00:05,091
<i>Water.</i>
<font color="yellow">Agua</font>

2
00:00:05,092 --> 00:00:06,557
<i>Earth.</i>
<font color="yellow">Tierra</font>

3
00:00:06,558 --> 00:00:08,841
<i>Fire.</i>
<font color="yellow">Fuego</font>

4
00:00:08,842 --> 00:00:11,819
<i>Air.</i>
<font color="yellow">Aire</font>

5
00:00:11,820 --> 00:00:16,583
<i>My grandmother used to tell me</i>
<i>stories about the old days; a</i>
<i>time of peace,</i>
<font color="yellow">Mi abuela solia contarme historias de</font>
<font color="yellow">tiempos antiguos.</font>

Notice how the subtitle timings have been adjusted, and text like "KATARA:" has been stripped out entirely.
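
One way to picture the timing adjustment is the Python sketch below. It assumes a simple rule: start each subtitle up to two seconds earlier, but never before the previous subtitle has ended. The rule, the 2000 ms value, and the function names are all assumptions for illustration, not substudy's documented algorithm. Applied to the Spanish timings above, this rule happens to give the same start times as the combined output.

```python
import re

TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def to_ms(stamp):
    """Convert an SRT timestamp such as '00:00:03,100' to milliseconds."""
    match = TIMESTAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError("not an SRT timestamp: %r" % stamp)
    hours, minutes, seconds, millis = (int(part) for part in match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def to_stamp(ms):
    """Convert milliseconds back to 'HH:MM:SS,mmm'."""
    hours, rest = divmod(ms, 3600000)
    minutes, rest = divmod(rest, 60000)
    seconds, millis = divmod(rest, 1000)
    return "%02d:%02d:%02d,%03d" % (hours, minutes, seconds, millis)


def extend_starts(spans, lead_in_ms=2000):
    """Move each (start_ms, end_ms) span earlier without overlapping its predecessor."""
    adjusted = []
    previous_end = -1
    for start, end in spans:
        earliest_allowed = max(start - lead_in_ms, previous_end + 1, 0)
        # Never move a start later than it already was.
        adjusted.append((min(start, earliest_allowed), end))
        previous_end = end
    return adjusted


if __name__ == "__main__":
    spanish = [("00:00:03,100", "00:00:05,091"), ("00:00:05,561", "00:00:06,557")]
    spans = [(to_ms(a), to_ms(b)) for a, b in spanish]
    for start, end in extend_starts(spans):
        print(to_stamp(start), "-->", to_stamp(end))
```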

Who is this for?

This will be most useful around CEFR beginner levels A1 and A2, before you're ready to use just foreign-language subtitles, or no subtitles at all. It's especially useful in combination with Subs2SRS and Anki, which—when used all together—will allow you to watch one or two episodes of an easy TV series within a month of starting your studies.

To use this program, you will need some experience with Unix-like command lines, which means that it's probably limited to MacOS X and Linux users (and exceptionally ambitious Windows developers). For installation instructions, see the README page on GitHub.

Please feel free to use this as a tech support thread. If you can't get it working, or if it produces weird results, please let me know. Enjoy!

rdearman
Site Admin
Posts: 7231
Joined: Thu May 14, 2015 4:18 pm
Location: United Kingdom
Languages: English (N)
Language Log: viewtopic.php?f=15&t=1836
x 23123

Re: substudy: A tool for making bilingual subtitles (MacOS X or Linux, command-line)

Post by rdearman » Thu Nov 19, 2015 5:11 pm

Excellent!
: 0 / 150 Read 150 books in 2024

My YouTube Channel
The Autodidactic Podcast
My Author's Newsletter

I post on this forum with mobile devices, so excuse short msgs and typos.

Post by emk » Thu Nov 19, 2015 6:36 pm

rdearman wrote: Excellent!

Thank you!

If people were interested, this could be enhanced to generate Anki cards without too much trouble. A more interesting challenge would be to add support for extracting audio and still images from video files, which would in turn make it possible to clone subs2srs as a reusable library. Sadly, all of this would require free time.

I'm still convinced that language learners would benefit from much better tools for learning from video. Maybe someday. :-)

peterbeischmidt
Yellow Belt
Posts: 62
Joined: Tue Oct 06, 2015 8:25 am
x 105

Post by peterbeischmidt » Thu Nov 19, 2015 9:06 pm

emk wrote: A more interesting challenge would be to add support for extracting audio and still images from video files, which would in turn make it possible to clone subs2srs as a reusable library. Sadly, all of this would require free time.


A while ago I wrote a script which did something similar, and I found that I could easily utilize mplayer:

Code:

-ss <time>: Seek to given time position
--endpos=<[[hh:]mm:]ss[.ms]>
              Stop at given time.

              NOTE: When used in conjunction with --ss option, --endpos time will shift forward by seconds specified with --ss.

              EXAMPLE:

              --endpos=56
                     Stop at 56 seconds.

              --endpos=01:10:00
                     Stop at 1 hour 10 minutes.

              --ss=10 --endpos=56
                     Stop at 1 minute 6 seconds.


This way you can't extract snippets, but on the upside you're not cluttering your hard disk with tens of thousands of audio files that have an average length of 2 seconds.
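
As a sketch of how this could be driven from subtitle timings, the Python below turns one subtitle's start and end times into an mplayer argument list. Because the end position is measured relative to the seek position (per the note quoted above), it passes the clip duration rather than the absolute end time. The helper names are made up for this example, and it uses classic MPlayer's single-dash `-ss` and `-endpos` spellings; adjust them if your player expects the double-dash form shown in the excerpt.

```python
def srt_to_ms(stamp):
    """Convert 'HH:MM:SS,mmm' to integer milliseconds."""
    clock, millis = stamp.strip().split(",")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + int(millis)


def mplayer_args(video, start, end):
    """Build an argument list that plays only the span between two SRT timestamps."""
    begin_ms = srt_to_ms(start)
    duration_ms = srt_to_ms(end) - begin_ms
    if duration_ms <= 0:
        raise ValueError("end must come after start")
    return [
        "mplayer",
        "-ss", "%.3f" % (begin_ms / 1000),         # where to start playing
        "-endpos", "%.3f" % (duration_ms / 1000),  # how long to play after the seek
        video,
    ]


if __name__ == "__main__":
    print(" ".join(mplayer_args("avatar.mkv", "00:00:12,178", "00:00:14,713")))
```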

iguanamon
Black Belt - 2nd Dan
Posts: 2353
Joined: Sat Jul 18, 2015 11:14 am
Location: Virgin Islands
Languages: Speaks: English (Native); Spanish (C2); Portuguese (C2); Haitian Creole (C1); Ladino/Djudeo-espanyol (C1); Lesser Antilles French Creole (B2)
Studies: Catalan
Language Log: viewtopic.php?t=797
x 14189

Post by iguanamon » Thu Nov 19, 2015 9:17 pm

In the meantime, those of us who can't code our way out of a wet paper sack can only dream of getting something out of the box ready. I have done this manually as a bilingual text document before. If you could write a program that would display bilingual subs and export them to srs with audio that would require no programming skills for people who aren't programmers- plug and chug, you could make some money. :)

arthaey
Brown Belt
Posts: 1080
Joined: Sat Jul 18, 2015 9:11 pm
Location: Seattle, WA, USA
Languages: :
EN (native);
ES (adv receptive, int productive);
FR (false beginner);
DE (lapsed beg);
ASL (lapsed beg);
HU (tourist)
Language Log: viewtopic.php?f=15&t=3864&view=unread#unread
x 1675

Post by arthaey » Sun Nov 22, 2015 1:04 am

Thanks! Starred, watched, and forked. :)

I have a Mac and I'm allergic ;) to Wine, so I haven't used subs2srs. But for my upcoming French studies, I'd like to be able to have Anki cards based on movies/TV. If this tool could become a command-line, Mac-friendly substitute for subs2srs, that would be awesome...
Posts in: FrenchGermanHungarianSpanish
NaNoWriMo: 10,000 words
Corrections welcome in any language; I prefer an informal register.

Post by emk » Sun Nov 22, 2015 4:15 am

arthaey wrote: I have a Mac and I'm allergic ;) to Wine, so I haven't used subs2srs. But for my upcoming French studies, I'd like to be able to have Anki cards based on movies/TV. If this tool could become a command-line, Mac-friendly substitute for subs2srs, that would be awesome...

I decided to try adding video support (in hopes of being able to extract audio clips and still images), and I was promptly sucked down into an ffmpeg nightmare. Investigations are underway.

Of course, the real reason for using Wine/VirtualBox is to get access to the amazing open source SubtitleEdit. If you can find a Mac replacement for that (and I think at least one exists), you can use that to OCR subtitles and fix their alignment.

Post by arthaey » Sun Nov 22, 2015 9:30 am

emk wrote: I decided to try adding video support (in hopes of being able to extract audio clips and still images), and I was promptly sucked down into an ffmpeg nightmare. Investigations are underway.

Eww... vaya con dios with that issue. :(

emk wrote: Of course, the real reason for using Wine/VirtualBox is to get access to the amazing open source SubtitleEdit. If you can find a Mac replacement for that (and I think at least one exists), you can use that to OCR subtitles and fix their alignment.

I'm not familiar with SubtitleEdit (since it doesn't run on Macs, I never investigated it). What does it do that VLC + Aegisub don't?

Post by emk » Sun Nov 22, 2015 9:38 pm

iguanamon wrote: If you could write a program that would display bilingual subs and export them to srs with audio that would require no programming skills for people who aren't programmers- plug and chug, you could make some money. :)

Honestly, I'm convinced that there's a lot of money in language learner tools, but only if:

  1. You focus on selling to novice learners who want to learn a language but who have no clue, and
  2. You're very good at marketing.
But I'm more interested in building tools for well-informed, successful language learners—a tiny niche market—and I know that marketing a successful product requires a heartbreaking amount of work. Thus, I generally give my language learning tools away for free, because that way, I have fun sharing them, and my heart isn't broken by putting thousands of hours into a business and watching it fail.

arthaey wrote: I'm not familiar with SubtitleEdit (since it doesn't run on Macs, I never investigated it). What does it do that VLC + Aegisub don't?

SubtitleEdit is an excellent, all-in-one solution to every subtitle-editing need I have ever encountered. It can align subtitles, break subtitles in two, OCR subtitles, spell-check subtitles, compare subtitle timings against the audio waveform, and convert to and from almost every file format known to humankind.

Aegisub may be a perfectly nice tool. But if you ever find that it can't do what you need, I recommend SubtitleEdit highly—it actually replaces about three or four separate tools that appeared in my original subs2srs workflow.

arthaey wrote:
emk wrote: I decided to try adding video support (in hopes of being able to extract audio clips and still images), and I was promptly sucked down into an ffmpeg nightmare. Investigations are underway.

Eww... vaya con dios with that issue. :(

I have cut the Gordian knot by not linking to ffmpeg. Instead, I'm going to require the user to have the ffmpeg command-line tools installed separately, and I'll just run the necessary commands on the user's behalf. This is a net win for the user, because it's much easier to install a halfway decent version of the ffmpeg tools than it is to install the exactly correct version of the correct fork of the ffmpeg library. Also, the ffmpeg API is absolutely nasty to work with.
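
To illustrate the "run the command-line tools on the user's behalf" approach, here is a hedged Python sketch that builds ffmpeg argument lists for an audio clip and a still image, and only runs them if an `ffmpeg` executable is found on the PATH. The function names and file names are invented for this example, and the exact options substudy passes are not shown in this thread, so treat the flag choices as one plausible setup.

```python
import shutil
import subprocess


def audio_clip_command(video, start_s, duration_s, out_path):
    """Arguments to cut an audio-only clip out of a video."""
    return [
        "ffmpeg", "-y",             # overwrite the output without prompting
        "-ss", "%.3f" % start_s,    # seek the input to the clip start
        "-t", "%.3f" % duration_s,  # read only this many seconds
        "-i", video,
        "-vn",                      # drop the video stream
        out_path,
    ]


def still_image_command(video, at_s, out_path):
    """Arguments to grab a single frame at a given time."""
    return [
        "ffmpeg", "-y",
        "-ss", "%.3f" % at_s,
        "-i", video,
        "-frames:v", "1",           # write exactly one video frame
        out_path,
    ]


def run_if_available(command):
    """Run the command if its executable is installed; report whether it ran."""
    if shutil.which(command[0]) is None:
        return False
    subprocess.run(command, check=True)
    return True


if __name__ == "__main__":
    print(" ".join(audio_clip_command("avatar.mkv", 12.178, 2.535, "clip_0005.mp3")))
    print(" ".join(still_image_command("avatar.mkv", 13.4, "clip_0005.jpg")))
```

Keeping the command construction separate from the code that runs it makes the arguments easy to inspect or log before anything touches the video file.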

Anyway, I can now list out all the tracks associated with a video file, and the language they're in:

Code:

$ substudy tracks avatar.mkv
#0 Video (eng)
#1 Audio (spa)
#2 Audio (eng)
#3 Subtitle (eng)

Isn't that pretty? And check out how nice the Rust code is.
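
For readers who want to try something similar without Rust, here is a Python sketch (not the substudy implementation) that formats a track listing from the JSON that `ffprobe -v error -show_streams -of json avatar.mkv` prints. Whether substudy itself calls ffprobe is an assumption here, and the sample JSON is hand-written for illustration rather than captured from a real file.

```python
import json

# Hand-written sample in the shape ffprobe uses for "-show_streams -of json".
SAMPLE_FFPROBE_JSON = """
{"streams": [
  {"index": 0, "codec_type": "video",    "tags": {"language": "eng"}},
  {"index": 1, "codec_type": "audio",    "tags": {"language": "spa"}},
  {"index": 2, "codec_type": "audio",    "tags": {"language": "eng"}},
  {"index": 3, "codec_type": "subtitle", "tags": {"language": "eng"}}
]}
"""


def describe_tracks(ffprobe_json):
    """Return one line per stream, e.g. '#1 Audio (spa)'."""
    lines = []
    for stream in json.loads(ffprobe_json).get("streams", []):
        kind = stream.get("codec_type", "unknown").capitalize()
        # Streams without a language tag are reported as "und" (undetermined).
        language = stream.get("tags", {}).get("language", "und")
        lines.append("#%d %s (%s)" % (stream["index"], kind, language))
    return lines


if __name__ == "__main__":
    print("\n".join(describe_tracks(SAMPLE_FFPROBE_JSON)))
```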

The next step is to write a wrapper to extract still images from video streams, and another wrapper to extract audio clips. I already have a Rust library for identifying the language of a piece of text, so I should be able to automatically pick out the right audio track to match a given set of subtitles.

My plan is to have very good defaults, and only a couple of unavoidable knobs for overriding those defaults. I like software that makes good decisions automatically. :-)

Post by rdearman » Sun Nov 22, 2015 10:24 pm

emk wrote: Isn't that pretty? And check out how nice the Rust code is.


Had a quick look at the documentation on the Rust page. Don't suppose there is a good tutorial in French? That way I can learn a new programming language and a bit more French at the same time.
:D