Help me create a Subs2SRS course?

General discussion about learning languages
User avatar
emk
Black Belt - 1st Dan
Posts: 1691
Joined: Sat Jul 18, 2015 12:07 pm
Location: Vermont, USA
Languages: English (N), French (B2+)
Badly neglected "just for fun" languages: Middle Egyptian, Spanish.
Language Log: viewtopic.php?f=15&t=723
x 6617
Contact:

Postby emk » Tue Dec 05, 2017 12:17 pm

Here's a Wayback Machine copy of Sprachprofi's experiment with Japanese.

smallwhite wrote:I didn't know that! So her method is very different from yours, then.

Yeah, keep in mind I was an English and French speaker learning Spanish, which is really a double discount! (There's a lot of stuff going on with Spanish verbs that has much more in common with English verbs than with French ones.) And Avatar is an unusually good TV series for this sort of thing: It involves tons of straightforward conversation between kids and teenagers, and almost every episode involves either traveling or meeting people for the first time.

When I tried a harder film like Y Tu Mamá También, then I could still start at the beginning and go through the movie in order, but I had to throw out close to half the cards because they were too hard or irrelevant. And if I were studying a totally unfamiliar language with no "discount" (like Sprachprofi with Japanese), then I'd want to hand-pick the order of my cards, too.

Basically, if you use subs2srs-style cards with Anki, then you can study TV that's 2 or 3 CEFR levels above your "natural" level. So if you just started learning the language yesterday, then you could work with TV that you'd normally use around B1: really easy stuff with clear enunciation and straightforward dialog. If you're around B1, then you could tackle TV that's normally appropriate for a C1 student. (When I was around B1 in French, I spent a lot of time on MC Solaar's faster rap songs, and I still have them stuck in my head a half-decade later.) And so on.

And accurate subtitles are essential. I've tried subs2srs on French films with "approximate" subtitles, and it's maybe only 20% as effective. At the very least, you would want 4 sentences out of every 5 to be exact, and the 5th should at least be close.

So if we want to make a course for total novices, what I suspect we need would be good native audio content that a B1 student might enjoy. Jules Verne audio books would be really more B2ish, I suspect, unless the reader is very slow and clear.

smallwhite wrote:I tried a little bit of Assimil recently for the first time, and I don't get why it's so popular. I think your Subs2SRS would be more fun even if I don't like TV.

For people who like Assimil, I think the experience goes something like this:

  • Day 0: "Huh, I don't know this language at all, but somebody said I should try this course."
  • Day 1: "Wait, all I have to do is listen to this really slow audio 10 times while looking at the L2 and L1 text? And then I'm done for the day? That's pretty easy, and it only takes about 20 minutes. I can do this."
  • Day 15: "This is pretty easy, but I'm not sure I'm learning anything."
  • Day 30: "I still don't feel like I'm learning much, but then I looked back at lesson 1, and it was so slow and easy!"
  • Day 49: "Wow, the active wave is scary and I start it tomorrow."
  • Day 50: "OK, I'm really happy that lesson 1 is so slow and pathetically easy, because I wouldn't have survived the 'active wave version' of it otherwise."
Basically, Assimil doesn't feel "productive" at all, and you're never sure it's working. And then all of a sudden you look back at an old lesson and it blows your mind. Then you start going on language forums and you keep telling everybody Assimil is awesome. ;-) I suspect a course like Pimsleur is actually a very similar experience, because it uses some of the same repetition tricks.

With subs2srs, I can see the progress with each individual card separately:

  • The first time I see a card: "This new card is so brutal. I think I listened to it about 15 times before I could follow it, and I'm still only getting about 80%."
  • After 4 to 7 days, and maybe 3 reviews: "That's still a hard card, but I think I'm getting a handle on it."
  • After 25 to 30 days, and maybe 5 reviews: "Wait, what? That card is so easy. I just listen to it and I understand it. How did I ever think that was hard?"
  • After doing no Anki reps for 2 years and ignoring Spanish: "Huh, so about 80% of that card is pretty easy if I listen a couple of times, but I'm not sure about that final 20%." (And after a few hours of reps over a week or two, things come back quickly.)
It's almost as if something "clicks" in my brain around 20 to 30 days after I first start reviewing a card, and what was originally difficult becomes completely obvious. So the trick would be to get students to stick around until at least day 7, and preferably day 30.

So this is why I'd love to find a public-domain/Creative Commons graphic novel, like iguanamon suggested, and then find out what it would take to pay an entertaining voice actor to read it (using a speed and enunciation appropriate for B1 students, not total novices, with lots of emotion and personality). If we found the right book, it might actually make an awesome course for total beginners.

But the choice of materials is definitely important.

User avatar
rdearman
Site Admin
Posts: 7255
Joined: Thu May 14, 2015 4:18 pm
Location: United Kingdom
Languages: English (N)
Language Log: viewtopic.php?f=15&t=1836
x 23258
Contact:

Postby rdearman » Tue Dec 05, 2017 1:11 pm

OK, an update on the poorly chosen Jules Verne book. The forced alignment tool (SPPAS), which was one of the only ones that can do both Fr & En, is confusing in every way, shape or form, has near-zero documentation, and the tutorials are all blank! It uses Julius and Python to generate a GUI, but no amount of selecting or file format conversion or anything else seems to work. I suspect it works well for the author who wrote it, or for someone who is familiar with Julius and is just looking for a GUI on top, but for me it is just a confusing mess. :(

I've reloaded my self-generated stuff into Subtitle Edit, and to be honest, just using an estimate of 9.11 seconds has everything pretty much aligned. It wouldn't take much time for a human to go through it. It will be easier for me to figure out how to use Subtitle Edit for alignment than to figure out the forced alignment tools. I don't really want to have to learn the equivalent of an associate's degree in linguistics just to align some subtitles. :cry:

It would seem the Julius tool is probably the most useful for emk's purposes, since it doesn't seem to require a license and you can call it from other programs. It is also cross-platform.

(I realise I have probably given up too early, but there isn't a lot of documentation and I really can't be arsed to reverse-engineer the file format requirements.)
: 26 / 150 Read 150 books in 2024

My YouTube Channel
The Autodidactic Podcast
My Author's Newsletter

I post on this forum with mobile devices, so excuse short msgs and typos.

Postby emk » Tue Dec 05, 2017 1:25 pm

rdearman wrote:OK, an update on the poorly chosen Jules Verne book. The forced alignment tool (SPPAS), which was one of the only ones that can do both Fr & En, is confusing in every way, shape or form, has near-zero documentation, and the tutorials are all blank! It uses Julius and Python to generate a GUI, but no amount of selecting or file format conversion or anything else seems to work. I suspect it works well for the author who wrote it, or for someone who is familiar with Julius and is just looking for a GUI on top, but for me it is just a confusing mess. :(

I'm really sorry to hear that. I could have sworn that somebody here or at HTLAL had managed to set up a forced alignment tool at some point and had used it to synchronize an audiobook. Please, somebody, figure this out and write it up so that I can be lazy and follow your instructions. ;-)

If you only need to align a chapter or two, then it's perfectly feasible to just start with 9.1 seconds per sentence, and then fix it up in Subtitle Edit manually.
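For anyone who wants to try the "fixed seconds per sentence" starting point, here is a minimal sketch of how the rough SRT file could be generated before cleaning it up by hand. The function names and the 9.1-second default are just illustrative assumptions; nothing here comes from Subtitle Edit itself.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def rough_srt(lines: list[str], seconds_per_line: float = 9.1) -> str:
    """Give every line the same duration, back to back, as a first guess."""
    blocks = []
    for index, text in enumerate(lines):
        start = index * seconds_per_line
        end = start + seconds_per_line
        blocks.append(
            f"{index + 1}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{text}\n"
        )
    # SRT cues are separated by a blank line.
    return "\n".join(blocks)


if __name__ == "__main__":
    print(rough_srt(["Ce Phileas Fogg était-il riche ?", "Incontestablement."]))
```

The timing will drift as soon as sentence lengths vary, so this only makes sense as a draft to correct manually.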

As I noted up-thread, I think if you use a Jules Verne audiobook, you're probably going to get something that works quite well for A2 students. I could first struggle through Jules Verne when I was nearing B2 in French, and it was pretty reasonable once I had C1ish comprehension. So subtract 2 or 3 CEFR levels, and you've got something around A2.

Whereas Avatar is more in the "solid B1" range, and so it works quite well for a false beginner, or even a true beginner who knows what they're doing. Something like Y Tu Mamá También is really a "solid C1" type of film, and so it's fairly painful for a beginner, even with the 2- or 3-level boost from using audio cards.
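The level arithmetic above can be written down as a tiny sketch. Two things here are assumptions made purely for illustration: treating "beginner" as one step below A1, and the function name itself. The material levels are the rough estimates given in this post.

```python
# CEFR levels in order, with an assumed "beginner" step below A1.
LEVELS = ["beginner", "A1", "A2", "B1", "B2", "C1", "C2"]


def easiest_learner_level(material_level: str, boost: int) -> str:
    """Subtract the audio-card boost from the material's level, clamping at beginner."""
    index = LEVELS.index(material_level) - boost
    return LEVELS[max(index, 0)]


if __name__ == "__main__":
    for title, level in [("Avatar", "B1"), ("Jules Verne", "B2"), ("Y Tu Mamá También", "C1")]:
        low = easiest_learner_level(level, 3)
        high = easiest_learner_level(level, 2)
        print(f"{title}: workable from roughly {low} to {high}")
```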

Postby emk » Tue Dec 05, 2017 4:25 pm

rdearman wrote:(I realise I have probably given up too early, but there isn't a lot of documentation and I really can't be arsed to reverse-engineer the file format requirements.)

According to user davidzweig, the Aeneas project is actually pretty reasonable for aligning audiobooks against text (but it won't work for movies). Any willing victims, er, volunteers want to try to get it running? They have a Docker-based version you can run in a container, and based on a half-second glance at the site, it seems to accept text files and MP3s as inputs.

The site says:

Confirmed working on 38 languages: AFR, ARA, BUL, CAT, CYM, CES, DAN, DEU, ELL, ENG, EPO, EST, FAS, FIN, FRA, GLE, GRC, HRV, HUN, ISL, ITA, JPN, LAT, LAV, LIT, NLD, NOR, RON, RUS, POL, POR, SLK, SPA, SRP, SWA, SWE, TUR, UKR

Postby rdearman » Tue Dec 05, 2017 6:31 pm

emk wrote:
rdearman wrote:(I realise I have probably given up too early, but there isn't a lot of documentation and I really can't be arsed to reverse-engineer the file format requirements.)

According to user davidzweig, the Aeneas project is actually pretty reasonable for aligning audiobooks against text (but it won't work for movies). Any willing victims, er, volunteers want to try to get it running? They have a Docker-based version you can run in a container, and based on a half-second glance at the site, it seems to accept text files and MP3s as inputs.

The site says:

Confirmed working on 38 languages: AFR, ARA, BUL, CAT, CYM, CES, DAN, DEU, ELL, ENG, EPO, EST, FAS, FIN, FRA, GLE, GRC, HRV, HUN, ISL, ITA, JPN, LAT, LAV, LIT, NLD, NOR, RON, RUS, POL, POR, SLK, SPA, SRP, SWA, SWE, TUR, UKR

I'll have a look later.

Postby emk » Tue Dec 05, 2017 7:56 pm

rdearman wrote:I'll have a look later.

OK, I had to make certain diabolical bargains, but I just got Aeneas working using the aeneas-vagrant virtual machine.

I can post full instructions on getting the VM working at some point.

Your input should be one MP3 file, and one plain, correctly-encoded UTF-8 text file which looks like this:

Ce Phileas Fogg était-il riche ?
Incontestablement.
Mais comment il avait fait fortune, c'est ce que les mieux informés ne pouvaient dire,

Each line of the text file will become a subtitle, so break things up into a chunk size you like. Clean out any clutter and make sure your text matches your audio reasonably well! Pay particular attention to things like "Chapter 1".
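As a rough illustration of the "one chunk per line" preparation, the sketch below collapses whitespace and breaks the text after sentence-ending punctuation. The function name is made up for this example, and the splitting rule is deliberately naive: abbreviations such as "M. Fogg" would be split in the wrong place, so the result still needs a human pass.

```python
import re


def chunk_text(raw: str) -> list[str]:
    """Split raw text into one sentence-like chunk per list item."""
    # Collapse line breaks and runs of spaces into single spaces.
    collapsed = " ".join(raw.split())
    # Break after '.', '!' or '?' when followed by whitespace.
    pieces = re.split(r"(?<=[.!?])\s+", collapsed)
    return [piece for piece in pieces if piece]


if __name__ == "__main__":
    sample = (
        "Ce Phileas Fogg était-il riche ? Incontestablement. "
        "Mais comment il avait fait fortune,"
    )
    print("\n".join(chunk_text(sample)))
```

Writing the chunks to a file with one per line gives the kind of input shown above; headings such as "Chapter 1" still have to be checked against what the reader actually says.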

From there, you need to run the following command:

python -m aeneas.tools.execute_task \
    tour_monde_verne_01.mp3 \
    tour_monde_verne_01.fr.txt \
    "task_language=fra|os_task_file_format=srt|is_text_type=plain" \
    tour_monde_verne_01.fr.srt

This takes about 10 seconds, and it spits out a nicely-aligned SRT file (well, I only checked a small part, but alignments seemed quite good in the few places I checked).
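If you have a whole folder of chapters, the same command could be driven from a small script. This is only a sketch: it assumes aeneas is installed and that files follow the naming pattern used above (NAME.mp3 next to NAME.fr.txt). The helper names are hypothetical; the command line itself is the one shown in this post.

```python
import subprocess
from pathlib import Path


def build_aeneas_command(audio: Path, text: Path, language: str = "fra") -> list[str]:
    """Build the execute_task command line for one audio/text pair."""
    config = f"task_language={language}|os_task_file_format=srt|is_text_type=plain"
    output = text.with_suffix(".srt")  # NAME.fr.txt -> NAME.fr.srt
    return [
        "python", "-m", "aeneas.tools.execute_task",
        str(audio), str(text), config, str(output),
    ]


def align_all(folder: Path) -> None:
    """Run aeneas on every MP3 in the folder that has a matching text file."""
    for audio in sorted(folder.glob("*.mp3")):
        text = audio.with_suffix(".fr.txt")
        if text.exists():
            subprocess.run(build_aeneas_command(audio, text), check=True)


if __name__ == "__main__":
    align_all(Path("."))
```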

Oddly, very few video players can handle MP3 files with subtitles. The only one I could get to work is my own substudy prototype!

However, I think to make this really useful would require several steps:

  1. Break the French text into nice "chunks", one per line.
  2. Use a tool like hunalign to align the English text against the French text.
  3. Align the French text against the French audio using aeneas, generating French subtitles.
  4. Finally, write a script to turn the English text into English subtitles, using the timing from the French text.
Still, this is totally doable for a pre-packed "course", because only one person would need to do it (always assuming your audio is in the public domain or covered by Creative Commons).
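The last step in that list could look something like the sketch below. It assumes the English lines have already been paired one-to-one, in order, with the French subtitle lines (for example by a text aligner), and it simply reuses each French cue's timing. The function names are invented for illustration.

```python
def parse_srt(srt_text: str) -> list[tuple[str, str]]:
    """Return (timing_line, text) pairs from an SRT document."""
    cues = []
    normalized = srt_text.replace("\r\n", "\n").strip()
    for block in normalized.split("\n\n"):
        lines = block.strip().split("\n")
        # A cue is: index line, timing line, then one or more text lines.
        if len(lines) >= 2 and "-->" in lines[1]:
            cues.append((lines[1], "\n".join(lines[2:])))
    return cues


def retime(french_srt: str, english_lines: list[str]) -> str:
    """Build English subtitles that reuse the timing of the French cues."""
    cues = parse_srt(french_srt)
    if len(cues) != len(english_lines):
        raise ValueError(
            f"{len(cues)} French cues but {len(english_lines)} English lines"
        )
    blocks = [
        f"{number}\n{timing}\n{english}\n"
        for number, ((timing, _), english) in enumerate(zip(cues, english_lines), start=1)
    ]
    return "\n".join(blocks)
```

Refusing to continue when the counts differ is deliberate: a silent off-by-one would shift every later card.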

UPDATE: I listened to half a chapter, and the alignment is almost perfect. So text-to-audio alignment is a solved problem. However, it looks like hunalign can only align at the sentence level, so if you want to align English and French text, you'll have pretty long cards, thanks to those giant 19th-century sentences.

My verdict: This would be a pretty nice tool for somebody around A2 who wanted to work on their reading, but a Jules Verne audiobook doesn't really work as an entry-level deck. Also, you'd probably either want to be selective about which sentences you turn into cards, or to delete cards very aggressively. Still, I might try something like this with my Spanish, to fill in my "written" vocabulary.

Postby rdearman » Tue Dec 05, 2017 8:15 pm

Might be nice to use this for cloze deletion type cards, where you hear the whole sentence and then fill in the blanks. Looks like a good tool. I'll work on it. Regardless of level it would still be a good deck.
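A cloze card of that kind could be generated along these lines. The `{{c1::...}}` markup is Anki's cloze syntax; the helper names and the "blank out the longest word" heuristic are only assumptions for the sake of a small example, and a real deck would want a smarter choice of word.

```python
import re


def longest_word(sentence: str) -> str:
    """Pick the longest word as a crude guess at the most interesting one."""
    return max(re.findall(r"\w+", sentence), key=len)


def make_cloze(sentence: str, target: str, number: int = 1) -> str:
    """Wrap the first occurrence of target in Anki cloze markup."""
    if target not in sentence:
        raise ValueError(f"{target!r} does not occur in the sentence")
    return sentence.replace(target, f"{{{{c{number}::{target}}}}}", 1)


if __name__ == "__main__":
    line = "Quelle heure avez-vous ?"
    print(make_cloze(line, longest_word(line)))
```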

Postby rdearman » Tue Dec 05, 2017 11:21 pm

emk wrote:This would be a pretty nice tool for somebody around A2 who wanted to work on their reading, but a Jules Verne audiobook doesn't really work as an entry-level deck.

This confused me a bit. A2 is a "Basic User" so why wouldn't this be good for them? It wouldn't be a bad thing for anyone I figure, even BX users.

I couldn't be bothered with a VM, and I have two Linux servers in my house, so I've just installed it on one of them. Bit of a fiddle, since I had to compile espeak from source to get all the right libraries. But hoping to test it shortly.

python -m aeneas.diagnostics
[INFO] ffprobe        OK
[INFO] ffmpeg         OK
[INFO] espeak         OK
[INFO] aeneas.tools   OK
[INFO] shell encoding OK
[INFO] aeneas.cdtw    AVAILABLE
[INFO] aeneas.cmfcc   AVAILABLE
[INFO] aeneas.cew     AVAILABLE
[INFO] All required dependencies are met and all available Python C extensions are working

Postby emk » Tue Dec 05, 2017 11:40 pm

rdearman wrote:
emk wrote:This would be a pretty nice tool for somebody around A2 who wanted to work on their reading, but a Jules Verne audiobook doesn't really work as an entry-level deck.

This confused me a bit. A2 is a "Basic User" so why wouldn't this be good for them? It wouldn't be a bad thing for anyone I figure, even BX users.

Sorry if I wasn't clear! What I meant was, "If you make a deck from this audiobook, it will probably be useful for A2 students and up. But if somebody literally starts to learn French by studying this deck, they're probably going to have a rough time."

Like I said, my usual rule of thumb for subs2srs cards is that I can use native materials up to 2 or 3 CEFR levels above my "natural" level (what I could just sit down and watch, say), and it's usually interesting and productive. But if I choose something harder than that, it gets frustrating very quickly. You can turn "N+5" input into N+1 input, but you can't turn N+20 input into N+1 input. :-)

I also managed to get hunalign working moderately well. It will align the English and French text so we know which sentences go with which:

hunalign -text hunalign_dict.txt tour_monde_verne_01.fr.txt tour_monde_verne_01.en.txt

This outputs a file like:

-- Oui, monsieur.       "Yes, monsieur."        0.3
-- Bien.        "Good!  0.257143
Quelle heure avez-vous ?        What time is it?"       0.16
-- Onze heures vingt-deux, répondit Passepartout, en tirant des profondeurs de son gousset une énorme montre d'argent.  "Twenty-two minutes after eleven," returned Passepartout, drawing an enormous silver watch from the depths of his pocket.       0.457929

This could then be processed with a script to extract just the French text and feed it to aeneas to add timing information. Then the English can be turned into a subtitle file by copying the timing from the French.
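A first pass at that script might look like this. It assumes the output is exactly what is shown above: three tab-separated columns (French, English, score) per line. The function names are hypothetical, and rows with an empty French side are simply skipped, because there would be no audio to time them against.

```python
def parse_hunalign(output: str) -> list[tuple[str, str, float]]:
    """Parse tab-separated (french, english, score) rows."""
    pairs = []
    for line in output.split("\n"):
        fields = line.rstrip("\r").split("\t")
        if len(fields) != 3:
            continue  # blank or malformed line
        french, english, score = fields
        pairs.append((french.strip(), english.strip(), float(score)))
    return pairs


def split_columns(pairs: list[tuple[str, str, float]]) -> tuple[list[str], list[str]]:
    """Return parallel French and English lists, skipping rows with no French."""
    french, english = [], []
    for fr, en, _score in pairs:
        if fr:
            french.append(fr)
            english.append(en)
    return french, english
```

The French list, written out one item per line, would be the text file fed to aeneas; the English list stays in the same order so the timings can be copied across afterwards.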

As the classic saying goes:

In #DataScience, 80% of the time is spent preparing data, 20% of the time is spent complaining about the need to prepare data.

My apologies to anybody who finds the tech stuff boring. I just want to save rdearman the trouble of aligning an entire book by hand. :-)

Postby rdearman » Wed Dec 06, 2017 12:15 am

Cool, it works. I had some issues with the text files, because messing around with them on Windows seems to have confused the file encoding. So I'll start again, download the text to my Linux box, do all the hunalign, split, etc. there, and then run the forced alignment tool.
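For files that have already been through a Windows editor, something like the following sketch might rescue them instead of starting over. It is based on a guess about what went wrong: that the file is either UTF-8 (possibly with a byte-order mark) or Windows-1252, with CRLF line endings. UTF-16 files are not handled, and the function names are made up.

```python
def to_clean_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8 (dropping any BOM), falling back to Windows-1252."""
    try:
        text = raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        text = raw.decode("cp1252", errors="replace")
    # Normalise Windows and old Mac line endings to plain newlines.
    return text.replace("\r\n", "\n").replace("\r", "\n")


def convert_file(source: str, destination: str) -> None:
    """Rewrite a text file as BOM-free UTF-8 with Unix line endings."""
    with open(source, "rb") as handle:
        text = to_clean_utf8(handle.read())
    with open(destination, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(text)
```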

I played it in Subtitle Edit and it was perfectly aligned for the 3 different SRT files I tried. So, fix the encoding and it is a done deal. I had to compile hunalign, but since I'm modifying the file to have one sentence per line, it was pretty good on the FR to EN alignment as well. Actually it makes for a good LR system. You can Listen & Read all day long. :)