How not to learn Spanish: Building too much stuff, not studying enough

Postby **emk** » Sat Mar 09, 2024 6:35 pm

MorkTheFiddle wrote:After my post yesterday I tried using the Windows Ubuntu app. The process still did not succeed, but at least Ubuntu gave me an error message clearly laying the blame on the lack of a certificate.

I suspect that in order to get substudy working on Windows, we'll need someone familiar with either the Windows Console or PowerShell. Substudy.exe is a normal Windows command-line app (not an Ubuntu app, or a MinGW app), and it requires a working ffmpeg install. I can see that substudy works on Windows, because the command-line test harness actually manages to run it on Windows, and it sees the expected output. But that happens on a Windows test server in the cloud, and I can't access it directly.

Unfortunately, I can't turn the test harness setup into step-by-step instructions. So sadly, I think we'll need to wait to see whether someone makes a walkthrough video.

Postby **emk** » Sat Mar 09, 2024 11:02 pm

Whooo! Whisper works!

I now have some very reasonable-looking subs for Avatar 01.03, which I've never had. When I started this project, I had good subs for episodes 01.01, 01.02, 01.05 and 01.06, and nothing else. This was never enough to really do the study experiment right; I could have used another 2-4 episodes.

But Whisper is working!

avatar_01_03_good_subs.jpg

These subs actually match the audio, unlike the SRT files I originally used. It takes less than a minute to process a 20-minute episode, and it costs about US$0.15. I really think this changes the game, at least for popular languages and clearly enunciated audio.

And there are tons of sites that offer automatic subtitle translation once you have an SRT file. So this means you can get bilingual subtitles!

Subtitle wrapping and splitting. The other thing I've been working on is splitting subtitles. For example, this subtitle is too long in most video players, and it's a bit too long to make an ideal audio card:

Han pasado cien años y la nación del fuego está alcanzando la victoria\nen esta guerra.

But I found a Knuth-Plass line-breaking library and experimented until I could get clean breaks:

Code: Select all

12
00:00:44,300 --> 00:00:47,100
Han pasado cien años
y la nación del fuego

13
00:00:47,100 --> 00:00:49,080
está alcanzando la victoria
en esta guerra.
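For readers curious how this kind of breaking works, here is a minimal Python sketch. It is not the library mentioned above, and it is a simplified minimum-raggedness dynamic program in the spirit of Knuth-Plass rather than the full algorithm. The function name `break_lines` and the 28-character width are assumptions chosen for illustration, so the breaks it picks may differ from the ones shown.

```python
def break_lines(text: str, max_width: int = 28) -> list[str]:
    """Split text into lines of at most max_width characters,
    minimizing the sum of squared unused space over all lines."""
    words = text.split()
    n = len(words)
    # best[i] is the lowest cost of laying out words[i:];
    # next_break[i] is the index where the first line of that layout ends.
    best = [float("inf")] * (n + 1)
    next_break = [n] * (n + 1)
    best[n] = 0.0
    for i in range(n - 1, -1, -1):
        length = -1  # cancels the space counted before the first word
        for j in range(i + 1, n + 1):
            length += len(words[j - 1]) + 1
            # Stop once the line overflows, but always allow a single
            # overlong word to sit on a line by itself.
            if length > max_width and j > i + 1:
                break
            slack = max(max_width - length, 0)
            cost = slack ** 2 + best[j]
            if cost < best[i]:
                best[i] = cost
                next_break[i] = j
    lines = []
    i = 0
    while i < n:
        j = next_break[i]
        lines.append(" ".join(words[i:j]))
        i = j
    return lines


if __name__ == "__main__":
    sentence = ("Han pasado cien años y la nación del fuego "
                "está alcanzando la victoria en esta guerra.")
    for line in break_lines(sentence):
        print(line)
```

Penalizing the squared slack on every line, including the last, favors evenly filled lines, which tends to suit short subtitle lines better than filling each line greedily.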

Remember, when these get turned into cards, I always add (1) the previous and next lines, and (2) 1.5 seconds of audio padding before and after. This makes it feasible to study cards with sentences cut in two.
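The card assembly described here can be sketched in a few lines of Python. The `Subtitle` and `Card` types and the `make_card` function are hypothetical names for illustration, not substudy's actual data model; only the 1.5 seconds of padding and the previous/next context come from the description above.

```python
from dataclasses import dataclass


@dataclass
class Subtitle:
    start: float  # seconds
    end: float    # seconds
    text: str


@dataclass
class Card:
    audio_start: float
    audio_end: float
    previous_text: str
    text: str
    next_text: str


def make_card(subs: list[Subtitle], index: int, padding: float = 1.5) -> Card:
    """Build a card for subs[index] with neighboring lines and padded audio."""
    sub = subs[index]
    previous_text = subs[index - 1].text if index > 0 else ""
    next_text = subs[index + 1].text if index + 1 < len(subs) else ""
    return Card(
        audio_start=max(sub.start - padding, 0.0),  # never before the start of the file
        audio_end=sub.end + padding,
        previous_text=previous_text,
        text=sub.text,
        next_text=next_text,
    )
```

Because the padded audio overlaps the neighboring subtitles, a card for the second half of a split sentence still lets you hear the end of the first half.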

Next challenges. The Whisper→SRT code isn't quite good enough yet. There are two problems:

"Phantom" subtitles. These tend to be short, common phrases. But they don't actually exist in the audio track, and the guesstimated timing is garbage. I need to strip these out, somehow. I get a half dozen of these per episode.
Subtitles which start early/end late. Whisper is surprisingly good about knowing which words it heard and in what order. But oftentimes, when it's looking at tiny words like "el" and "en", it knows it heard them, but can only say, "Somewhere in that 5 seconds, I think?" So for most subtitles, the timing is tight. But occasionally you get a subtitle which starts 5 seconds too early, or ends 5 seconds too late. I think I can clean this data up with a few heuristics.
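One heuristic of this kind could look like the Python sketch below. It is an assumption for illustration only, not what substudy does: if the first or last word of a subtitle claims an implausibly long time span, trust only the part of that span nearest the rest of the subtitle. The `Word` type, the `tighten_timing` name and the 1-second cap are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def tighten_timing(words: list[Word], max_word_seconds: float = 1.0) -> tuple[float, float]:
    """Return (start, end) for a subtitle, clamping overlong edge words."""
    if not words:
        raise ValueError("a subtitle needs at least one word")
    first, last = words[0], words[-1]
    # A short word cannot plausibly last several seconds, so keep only the
    # final max_word_seconds of the first word and the initial
    # max_word_seconds of the last word.
    start = max(first.start, first.end - max_word_seconds)
    end = min(last.end, last.start + max_word_seconds)
    if end <= start:
        # Happens for a single overlong word: there is no neighbor to
        # anchor against, so leave the original timing alone.
        return first.start, last.end
    return start, end
```

With a 5-second "en" followed by tightly timed words, this pulls the subtitle start forward to one second before the next word. It does nothing about phantom subtitles, which would need a separate check.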

So I'm really optimistic—this is getting close to the point where it's a big win. Pick a clean TV series in a major language, feed the audio into Whisper & substudy (& optionally a translator), spit out a deck of really quite decent audio cards. And the more tools I can combine into one, the easier it gets.

Oh, and all the time that I'm working on this code? I'm listening to Spanish dialog, reading Spanish subtitles, and checking carefully to make sure they match. So some studying is getting done by accident. :lol:

Postby **rdearman** » Sun Mar 10, 2024 12:29 am

Don't suppose you have uploaded the rust/whisper code anywhere? I am curious.

Postby **emk** » Sun Mar 10, 2024 4:17 am

rdearman wrote:Don't suppose you have uploaded the rust/whisper code anywhere? I am curious.

I'm going to polish everything up some more, and make a release.

My work in progress is on GitHub, on the whisper2srt branch. You could theoretically try to get it working by looking at python-experiments/README.md and substudy/README.md. I'm running commands like:

Code: Select all

python whisper.py scratch/avatar_01_03.mp3 scratch/avatar_intro.txt scratch/avatar_01_03.json
env RUST_LOG=substudy::import::whisper=debug,substudy::segment=debug \
    cargo run -p substudy -- import whisper scratch/avatar_01_03.json > scratch/avatar_01_03.srt

But honestly, it's probably going to be far easier to wait until I finish moving all the Python code to Rust, and until I finish dealing with weird subs and timing.

Oooh, shiny. Looks like ChatGPT 3.5 can do basic subtitle translation! This is better than some of the free web-based subtitle translators I've seen.

avatar-gpt-3.5-translate.png

At this point, I'm convinced that it should be possible to take a simple video file, with no subs, and automatically generate bilingual subs that are more than good enough to use in a Subs2SRS deck. 8-)

Now, all that remains is for someone to actually code it!

Postby **emk** » Mon Mar 11, 2024 1:43 am

rdearman wrote:Don't suppose you have uploaded the rust/whisper code anywhere? I am curious.

Substudy 0.5.1 is out! This features a new command, import whisper-json. This can be used to generate surprisingly high-quality SRT files, at least for easy Spanish.

Code: Select all

substudy import whisper-json my_movie.json > my_movie.srt

To use this, you will first need to generate "my_movie.json", which you can do using this Python script and an OpenAI API key. Yeah, not very user friendly, but I haven't converted that code to Rust yet. This is the "rdearman wants to try it" release. If you haven't worked with Python before, please wait until I have something nicer.

There are some Mac/Linux setup instructions for the Python scripts. You will also need to stick to audio files smaller than 25 MB. Here's how I extract the audio file from the video:

Code: Select all

ffmpeg -i video.mp4 -vn -acodec mp3 -ac 1 -b:a 128k extracted_audio_for_whisper.mp3
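To see how much audio fits under the 25 MB limit at a given bitrate, the arithmetic can be sketched in Python as below. The `max_minutes` helper is a hypothetical name, and the sketch assumes a constant bitrate, ignores container overhead, and treats 25 MB as 25,000,000 bytes, so treat the result as a rough guide.

```python
def max_minutes(limit_bytes: int, bitrate_bits_per_second: int) -> float:
    """Approximate minutes of constant-bitrate audio that fit in limit_bytes."""
    bytes_per_second = bitrate_bits_per_second / 8
    return limit_bytes / bytes_per_second / 60


if __name__ == "__main__":
    limit = 25_000_000  # assumed meaning of "25 MB"
    for kbps in (128, 64):
        minutes = max_minutes(limit, kbps * 1000)
        print(f"{kbps}k: about {minutes:.0f} minutes")
```

At 128k this comes to roughly 26 minutes, which covers a typical TV episode. Halving the bitrate roughly doubles the length that fits, at some cost in audio quality.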

The "whisper.py" script wants a prompt file, which should contain:

For songs, the lyrics to the song.
For TV shows, either the opening text, or a bit of sample dialog. You can also add a list of character names.

This gives the transcriber a hint about what you want. According to OpenAI's documentation, only the final 224 tokens of the prompt are used.

Where I've invested my effort so far: Turning Whisper output into good subtitles. There's still some tweaking left to do, but the timing is better than most fansubs I've seen in the wild. Pretty accurate transcriptions with good timing are a game changer.

What's next: Probably some combination of:

Automatic translations using ChatGPT.
A Rust-based replacement for the "whisper.py" script, with support for audio files larger than 25 MB.

This will take a little while.

A transcriber stress test. Want to confuse your AI? Try this song:

There's a lot going on here, particularly where there are two nearly-overlapping singers singing at different volumes. Whisper tends to ignore the quieter singer, but seems to do a pretty solid job on the primary singer. Also, there are some repeated lines in the refrain, and Whisper often gets the number of repeats wrong.

My Spanish study today has been listening to lots of Avatar and music while carefully checking subs. I guess it's a "learning technique!"

Postby **rdearman** » Mon Mar 11, 2024 9:29 pm

I don't have an OpenAI API key, but I do use whisper with this script.

Code: Select all

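#!/usr/bin/python3
# -*- coding: utf-8 -*-
# Thin launcher for the command-line interface of the local Python Whisper package.
import re
import sys
from whisper.transcribe import cli
if __name__ == '__main__':
    # Strip a Windows launcher suffix ("-script.pyw" or ".exe") from the program name.
    sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
    # Run Whisper's CLI and exit with its return value.
    sys.exit(cli())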

I'm just checking this out; if it doesn't work, I'll probably have to get another key. (My old one ran out.)

EDIT: I'm pointing this out because it generates an SRT file as well as tsv, json, txt, and vtt files.

Postby **emk** » Mon Mar 11, 2024 10:04 pm

rdearman wrote:I don't have an OpenAI API key, but I do use whisper with this script.

Yeah, there's a downloadable Python version of Whisper. But I'm using OpenAI's paid online version (the API model is named "whisper-1"), because:

As xkcd points out, Python installs are basically cursed and much harder than getting a single-file binary executable to run. (If we ignore ffmpeg for the moment.) If I go this route, there is no possible future in which ordinary users can use substudy.
I also plan to use the OpenAI API for translations, and for adding definitions/explanations to cards automatically. Big study workflow improvements are possible here, and I can get them all with just one API key.

Postby **emk** » Mon Mar 11, 2024 11:17 pm

rdearman wrote:EDIT: I'm pointing this out because it generates an SRT file as well as tsv, json, txt, and vtt files.

Yes, feel free to create SRTs directly from Python Whisper. If you do, please let me know how well that works!

The Rust code in substudy does an excessive amount of extra work to tighten up subtitle spacing, and to split TV dialog subtitles in slightly better places for SRS cards. But you may end up preferring their version?

I'll be spending an evening working on an automatic subtitle translation tool, even for SRT files where you don't have the audio. So far, this seems to work better than the free online subtitle translators I tried. But really, my goal here is to have a one-stop tool that doesn't require gluing together 6 other tools. :lol:

Programming music. It's time for some Mägo de Oz! (It's always time for some Mägo de Oz.)

Mägo de Oz is one of my all-time-favorite Spanish music discoveries.

Postby **sfuqua** » Tue Mar 12, 2024 12:37 am

This is very exciting stuff. I wasted a few days trying to get subtitles for Children of the Sea (Kaijuu no Kodomo) before I gave up... I eventually got a subs2srs deck for the movie where the audio and the written stuff barely matched...

This would be wonderful.

Postby **Carl** » Tue Mar 12, 2024 1:57 am

emk wrote:Oooh, shiny. Looks like ChatGPT 3.5 can do basic subtitle translation! This is better than some of the free web-based subtitle translators I've seen.

This is a neat project! I put a lot of time into Avatar on subs2srs during the pandemic lockdown, so I recognized all the lines in the images you posted, but it was so time-intensive to make the materials that I stopped. This would be quite helpful.

Before you go down the ChatGPT 3.5 translation path too far, you might check into how reliable it is in longer translations. I've found it impressive for relatively short chunks of text. It can identify the antecedent for "su" in Spanish correctly more often than DeepL or Google Translate, for example; it is less likely to translate "Juanita tuvo su período" as "Juanita had his period."

But I tried feeding it a longer text--the entire Abschnitt I of Kafka's Die Verwandlung, I believe--for a parallel text project, and after a few thousand words, I couldn't make out how the translation lined up with the original. When I looked more closely at ChatGPT's translation, I realized ChatGPT was hallucinating the translation. It had gotten off on a tangent and was telling a completely different story than the one Kafka had written.

This was just one anecdote. But if it's a general issue, maybe you could simply feed ChatGPT shorter chunks of subtitles at a time, rather than, say, a whole episode's worth?
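The chunking idea could be sketched in Python like this. The `chunk_lines` name and the 1,000-character budget are assumptions for illustration; a real tool would more likely budget by tokens and might overlap chunks to give the translator some context.

```python
def chunk_lines(lines: list[str], max_chars: int = 1000) -> list[list[str]]:
    """Group subtitle lines into consecutive chunks of at most max_chars
    characters each, never splitting a single line across chunks."""
    chunks: list[list[str]] = []
    current: list[str] = []
    current_size = 0
    for line in lines:
        # Start a new chunk when adding this line would exceed the budget.
        if current and current_size + len(line) > max_chars:
            chunks.append(current)
            current = []
            current_size = 0
        current.append(line)
        current_size += len(line)
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be sent as its own translation request, so one derailed response can only affect a small, easily checked group of subtitles.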
