How not to learn Spanish: Building too much stuff, not studying enough

Post by **kundalini** » Fri Mar 15, 2024 4:05 pm

There's a Python library that creates custom parallel texts by aligning existing translations (through machine learning) that you may find interesting:

https://forum.language-learners.org/vie ... 14&t=18986

It can produce texts that look like this:

Some potential drawbacks:

- It takes some work to prepare the texts. They require manually labeling the title, author, etc.
- It doesn't easily lend itself to generating flashcards. Not a downside for me, but sounds like it might be for you.
- I find it unpleasant to read long texts on LCD screens, and much prefer e-readers. But the parallel text format doesn't lend itself well to the screen sizes of most e-readers. I think interlacing the text at the sentence level would be more appropriate for smaller screens, as discussed in this thread:

https://forum.language-learners.org/vie ... 19&t=16607

I have considered hacking together a script to go through the output HTML by paragraph then interlace the sentences when possible and export the text as epub, which would be right around the limits of my current programming ability, but haven't yet mustered the energy to do this!
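
For anyone tempted by the same idea, here is a minimal sketch of the interlacing step in Python. It assumes the aligned paragraphs have already been pulled out of the HTML as pairs of plain strings. The function names and the naive regex sentence splitter are illustrative assumptions, not part of the library. Sentences are interlaced only when both sides split into the same number of sentences; otherwise the paragraph pair is kept whole. The epub export is not covered.

```python
import re

# Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
# Real text with abbreviations or quoted speech needs something more careful.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(paragraph: str) -> list[str]:
    """Split a paragraph into sentences with the naive rule above."""
    return [s for s in _SENTENCE_END.split(paragraph.strip()) if s]


def interlace_paragraph(source: str, target: str) -> list[str]:
    """Alternate source and target sentences when the counts match.

    If the two sides split into a different number of sentences, fall back
    to the whole source paragraph followed by the whole target paragraph.
    """
    src = split_sentences(source)
    tgt = split_sentences(target)
    if len(src) != len(tgt):
        return [source.strip(), target.strip()]
    lines = []
    for s, t in zip(src, tgt):
        lines.extend([s, t])
    return lines
```

Each returned list could then be written out as one block of alternating lines per paragraph.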

Post by **rdearman** » Fri Mar 15, 2024 8:03 pm

emk wrote:So my takeaway is that I can profitably work with parallel texts, using any of the standard tools (extensive reading, Assimil-style learning, Anki sentence cards, Listening/Reading, etc). But sadly, all of the Android apps for doing this with actual ebooks are dumpster fires.

Do you remember helping me with text-alignment tools for the subs2srs thing I was doing for librivox books?

https://forum.language-learners.org/vie ... =10#p90940

Aeneas and hunalign.

Post by **emk** » Fri Mar 15, 2024 8:15 pm

kundalini wrote:There's a Python library that creates custom parallel texts by aligning existing translations (through machine learning) that you may find interesting:
...
I have considered hacking together a script to go through the output HTML by paragraph then interlace the sentences when possible and export the text as epub, which would be right around the limits of my current programming ability, but haven't yet mustered the energy to do this!

rdearman wrote:Do you remember helping me with text-alignment tools for the subs2srs thing I was doing for librivox books?

Yeah, those book aligners are fantastic. Though the last time I looked, they took lots of tech skills—which isn't a problem for me, if it's just a couple of books to kick off the Super Challenge. I'd still love to support book alignment natively in substudy, but it's a whole separate problem.

It looks like two state-of-the-art algorithms are Vecalign and Bertalign. These both rely on larger underlying Python libraries, although someone ported the biggest pre-requisite for Bertalign to Rust. Still, it appears to depend on the C++ version of Torch, which means it probably requires a decent GPU and some handholding, plus big downloads. Ugh.

A file format for alignments. Many years ago, I sketched out a simple file format for storing aligned subtitles, video and books. Having one format shared between multiple tools and scripts would make it easy to do alignments with a Python script, then hand them off to a reading (or watching) tool, and then hand them off again to a card-creator tool.

The format would look something like:

Code: Select all

my_movie.aligned/ (the processed movie is stored as a folder!)
    metadata.json
    files/
        full-movie.mp4
        subtitle1.es.mp3
        subtitle2.es.mp3

The actual subtitles or sentences would live in metadata.json, as described in the link above. This would look something like:

Code: Select all

{
  "baseTrack": {
    "type": "media",
    "lang": "fr",
    "file": "episode1.mp4"
  },
  "alignments": [
    {
      "timeSpan": [
        10,
        15.5
      ],
      "tracks": [
        {
          "type": "html",
          "lang": "fr",
          "html": "<i>Jean &amp; Luc:</i> On y va !"
        },
        {
          "type": "html",
          "lang": "en",
          "html": "<i>Jean &amp; Luc:</i> Let's go!"
        },
        {
          "type": "image",
          "file": "episode1_12_75.jpg"
        },
        {
          "type": "media",
          "lang": "fr",
          "file": "episode1_9_00_16_50.mp3"
        }
      ]
    }
  ]
}
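
To show how another tool might consume this, here is a small Python sketch that pulls sentence pairs out of a metadata.json shaped like the example above. The function name `text_pairs` and the handling of missing fields are assumptions for illustration; the format itself is only a sketch, so none of this is a real substudy API.

```python
import json

# A cut-down copy of the example above, embedded so the sketch runs on its own.
SAMPLE = """
{
  "baseTrack": {"type": "media", "lang": "fr", "file": "episode1.mp4"},
  "alignments": [
    {
      "timeSpan": [10, 15.5],
      "tracks": [
        {"type": "html", "lang": "fr", "html": "<i>Jean &amp; Luc:</i> On y va !"},
        {"type": "html", "lang": "en", "html": "<i>Jean &amp; Luc:</i> Let's go!"},
        {"type": "image", "file": "episode1_12_75.jpg"}
      ]
    }
  ]
}
"""


def text_pairs(metadata: dict, foreign: str, native: str):
    """Yield (time_span, foreign_html, native_html) for each alignment
    that has an "html" track in both languages."""
    for alignment in metadata.get("alignments", []):
        by_lang = {
            track["lang"]: track["html"]
            for track in alignment.get("tracks", [])
            if track.get("type") == "html"
        }
        if foreign in by_lang and native in by_lang:
            span = alignment.get("timeSpan")  # assumed optional, e.g. for books
            yield (tuple(span) if span else None), by_lang[foreign], by_lang[native]


if __name__ == "__main__":
    for span, fr, en in text_pairs(json.loads(SAMPLE), "fr", "en"):
        print(span, fr, en)
```

A reading tool or card creator would only need a loop like this, which is the point of sharing one format between tools.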

If I want to take substudy much further, I need to support a format something like this soon. It could be used to glue different tools together, especially things like sentence aligners.

Another weekend project. I should make a new Anki deck with episodes 3 & 4! I want to experiment with archiving all cards after 2 months, in favor of adding new cards from new materials.

And I should try to get the pure-dialog playlists into a phone music player again. Except that used to be easy, and now it's messier everywhere except Apple?

Post by **kundalini** » Sat Mar 16, 2024 3:51 am

emk wrote:It looks like two state-of-the-art algorithms are Vecalign and Bertalign.

Thank you for this! Bertalign is exactly what I had in mind.

Using the Google Colab notebook (https://colab.research.google.com/drive/123GhXwgwmQp1F5SVZ74_uIgyxo6hLRq0?usp=sharing), it only takes about two minutes of work (and some wait time) to align txt files. Then I was able to export the aligned file (The Count of Monte Cristo with alternating sentences in French and English) as epub to put on my ereader. The results aren't perfect, and there are some artefacts, but they are more than good enough for my needs. I'll post a guide in a separate thread for others who may be interested.

image.png

Post by **emk** » Sat Mar 16, 2024 8:34 pm

kundalini wrote:Thank you for this! Bertalign is exactly what I had in mind.

Using the Google Colab notebook (https://colab.research.google.com/drive/123GhXwgwmQp1F5SVZ74_uIgyxo6hLRq0?usp=sharing), it only takes about two minutes of work (and some wait time) to align txt files.

Oh, I totally forgot about Google Colab. That's fantastic, because you can do it all through the web, instead of setting up Python and Torch locally.

Turning an episode into an Anki deck. First, we need a copy of the episode as a video file. (I own all the DVDs.) Then we can make a deck with just three commands!

Code: Select all

substudy transcribe avatar_01_07.mkv --example-text=avatar_intro.txt > avatar_01_07.es.srt
substudy translate avatar_01_07.es.srt --native-lang=en > avatar_01_07.en.srt
substudy export csv avatar_01_07.mkv avatar_01_07.es.srt avatar_01_07.en.srt

And boom, it's done. It takes less than 5 minutes, and costs about US$0.20 per episode.

You'll need an "OPENAI_API_KEY" set up, etc., like I talked about earlier. Here, "avatar_intro.txt" contains some sample text associated with the episode. Whisper uses this text to figure out what language the episode is in, and to provide a bit of context. I just pasted in the intro text repeated at the start of each episode. The export step outputs a directory "avatar_01_07_csv" containing "cards.csv" and a bunch of image and MP3 files.
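
If you want to repeat this for a whole season, the three commands above can be generated per episode. This is a rough Python sketch, not part of substudy: `deck_commands` and `run_all` are hypothetical helper names, and the `foreign` argument only affects the subtitle file names (it is not passed to substudy as a flag). The subcommands and flags are exactly the ones shown above.

```python
import subprocess
from pathlib import Path


def deck_commands(video: str, example_text: str, foreign: str = "es", native: str = "en"):
    """Return (argv, stdout_path) pairs mirroring the three commands above.

    stdout_path is the file the command's output is redirected to,
    or None when the command writes its own output directory.
    """
    stem = Path(video).stem
    foreign_srt = f"{stem}.{foreign}.srt"
    native_srt = f"{stem}.{native}.srt"
    return [
        (["substudy", "transcribe", video, f"--example-text={example_text}"], foreign_srt),
        (["substudy", "translate", foreign_srt, f"--native-lang={native}"], native_srt),
        (["substudy", "export", "csv", video, foreign_srt, native_srt], None),
    ]


def run_all(commands):
    """Run each command in order, redirecting stdout where requested."""
    for argv, stdout_path in commands:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "w", encoding="utf-8") as out:
                subprocess.run(argv, check=True, stdout=out)
```

Calling `run_all(deck_commands("avatar_01_07.mkv", "avatar_intro.txt"))` would reproduce the session above, assuming substudy is on the PATH and the API key is set.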

Here's a pretty image.

substudy-full-workflow-small.png

Do not ask me why the progress bars are two different colors. I may not understand the progress bar library yet!

Also, anyone who says that AI is a scam that can't do anything useful is more than welcome to do these steps the old way, and compare. :lol:

Importing into Anki. This is the most complicated part of the process, by far. And I won't explain it here, because ugh, and because I'm spending way too much time on this project anyways. But here are some Anki templates to help.

Code: Select all

{{Image}}<br>
{{Sound}}

Code: Select all

{{FrontSide}}

<hr id=answer>

<div class="foreign">
<div class="prev">{{Foreign Prev}}</div>
<div class="curr">{{Foreign Curr}}</div>
<div class="next">{{Foreign Next}}</div>
</div>

<div class="native">
<div class="prev">{{Native Prev}}</div>
<div class="curr">{{Native Curr}}</div>
<div class="next">{{Native Next}}</div>
</div>

<div class="source">{{Source}} {{Time}}</div>

Code: Select all

img {
  max-height: 240px;
  max-width: 160px;
}

.card {
    font-family: arial;
    font-size: 20px;
    text-align: center;
    color: black;
    background-color: white;
}

.foreign, .native {
    margin-bottom: 20px;
}

.foreign .curr {
    font-size: 120%;
}

.native {
    font-size: 80%;
}

.prev, .next {
    color: grey;
}

.source {
    font-size: 60%;
    color: grey;
}

This gives us cards which look like:

Screenshot_20240316-154500-small.png

As always, I like to enjoy my reviews. So I put the image on the front of the card, add the surrounding lines of dialog for context, and automatically pad all audio clips by 1.5 seconds on either side. Easy cards are still surprisingly effective. You might even try shifting everything to the front except the current line of Spanish dialog. It might work great!
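
The padding itself is simple arithmetic. Here is a Python sketch of the idea, which is not substudy's actual code: the function name `padded_span` and the clamping to the start and end of the file are my assumptions for illustration.

```python
from typing import Optional, Tuple


def padded_span(
    start: float,
    end: float,
    pad: float = 1.5,
    duration: Optional[float] = None,
) -> Tuple[float, float]:
    """Widen a subtitle's time span by `pad` seconds on either side.

    The start is clamped at 0, and the end is clamped at `duration`
    when the total length of the media file is known.
    """
    if end < start:
        raise ValueError("end must not be before start")
    padded_start = max(0.0, start - pad)
    padded_end = end + pad
    if duration is not None:
        padded_end = min(duration, padded_end)
    return padded_start, padded_end
```

For example, a line spoken from 10 to 15.5 seconds would be clipped from 8.5 to 17.0 seconds, which leaves a little breathing room around the dialog.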

Review workflow. Don't think of this as "learning flashcards." Think of this more like an Assimil passive wave, or think of it as a way to "amplify" interesting bits of input and make them comprehensible.

Here are some suggested settings:

- New cards per day: 10 or 20, if this is your first time. Your daily reviews will be about 10x this. First time Anki users get themselves into trouble here.
- Daily review limit: 100 or 200. Don't worry if you fall behind some. Again, we're using Anki as an input "amplifier."
- Leech threshold: No more than 2. If a card is hard, we don't want to bother.
- Leech action: Suspend. Ditching leeches will make Anki roughly 300% more fun, I promise.
- Interval for easy cards: I dunno, 5 days seems a little low if you already have a base in the language. Or you can just suspend easy cards on initial review.

When reviewing cards:

- Mark a card as pass if you can understand at least 80% of the audio on the front side. Or use your usual Assimil passive wave rules. You'll probably get the other 20% once the card matures, anyway. Feel free to listen to it a bunch of times.
- After the first couple of days with a card, hit any other buttons you want. Easy, hard, whatever. It doesn't matter. If a card's easy but hilarious, you can mark it hard so you see it more often. Again, we're using Anki to "amplify" input, not to force ourselves to do rote memorization.
- Delete or suspend cards ruthlessly. If it makes you groan when you see it, it's gone.

Remember, you can have another 1,000 cards for 15 minutes of work and US$0.60. All you need is audio that Whisper-1 can transcribe mostly accurately. And so far, Whisper-1 has been terrifyingly good at transcribing clearly enunciated Spanish. So if you actually need to know something, it will show up on a better card soon enough. If it doesn't show up on another card soon, you probably didn't need it!

And don't forget, you can make a nice web page for reviewing your clips, or generate an MP3 playlist with just the dialog:

Code: Select all

substudy export review avatar_01_07.mkv avatar_01_07.es.srt avatar_01_07.en.srt
substudy export tracks avatar_01_07.mkv avatar_01_07.es.srt

I'll post a binary release later this weekend, so people can try it out! But for now, I'm going to go for a walk and listen to Spanish. I got a cheap set of open-ear, bone conduction headphones, so I'll be able to hear any bears sneaking up on me.

Post by **emk** » Sun Mar 17, 2024 11:25 am

Substudy 0.6.0 is out, with support for Whisper-1 transcriptions and automatic translations!

Honestly, if you're comfortable with basic command-line use, and if you're interested in working intensively with media files in a major language, then I feel like this release is worth checking out. This is the first release that can take a video file, transcribe subs, translate the subs, and turn them into a CSV file for Anki import, all with just a handful of commands and five minutes. You'll need an OpenAI key for transcription and translation, but it's amazing to see it work.

You may get worse results with less popular languages or with harder audio. Outside of the biggest 6-12 languages, transcription will likely go downhill rapidly.

Converting episodes to tracks. This is super easy:

You'll need some way to copy the files to your phone and play them. You can probably just add them to iTunes normally. On Android, I'm using Autosync for Google Drive and Musicolet. When this is done, I can play just the dialog:

Putting this on loop allows me to listen to an episode in under 10 minutes. This gets pretty intense, actually.

Headphone review. This kind of track review works well with my new discounted Shokz headphones. I tried these out on my walk yesterday, and they're pretty handy. I can hear everything around me normally, I can hear my phone with no problem, and nobody else can really hear anything. So if you want to imitate Khatzumoto and salvage every spare moment for language study, this sort of bone conduction headphone is potentially useful.

Studying episode 7. Over half a decade ago, when my original Subs2SRS experiment was active, I wrote:

emk wrote:Episode 7 (watched in French about a year ago). This is the purest test of comprehension. I remember the plot of this episode (more or less), but I've forgotten all the details. And I have good news: I can follow most of what's going on! I'm definitely missing over half the dialog, but I almost always get at least several words per sentence, and sometimes I get multiple consecutive lines. Overall, this is actually pretty fun.

I've almost certainly lost some of my ability to just sit down and watch an unfamiliar episode (and I haven't even tried re-watching 7 yet). But I'm sure I can get that back. In the meantime, I'm getting a lot out of these cards—there's plenty of new stuff in the details.

Nubes has that [β] sound, which still throws me for a loop when I hear an unfamiliar word. Good practice for my ear; I'm going to have to upgrade this sound to a real phoneme in my head. And I plan to convert the songs Causa y efecto and No Hay Nadie Como Tú into cards, too, so I can switch off to music when I get overwhelmed by dialog.

Anyway, I now have enough tools and content to actually study Spanish! This will keep me moving forward on Spanish while I figure out how to build a video-watching UI, or mess with ebooks. :lol:

Post by **emk** » Sun Mar 17, 2024 11:39 pm

Substudy 0.6.1 is out at the usual place. (See my previous message.) I spent about two hours modifying it to extract album covers from music files, which was ridiculous yet strangely satisfying.

But now I can make a music file into Anki cards almost entirely automatically:

There's a lot of good vocabulary in these songs! And once these cards mature, I'll be able to reinforce lots of interesting bits of Spanish every time I listen to this playlist.

And Avatar goes well:

Avatar is just a goldmine of conversation, including lots of travel and meeting people. There's a lot I can pull from it.

On the first card, I'm still working to internalize the fact that Spanish is a pro-drop language. English and French aren't, partly because French has mandatory clitic pronouns that might be written separately from the verb, but which are effectively part of it. By now, I've mostly internalized things like no es culpa tuya. But my brain still flails around with subordinate clauses like que pasara, expecting a subject pronoun of some sort. Especially in native-speed speech. I'm more likely to have comprehension gaps here.

But hey, that's the whole point here. I'm trying to leap straight into full speed intermediate audio, because (1) it's more fun, and (2) hopefully I won't need to speed up my basic processing in order to reach a solid B1. I have the tools and I have the input volume, because I don't need to look for subs anymore.

On the other card, you can see that AnkiDroid is slick. If I'm curious about a phrase, I can ask Google Translate with just a press and a tap.

Probably I'd need 3,000-5,000 words to really thrive with native media. I can potentially learn more than 1 word per card. But I can't rely on TV dialog and music lyrics forever, so I want to get into interlinear books in the coming months. At 10-15 cards per day, it's going to take at least 6 months to get good coverage of core vocabulary and start to internalize it.
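
The back-of-the-envelope arithmetic behind that estimate, as a Python sketch. The function name and the `words_per_card` parameter are illustrative assumptions; it ignores forgetting, suspended cards and words learned outside Anki.

```python
import math


def days_to_cover(target_words: int, cards_per_day: int, words_per_card: float = 1.0) -> int:
    """Days of new cards needed to have seen `target_words` new words."""
    return math.ceil(target_words / (cards_per_day * words_per_card))


if __name__ == "__main__":
    # Best case from the figures above: 3,000 words at 15 new cards per day.
    print(days_to_cover(3000, 15))
    # Worst case: 5,000 words at 10 new cards per day.
    print(days_to_cover(5000, 10))
```

At one new word per card, the best case works out to 200 days, which is where "at least 6 months" comes from. Picking up more than one word per card shortens that proportionally.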

Today's discovery. Spotify suggested a song with some nicely enunciated lyrics:

This should make great cards.

Today's study. I can now settle into a nice workflow.

- Anki reviews: 10 new Avatar cards, 10 new music cards. About 20 minutes for everything, mostly while taking a walk. I was driven inside by hail.
- Avatar: Listened to just the dialog of Avatar 01.02. Call it another 10-12 minutes.
- Music: Listened to a bunch with 50% attention.

Post by **kundalini** » Mon Mar 18, 2024 2:25 am

emk wrote:But I can't rely on TV dialog and music lyrics forever, so I want to get into interlinear books in the coming months.

What are you thinking of doing with interlinear books?

Post by **emk** » Mon Mar 18, 2024 3:07 am

kundalini wrote:What are you thinking of doing with interlinear books?

Ideally? About 500 pages of reading. :lol:

That would be enough to kick my input skills up near B1 somewhere.

The challenge is that all my vocabulary is spoken stuff, and my overall skill levels are (to put it charitably) sub-A1 except for some weird peaks and specialized vocabulary. My official "study" of Spanish was about 30 hours over half a decade ago, plus several seasons of partially-followed TV, and whatever I just picked up in the last two weeks while working on substudy.

But now that I have some very rudimentary Spanish grammar and some vocab, if I can read a sentence in English, then I have maybe an 80% chance of understanding how the Spanish sentence works. The cognate discounts I get from English and French are no joke.

So if I had a nice 500-page book, plus an interlinear translation, I could maybe knock out 5 pages per day? It would be an interesting experiment.

Post by **emk** » Mon Mar 18, 2024 8:28 pm

Today:

- 39 Anki cards (15 new) in around 17 minutes.
- Listened to the dialog of Avatar 1.07 twice during a lazy stroll in the sun. Apparently I can listen to an episode twice in a mile!

So I listened to the same episode I'm learning with Anki. This provides a boost, and throws off review timings. But I don't really care about review timings much. I do like how much of the dialog I could understand: Some parts were still difficult, especially without the video. But there were several sections where I got most of a conversation. And since I last saw this episode over 5 years ago (without subs), that's actually pretty good!

Here are some cards where I was doing surprisingly well:

I could get ~5/6ths of these lines without any major problems, even without the video. Some others require much closer attention and multiple listens. A lot of times, I'm just missing a vocab word or a new idiom, and once I have that, I can get the line in the future.

A cool Anki discovery. Anki makes it annoyingly hard to generate cards in other tools, then import them into Anki. There's no API on AnkiWeb for this. And importing by hand requires creating models and templates, and copying media into a hidden collection.media directory. But I have just discovered the Anki-Connect plugin, which provides a local HTTP API to any other application on your computer.

I should start by adding a "substudy export anki" feature which talks directly to Anki-Connect and which automates the whole process. Since importing into Anki is the most difficult step by far, this would be a win.
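
For the curious, here is roughly what talking to Anki-Connect looks like from Python. Anki-Connect accepts JSON requests over HTTP on localhost (port 8765 by default), and "addNote" is one of its actions. Everything else here is an assumption for illustration: the deck name, the model name, the sample sentence, and the helper names `add_note_request` and `invoke`. The field names are borrowed from the card templates earlier in this log. This is a sketch, not the planned substudy feature.

```python
import json
import urllib.request

ANKI_CONNECT_URL = "http://127.0.0.1:8765"  # Anki-Connect's default address


def add_note_request(deck: str, model: str, fields: dict, tags=()) -> dict:
    """Build the JSON body for Anki-Connect's "addNote" action."""
    return {
        "action": "addNote",
        "version": 6,
        "params": {
            "note": {
                "deckName": deck,
                "modelName": model,
                "fields": fields,
                "tags": list(tags),
            }
        },
    }


def invoke(request: dict, url: str = ANKI_CONNECT_URL):
    """POST a request to a running Anki with Anki-Connect installed."""
    data = json.dumps(request).encode("utf-8")
    req = urllib.request.Request(url, data=data, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as response:
        reply = json.load(response)
    if reply.get("error") is not None:
        raise RuntimeError(reply["error"])
    return reply["result"]


if __name__ == "__main__":
    # Hypothetical deck, model and card content; requires Anki to be open.
    note = add_note_request(
        "Spanish::Avatar",
        "Substudy Card",
        {"Foreign Curr": "¡Vamos!", "Native Curr": "Let's go!"},
        tags=["avatar"],
    )
    print(invoke(note))
```

Images and audio would still need to be copied into the collection separately, which this sketch does not attempt.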
