Best automatic speech-to-text for challenging audio? Whisper? Something else?

emk
Black Belt - 1st Dan
Posts: 1710
Joined: Sat Jul 18, 2015 12:07 pm
Location: Vermont, USA
Languages: English (N), French (B2+)
Badly neglected "just for fun" languages: Middle Egyptian, Spanish.
Language Log: viewtopic.php?f=15&t=723
x 6746

Best automatic speech-to-text for challenging audio? Whisper? Something else?

Postby emk » Thu Feb 22, 2024 3:33 pm

Hello! I've been thinking about language learning techniques again. One of the biggest limitations to an otherwise excellent tool like Subs2SRS is the need for accurate subtitles.

I've heard good things about Whisper and a few other speech recognition tools. Has anyone tried using these with moderately challenging audio? Specifically, I'm interested in slang, poorly enunciated speech, and mild regional accents. (For French speakers, think of something like Engrenages, Kaamelott or the Taxi films. These tend to be good listening exercises when working on C1 and C2.)

Are we at the point yet where we can pass an audio file to an API, and get back an accurate transcript? Can we get timing information, too? It's OK if it costs money.
4 x

iguanamon
Black Belt - 2nd Dan
Posts: 2363
Joined: Sat Jul 18, 2015 11:14 am
Location: Virgin Islands
Languages: Speaks: English (Native); Spanish (C2); Portuguese (C2); Haitian Creole (C1); Ladino/Djudeo-espanyol (C1); Lesser Antilles French Creole (B2)
Studies: Catalan (B2)
Language Log: viewtopic.php?t=797
x 14269

Re: Best automatic speech-to-text for challenging audio? Whisper? Something else?

Postby iguanamon » Thu Feb 22, 2024 6:57 pm

A few months ago, member Kraut posted a link to Voscribe. I've tried it out with two of the languages I speak. The example below is from a Brazilian Portuguese audiobook I have. Voscribe's transcription of the audio comes first, followed by the book text.
Luis Verríssimo: As mentiras que os homens contam wrote:Portuguese
Ela o encontrou pensativo, em frente aos vinhos importados. Quis virar, mas era tarde. O carrinho dela parou junto ao pé dele. Ele a encarou, primeiro sem expressão, depois com surpresa, depois com embaraço, e no fim os dois sorriram.


Tinham estado casados seis anos e separados um, e aquela era a primeira vez que se encontravam depois da separação. Sorriram, e ele falou o antes dela. Quase falaram ao mesmo tempo.

— Você estava morando por aqui?

— Na casa do papai.

— Na casa do papai. Ele sacudiu a cabeça, fingiu que arrumava alguma coisa dentro do seu carrinho, enlatados, bolachas, muitas garrafas, tudo pra ela não ver que ele estava muito emocionado.


Soubera da morte do ex-sogro, mas não se animara a ir ao enterro. Fora logo depois da separação, ele não tivera coragem de ir dar condolências formais à mulher que uma semana antes ele chamara de vaca. Como era mesmo o que ele tinha dito?

Tu és uma vaca sem coração. Ela não tinha nada de vaca, era uma mulher esbelta e continuava bonita, mas na hora não lhe ocorrera outro insulto. Fora a última palavra que lhe dissera. E ela o chamara de farsante. Achou melhor não perguntar pela mãe dela.
Ela o encontrou pensativo em frente aos vinhos importados. Quis virar, mas era tarde, o carrinho dela parou junto ao pé dele. Ele a encarou, primeiro sem expressão, depois com surpresa, depois com embaraço, e no fim os dois sorriram.

Tinham estado casados seis anos e separados, um. E aquela era a primeira vez que se encontravam depois da separação. Sorriram e ele falou antes dela; quase falaram ao mesmo tempo.

- Você está morando por aqui?

- Na casa do papai.

Na casa do papai! Ele sacudiu a cabeça, fingiu que arrumava alguma coisa dentro do seu carrinho - enlatados, bolachas, muitas garrafas -, tudo para ela não ver que ele estava muito emocionado.

Soubera da morte do ex-sogro, mas não se animara a ir ao enterro. Fora logo depois da separação, ele não tivera coragem de ir dar condolências formais à mulher que, uma semana antes, ele chamara de vaca. Como era mesmo que ele tinha dito?

"Tu és uma vaca sem coração!" Ela não tinha nada de vaca, era uma mulher esbelta, mas não lhe ocorrera outro insulto. Fora a última palavra que lhe dissera. E ela o chamara de farsante. Achou melhor não perguntar pela mãe dela.

The first block above is Voscribe's transcription of the audiobook. The second is the text of the book.

You can judge for yourself, but I think it did an excellent job for free here. The free version covers a limited set of languages; French is included. The audio is limited to under 10 minutes. There is also a paid version. Try it with French and see how it works for you. I don't know how it will work with TV, conversational or podcast audio. The website claims to be able to create subtitles.
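
For anyone who wants a number rather than an eyeball comparison, one common yardstick is word error rate (WER): the word-level edit distance between the machine transcript and the reference, divided by the number of reference words. Here is a minimal Python sketch, assuming punctuation and capitalization should be ignored. The function names are made up for illustration and are not part of Voscribe or any library.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation, and split into words."""
    return re.findall(r"\w+", text.lower())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # word missing from hypothesis
                            curr[j - 1] + 1,      # extra word in hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


book = "Você está morando por aqui?"
machine = "Você estava morando por aqui?"
print(f"WER: {word_error_rate(book, machine):.2f}")  # prints "WER: 0.20"
```

One wrong word out of five gives 0.20 for that single line. Over a whole passage like the one above, the figure would be far lower, since most lines match the book apart from punctuation.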

I would've loved to have had something like this, not bad, when I was actively learning Spanish and Portuguese. Shame it doesn't cover my other languages.
7 x

Amandine
Orange Belt
Posts: 180
Joined: Tue Nov 02, 2021 8:45 am
Location: Sydney, Australia
Languages: English (N), French (B1/B2), Russian (B1), Romanian (A1, casual playing on Duolingo), Yiddish (ditto)
x 919

Re: Best automatic speech-to-text for challenging audio? Whisper? Something else?

Postby Amandine » Thu Feb 22, 2024 7:55 pm

I have been using Maestra for two weeks or so, which I found by googling something like "ai transcription francais". It worked for me, so I stopped my search there; there may be better ones, but I'm satisfied. It's $10 USD for an hour.

I have mostly used it for YouTube videos, but the most challenging one was Koh Lanta, the French version of Survivor. I recorded audio of an episode on the voice notes app on my phone and ran that through Maestra. So the source wasn't great audio, they use a lot of informal spoken French and there is constantly music in the background, but it did a very quick and extremely accurate job. Well, it's long so I haven't gone through most of it, but the opening few minutes are basically perfect.

I also like the interface, the audio is included at the top of the transcript and the words change colour as it plays so you can follow along very easily. It doesn't always distinguish Speaker 1 and Speaker 2 etc correctly, that's the main editing I have to do of the transcripts. It has timestamps but only about once a minute.

I have a lot of qualms about this whole area and if 80% of Silicon Valley was pitched into the sun we'd be better off BUT I must say this is a game changer for me learning French, particularly as I'm focusing on actually spoken French this year.

Edit: It says it has a lot of languages available, list here.
Last edited by Amandine on Fri Feb 23, 2024 8:22 am, edited 1 time in total.
8 x

Adrianslont
Blue Belt
Posts: 827
Joined: Sun Aug 16, 2015 10:39 am
Location: Australia
Languages: English (N), Learning Indonesian and French
x 1936

Re: Best automatic speech-to-text for challenging audio? Whisper? Something else?

Postby Adrianslont » Fri Feb 23, 2024 8:10 am

I use Subtitle Edit, which I know you are familiar with, emk.

You can use various versions of Whisper or Vosk from within that.

It’s really just one extra simple step in my subs2srs workflow and it solves the issue of not having TL subs or having inaccurate TL subs.

Whisper offers more language options but I have been unable to get it to work. Others have though. Vosk works fine for French.

I recommend you download the latest version of Subtitle Edit and look under the Video menu.

And it’s the right price. Free!

I have also used the web service https://freesubtitles.ai/ which has a freemium model. It worked exceptionally well for Indonesian, which Vosk doesn’t handle, but it has now gone flaky on me and I can no longer successfully use it. Maybe it works if you pay but I’m not going to risk it.

I agree, speech to text is an exciting innovation for language learners and I imagine it will only improve and any friction will be reduced.
5 x

zac299
White Belt
Posts: 23
Joined: Fri Feb 09, 2024 2:43 am
Languages: English (N)
Spanish (Beginner)
x 82

Re: Best automatic speech-to-text for challenging audio? Whisper? Something else?

Postby zac299 » Fri Feb 23, 2024 9:03 am

Emk, I've done a lot of playing around for exactly what you're describing, except for my professional work, not to aid my language learning.

I'm sorry to say I'd be surprised if you find an answer that comes anywhere close to what I think you'd consider success with the type of audio input you're talking about (Far, far from perfect and eloquent like I'd imagine iguanamon's audiobook content would be).

If money isn't an issue, for most foreign languages, you'd be able to hire someone in the native language to take their time and do a literal transcription for you.

Most languages will have very cheap labour available for you... So even multi-hour pieces of audio wouldn't actually cost you too much to get transcribed to native-level quality.

Sites like Upwork will give you a wide array of such freelance talent, but you'll overpay because everything's charged in USD and there's middle-men fees.

If it's not a problem for you, give the same 10 minutes of audio to 5 different workers. Pick the person who does the best job, then communicate with them away from Upwork and pay them directly for subsequent jobs. You'll get cheaper labour very quickly that way AND find the best person for your job.

It's probably not the answer you want, but as I said, I've searched a LOT for the exact solution you're describing (Because it would literally save me $1,000s/year of my time professionally if I could find something 99.9% perfect) which is automated.

But, as of yet, I don't think it exists.
4 x

Iversen
Black Belt - 4th Dan
Posts: 4792
Joined: Sun Jul 19, 2015 7:36 pm
Location: Denmark
Languages: Monolingual travels in Danish, English, German, Dutch, Swedish, French, Portuguese, Spanish, Catalan, Italian, Romanian and (part time) Esperanto
Ahem, not yet: Norwegian, Afrikaans, Platt, Scots, Russian, Serbian, Bulgarian, Albanian, Greek, Latin, Irish, Indonesian and a few more...
Language Log: viewtopic.php?f=15&t=1027
x 15066

Re: Best automatic speech-to-text for challenging audio? Whisper? Something else?

Postby Iversen » Fri Feb 23, 2024 9:55 am

It may be slightly off-topic, but I have noticed that for some languages YouTube now offers same-language subtitles as a general rule - for instance English, Spanish and (lo and behold) Dutch. For others, not yet - like my Slavic target languages. But maybe that facility will come soon, since the process seems to be gaining momentum right now. I am pretty sure that those subtitles are made by some automated speech recognition software, but for a language learner trying to hang on to speech in a weak language they can still be useful in spite of their errors. I often watch videos about scientific topics, and it's obvious that the machine hasn't been trained specifically on that kind of stuff - and the results can be hilarious. But those subtitles are still helpful - I just wish they would help me with languages I can't already understand without help.
2 x

emk
Black Belt - 1st Dan
Posts: 1710
Joined: Sat Jul 18, 2015 12:07 pm
Location: Vermont, USA
Languages: English (N), French (B2+)
Badly neglected "just for fun" languages: Middle Egyptian, Spanish.
Language Log: viewtopic.php?f=15&t=723
x 6746

Re: Best automatic speech-to-text for challenging audio? Whisper? Something else?

Postby emk » Fri Feb 23, 2024 1:08 pm

A huge thank you to everyone who has responded! You've definitely pointed me in some interesting directions. I will need to pull a sound track from something slangy, fast and accented, and see how well it works with the tools you've mentioned.

zac299 wrote:I'm sorry to say I'd be surprised if you find an answer that comes anywhere close to what I think you'd consider success with the type of audio input you're talking about (Far, far from perfect and eloquent like I'd imagine iguanamon's audiobook content would be).

This is disappointing to hear. I'd heard some very good things about Whisper in particular; it sounded like they might have actually made real progress, like they did in several other areas.

I'm curious about this because I strongly suspect that Subs2SRS-style audio cards have some untapped potential at all levels:

  1. At A1 and A2, the right kind of audio cards allow you to have listening as your strongest(!) skill instead of your weakest. And you can essentially create your own "Assimil course" out of any native media that would normally require you to be a good chunk of the way to B2. Having tried this, it's actually better than Assimil in many ways, and far better than Duolingo.
  2. At B1 and B2, audio cards allow you to completely understand almost any chunk of difficult audio that you select. The dream here is to be able to watch TV and movies normally, and have a magic button that says, "Turn the last 30 seconds of audio into cards, so I can figure that out later." Some browser plugins can do this for Netflix, but I don't think we've exhausted the possibilities here. But at this level, you want to be very selective—there's no point in turning entire episodes into cards after your first season or three of TV. You want to quickly snag some material, and to keep on watching TV.
  3. At C1 and C2, I suspect audio cards should allow faster progress with slangy speech and regional accents, and the general challenge of understanding virtually everything on TV. You'd want a UI similar to what I described for B1 & B2, but the audio will be very difficult for speech-to-text tools.
The idea here at B1 and above is balancing "extensive" listening work (which starts being incredibly productive) with "intensive" listening study. I think it makes sense to spend 10-20% of your listening time on intensive study, but this would presumably be much more efficient if you could effortlessly extract the hard parts into SRS cards without interrupting your extensive activities every 2 minutes.
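
To make the "magic button" a bit more concrete: assuming the player can report the current playback position, and we already have timed subtitle cues (from a subtitle file or a speech-to-text tool), picking the material for cards is just an interval-overlap filter. A rough Python sketch follows. `Cue` and `cues_in_last_window` are hypothetical names, not part of Subs2SRS or Anki.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cue:
    start: float  # seconds from the start of the video
    end: float
    text: str


def cues_in_last_window(cues: list[Cue], position: float,
                        window: float = 30.0) -> list[Cue]:
    """Return the cues that overlap the `window` seconds before `position`."""
    window_start = max(0.0, position - window)
    return [c for c in cues if c.end > window_start and c.start < position]


cues = [
    Cue(100.0, 103.5, "first line"),
    Cue(118.0, 121.0, "second line"),
    Cue(140.0, 144.0, "third line"),
    Cue(151.0, 153.0, "not heard yet"),
]
# Button pressed at 2:30 (150 s): the window covers 120 s to 150 s.
for cue in cues_in_last_window(cues, position=150.0):
    print(cue.text)
```

A cue that straddles the start of the window is kept whole, so a card never begins mid-sentence.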

I don't think paying humans to transcribe is a feasible model here. But some kind of API would be great, even if it cost money.

zac299 wrote:It's probably not the answer you want, but as I said, I've searched a LOT for the exact solution you're describing (Because it would literally save me $1,000s/year of my time professionally if I could find something 99.9% perfect) which is automated.

I think this would start to be worthwhile if I could get accurate transcripts for 4 subtitles out of 5. Which shouldn't require 99.9% accuracy.
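
A back-of-the-envelope version of that claim: if a subtitle has n words and each word is transcribed correctly with probability p, independently of the others, then the whole subtitle comes out right with probability p^n. Independence is a simplifying assumption, since real errors tend to cluster in the hard passages. Solving p^n = 0.8 for p gives the per-word accuracy needed. A small Python sketch:

```python
def required_word_accuracy(target_subtitle_accuracy: float,
                           words_per_subtitle: int) -> float:
    """Per-word accuracy p such that p ** n equals the target.

    Assumes word errors are independent, which is only an approximation.
    """
    if not 0.0 < target_subtitle_accuracy <= 1.0:
        raise ValueError("target must be in (0, 1]")
    if words_per_subtitle < 1:
        raise ValueError("need at least one word per subtitle")
    return target_subtitle_accuracy ** (1.0 / words_per_subtitle)


for n in (5, 8, 12):
    p = required_word_accuracy(0.8, n)
    print(f"{n:2d} words per subtitle -> {p:.1%} per-word accuracy")
```

Under those assumptions, the requirement works out to roughly 96% to 98% per-word accuracy for subtitles of 5 to 12 words, which is well short of 99.9%.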

Of course, you'd probably want to replace Anki with something better integrated, and with very different defaults.
2 x

iguanamon
Black Belt - 2nd Dan
Posts: 2363
Joined: Sat Jul 18, 2015 11:14 am
Location: Virgin Islands
Languages: Speaks: English (Native); Spanish (C2); Portuguese (C2); Haitian Creole (C1); Ladino/Djudeo-espanyol (C1); Lesser Antilles French Creole (B2)
Studies: Catalan (B2)
Language Log: viewtopic.php?t=797
x 14269

Re: Best automatic speech-to-text for challenging audio? Whisper? Something else?

Postby iguanamon » Fri Feb 23, 2024 4:21 pm

Just tried Voscribe with the pre-revolutionary Cuban radio show "La Tremenda Corte". This is a radio program from the 1950s with around five characters. I had the 15-minute audio cut to about seven minutes. Let's see how it did. I consider the character "Tres Patines" slangy and difficult to understand. Many second-language Spanish speakers have difficulty understanding Cuban-accented Spanish. Lots of slang too. The Tres Patines character is noted for butchering Spanish and coming up with lots of word play between himself and Señor Juez (Mr. Judge). Here's the audio. The transcription comes in at 01:25.

I chose this episode because I have a transcript for it. Of course, the transcript is formatted differently than Voscribe's output. The parallel format here is hard to get right. Voscribe does play the audio with the words lit up as they are spoken. Voscribe's output comes first below; the original transcript follows it:
La Tremenda Corte- Relampaguicidio wrote:Voscribe
Sí es verdad, doctor. A un chino que vive en el cuarto número 5, como siempre anda con la ropa muy remendada, le puso amarillo con ropa vieja. ¡Mi señor! A mí, como tengo una casita en la playa de Santa Fe, me dice el gallego santaféisimo. Y a usted, como siempre se está quejando de enfermedades que no tiene, ¿sabe cómo le dice?
¿Cómo me dice? El puente sobre el río Kwai. ¿Y eso por qué? Porque dice que usted también es medio tocado y medio chiflado. Ah, sí, secretario.
Sí. Póngale 20 pesos de multa en Ananina. Con permiso, perdóneme que coarte. Sin que eso sea meterme, señor juez, lo que no me importa, a mí me parece, tengo la impresión, de que 20 pesos de multa por un puentecito sin importancia, es un abuso que el tribunal está cometiendo con esa desgraciada mujer. Oiga, oiga, suspenda eso de desgraciada, ¿sabe?
No, que no me meta la pata, señora, que la estoy defendiendo. No, no me defienda, déjeme. De manera que 20 pesos de multa por ese puentecito le parece a usted demasiado, ¿no es así, Tres Patines? Sí, señor, yo suplico al tribunal Que sea buen émbolo. ¿Cómo buen émbolo?
¿Eh? ¿Cómo buen émbolo? Buen émbolo, que sea noble. Venébolo. Venébolo, sí.
Venébolo. Que caperucite. Que caperucite, que recapacite. Que recapacite. Y que en lugar de ponerle veinte pesos, pues le ponga cuarenta nada más.
¿Qué dice? Pero ahí usted me está defendiendo. Hombre, claro que sí. ¿Y cómo pide usted que me pongan cuarenta pesos en lugar de veinte? Señora, no sea zanahoria rebozada.
Yo no digo cuarenta pesos, sino cuarenta centavos. Bueno, pues aclare bien eso, porque seguro que el juez te entendió pésimo y lo manda por la cabeza. No, no, no. Usted está equivocada. El señor juez es bruto, pero no tanto.
Si es verdad, doctor. A un chino que vive en el cuarto número 5, como siempre anda con la ropa muy remendada, le puso "amarillo con ropa vieja"... si señor. A mí como tengo una casita en la playa de Santa Fé, me dice "el gallego santafeísimo". Y a usted como siempre se esta quejando de enfermedades que no tiene, ¿Sabe como le dice? Cómo me dice? El puente sobre el río Kwai ¿Y eso por qué? Porque dice que usted también es medio tocado y medio chiflado. Secretario, póngale 20 pesos de multa a Nananina. ¡Con permiso!, perdóname que coarte... sin que eso sea meterme señor Juez en lo que no me importa, a mi me parece, tengo la impresión... Si... ...de que 20 pesos de multa por un puentecito sin importancia es un abuso que el tribunal esta cometiendo con esa desgraciada mujer. ¡Oiga, oiga!, Shhh, oiga, suspenda eso de desgraciada, ¿sabe? ¡Cállese la boca!, no meta la pata señora, que la estoy defendiendo. No, no me defienda, dejese de eso.

De manera que 20 pesos de multa por ese puentecito le parece a usted demasiado, ¿no es así
Trespatines? Trespatines Si, señor Juez. Yo suplico al tribunal que sea "buen émbolo", y que...
¿Cómo "buen émbolo"? Ehh? ¿Cómo "buen émbolo"? "Buen émbolo", que sea noble.
Benévolo Benévolo, si... Benévolo ...que caperucite ¿Que caperucite qué? Que piense.
¡Que recapacite! Recapacite... y que en lugar de ponerle 20 pesos, pues le ponga 40, nada más...
¿Qué dice?, ¿pero oiga, usted me esta defendiendo? Hombre, claro que si, chico
¿Y cómo pide usted que me pongan 40 pesos en lugar de 20?
Señora, no sea zanahoria rebozada... yo no digo 40 pesos, sino 40 centavos
Bueno, pues aclare bien eso, porque seguro que el Juez entendió pesos y me los manda por la
cabeza No, no, no... usted esta equivocada, el señor Juez es bruto, pero no tanto...

So this is not perfectly clear audio. It's slangy, Cuban Spanish. It messed up "Benévolo" to "Venébolo". An understandable mistake with "B"/"V" in Spanish. Still, not bad at all, I'd say.
I don't use this, or any other transcription tools for Spanish or Portuguese. I don't need them. Years ago, it would have been quite useful.
5 x

Axon
Blue Belt
Posts: 776
Joined: Thu Jun 16, 2016 12:29 am
Location: California
Languages: Native English, in order of comfort: Mandarin, German, Indonesian,
Spanish, French, Russian,
Cantonese, Vietnamese, Polish.
Language Log: viewtopic.php?f=15&t=5086
x 3300

Re: Best automatic speech-to-text for challenging audio? Whisper? Something else?

Postby Axon » Fri Feb 23, 2024 6:08 pm

emk wrote:
  1. At A1 and A2, the right kind of audio cards allow you to have listening as your strongest(!) skill instead of your weakest. And you can essentially create your own "Assimil course" out of any native media that would normally require you to be a good chunk of the way to B2. Having tried this, it's actually better than Assimil in many ways, and far better than Duolingo.
  2. At B1 and B2, audio cards allow you to completely understand almost any chunk of difficult audio that you select. The dream here is to be able to watch TV and movies normally, and have a magic button that says, "Turn the last 30 seconds of audio into cards, so I can figure that out later." Some browser plugins can do this for Netflix, but I don't think we've exhausted the possibilities here. But at this level, you want to be very selective; there's no point in turning entire episodes into cards after your first season or three of TV. You want to quickly snag some material, and to keep on watching TV.
  3. At C1 and C2, I suspect audio cards should allow faster progress with slangy speech and regional accents, and the general challenge of understanding virtually everything on TV. You'd want a UI similar to what I described for B1 & B2, but the audio will be very difficult for speech-to-text tools.
The idea here at B1 and above is balancing "extensive" listening work (which starts being incredibly productive) with "intensive" listening study. I think it makes sense to spend 10-20% of your listening time on intensive study, but this would presumably be much more efficient if you could effortlessly extract the hard parts into SRS cards without interrupting your extensive activities every 2 minutes.

I don't think paying humans to transcribe is a feasible model here. But some kind of API would be great, even if it cost money.


Whisper already has an API with documentation here: OpenAI Speech to Text. I've used the step of GPT-4 post processing before and it's very reliable in English.

According to this article, finetuning Whisper is not very difficult and just requires normal transcripts of whatever audio you want. That is, it doesn't have to be timed by words or in any weird format. A Comprehensive Guide for Custom Data Fine-Tuning with the Whisper Model

I've finetuned text models with little success (yet) and image models with much more success. When you have the right high-quality data to put in, the end result is worlds better than the base model. With Whisper large-v2 already very accurate for major languages, I'm optimistic that it could be finetuned further to capture less-standard accents. You could automate the training data production if you had an existing corpus of audio with subtitles in the language you want, such as slangy movies or TV shows already transcribed. Just split the audio into chunks of less than 30 seconds by following timecodes and making sure not to split in the middle of words. The finetuning would cost money, of course, but GPU time gets cheaper and cheaper every month.
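
The chunking step could look something like this in Python. It is only a sketch under some assumptions: the cues are already sorted (start, end, text) tuples in seconds, cuts are made only at cue boundaries (which is what keeps words intact), and a single cue longer than the limit is dropped rather than split. `group_cues` is a made-up helper name, and the actual audio slicing (with ffmpeg or similar) is left out.

```python
def group_cues(cues, max_seconds=30.0):
    """Greedily merge consecutive cues into chunks shorter than max_seconds.

    Returns a list of (start, end, text) chunks; each chunk's text is the
    transcript that would be paired with that slice of audio.
    """
    chunks = []
    current = []  # cues collected for the chunk being built

    def flush():
        if current:
            text = " ".join(c[2] for c in current)
            chunks.append((current[0][0], current[-1][1], text))
            current.clear()

    for cue in cues:
        start, end, _ = cue
        if end - start >= max_seconds:
            flush()   # close the chunk in progress
            continue  # and skip the over-long cue entirely
        if current and end - current[0][0] >= max_seconds:
            flush()   # adding this cue would reach the limit
        current.append(cue)
    flush()
    return chunks


cues = [
    (0.0, 4.0, "Bonjour."),
    (5.0, 12.0, "Vous êtes en retard."),
    (26.0, 33.0, "Je sais."),
    (34.0, 40.0, "On y va ?"),
]
for start, end, text in group_cues(cues):
    print(f"{start:6.1f} {end:6.1f}  {text}")
```

With these four cues the result is two chunks, 0 to 12 s and 26 to 40 s. The silence between 12 s and 26 s simply falls outside both chunks.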

To your point about levels, I'm not sure that's how it's worked in my experience. The only languages I approach C1 listening in are Mandarin and German, and each of those has taken me a decade to get there. I've always been a very listening-heavy learner and I don't feel that sentence cards would help much. If there's an accent I need to get used to, I get there by following a transcript of a conversation or monologue. Usually the problem is with vocabulary instead of not understanding words I already "know." Now, I do definitely use sentence cards for production. I watch shows and write down interesting sentences that I think are a good way of expressing ideas, or I grab sentences that have new words so I have context for the vocabulary study. The listening isn't the issue though.

At the beginning levels, I think you're right on. You could automate the production of a "hybrid" sentence deck, where sentences taken from native media, like with subs2srs, are supplemented with LLM-generated sentences at a similar level or filling in gaps for the learner. Harder sentences from the same native media could be simplified and read with TTS voices, then shown to the learner as a way to prep them for the more difficult content.
2 x

rdearman
Site Admin
Posts: 7264
Joined: Thu May 14, 2015 4:18 pm
Location: United Kingdom
Languages: English (N)
Language Log: viewtopic.php?f=15&t=1836
x 23341

Re: Best automatic speech-to-text for challenging audio? Whisper? Something else?

Postby rdearman » Sat Feb 24, 2024 12:24 am

Amazon Transcribe on AWS does a good job. And you have an account :D It's not free, but it costs pennies.
1 x

