Best automatic speech-to-text for challenging audio? Whisper? Something else?

Carl · Postby **Carl** » Sat Feb 24, 2024 10:41 pm

I really like Whisper, which I use through MacWhisper (https://goodsnooze.gumroad.com/l/macwhisper). After I first fooled around with a couple clumsy ways to install Whisper, I discovered an early version of MacWhisper. Since then, I've used nothing else. I've been delighted both by the program and the attention the programmer pays to continually improving it. It seems like he updates it every week or so, and he has a roadmap showing future improvements he plans to make.

A use I've found for it that I haven't seen mentioned before on this site: Now, when I want to read and listen to a book, I buy only the audiobook and make the e-book by running the audio through MacWhisper. MacWhisper can handle batch transcriptions, so if the audiobook is divided into chapters, then the program outputs a separate transcription file for each chapter.

I've given it some audio in Spanish or German that I've listened to a bunch of times without being able to understand a key part, and it's given an accurate transcript.

In the past, Whisper has had problems with audio containing multiple languages. It would transcribe everything in just one of the languages (Whisper itself has built-in translation capability) or just translate the entire transcription into English. I tested that today, and I was impressed.

For kicks and giggles, I gave MacWhisper audio with Norwegian, Swedish, and Danish speakers in conversation (https://www.youtube.com/watch?v=IOnetGsUtLE). For YouTube and some other audio or video files online, you can just paste the URL into MacWhisper and let it do the transcription from there; there's no need to download the audio in a separate step. Using the Large model and the input language set to Auto, I got a reasonably good transcription that changed languages with the speakers. (I checked the first 10 minutes or so.)

It wasn't perfect--when someone broke in with a quick comment in a different language, the transcription was mangled. And at least once, it introduced a Norwegian- or Danish-influenced misspelling of a common word into a long stretch of clear Swedish. Apart from that and some other small errors, it was quite accurate.

For subtitles in two languages, or bilingual parallel texts, MacWhisper will translate its transcription, with the help of the user's free DeepL key.

I like MacWhisper so much that I upgraded my 2015 MacBook Pro to a 2020 Mac mini last year, years before I otherwise would have replaced the MBP, in large part because I wanted to be able to do MacWhisper transcriptions quickly.

MorkTheFiddle · Postby **MorkTheFiddle** » Sun Feb 25, 2024 7:45 pm

After reading Carl’s post, I set up Google Colaboratory on my Windows 11 box to run Whisper AI. A post by Kevin Stratvert called Best FREE Speech to Text AI - Whisper AI shows how to set up Whisper AI and run it on Google Colaboratory.

Then I ran a 4:20 clip in French from the beginning of Agota Kristof’s Le grand cahier. I thought that the medium.en model had too many errors, so I ran Whisper again with the large model. Note that the medium model must be written medium.en, but the large model throws an error with an extension. If I read everything aright, the medium.en model took 21 seconds, and the large model took 2 minutes.

Later, when I listened to the audio while reading the transcription, I found no errors save for a few misspellings (d’y for dit, for example) and missing accent marks.

So Whisper AI may be the way to go. Later on I'll give it a shot with some Spanish.

Carl · Postby **Carl** » Sun Feb 25, 2024 8:15 pm

MorkTheFiddle wrote:
Then I ran a 4:20 clip in French from the beginning of Agota Kristof’s Le grand cahier. I thought that the medium.en model had too many errors, so I ran Whisper again with the large model. Note that the medium model must be written medium.en, but the large model throws an error with an extension.

The Whisper models with the .en extension are designed for English-only speech. For languages other than English, the models to use do not have the .en extension. While there is a large multilingual model, there is no large.en model, which is probably why it threw an error when you specified large.en.

There's an overview of languages and models here: https://github.com/openai/whisper/blob/main/README.md#available-models-and-languages
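
To make the naming rule concrete, here is a minimal Python sketch. The helper name pick_model is made up for illustration and is not part of Whisper; the size names are the basic ones from the README linked above (newer releases add further large variants, which this sketch leaves out).

```python
# Sizes that have an English-only ".en" variant, per the Whisper README.
# "large" is multilingual only, so "large.en" does not exist.
SIZES_WITH_EN_VARIANT = {"tiny", "base", "small", "medium"}
ALL_SIZES = SIZES_WITH_EN_VARIANT | {"large"}


def pick_model(size: str, language: str) -> str:
    """Return a Whisper model name suited to the language (hypothetical helper)."""
    size = size.lower()
    if size not in ALL_SIZES:
        raise ValueError(f"unknown model size: {size!r}")
    is_english = language.strip().lower() in {"en", "english"}
    # Only English audio benefits from the ".en" models, and only where one exists.
    if is_english and size in SIZES_WITH_EN_VARIANT:
        return f"{size}.en"
    return size


if __name__ == "__main__":
    print(pick_model("medium", "French"))   # multilingual model for French
    print(pick_model("medium", "English"))  # English-only variant
    print(pick_model("large", "English"))   # no ".en" variant for large
```

So for a French clip, the name to pass would be medium or large, never medium.en.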

Granrey · Postby **Granrey** » Mon Feb 26, 2024 9:28 pm

You can try Samsung's transcribe feature if you have a Samsung phone.
I find it very handy.

MorkTheFiddle · Postby **MorkTheFiddle** » Tue Feb 27, 2024 5:52 pm

Following up on my post about Whisper yesterday, I ran a Spanish segment through Whisper AI. The segment that I used was a 3m 52s excerpt (1,819 KB) from the first episode of the first season of the Telemundo Mexican soap Juego de Mentiras:
https://www.youtube.com/watch?v=20BAzoVJBeU&list=PLARNzRPnDAQ36poWsTOs2ezBH_ytuBC3m

Using the large model took 29m 57s. The medium model took longer, so I aborted it. Using the default small model

!whisper "Juegos.mp3" --language Spanish

took 5m 31s.
I would say the default model gave 95% accuracy, though it did miss altogether a bit of singing (but included some other singing). This model seems preferable to the large model for the savings in time that it affords.
A robust machine running Whisper locally might give faster times, I suppose.

Axon · Postby **Axon** » Tue Feb 27, 2024 6:25 pm

MorkTheFiddle wrote:A robust machine running Whisper locally might give faster times, I suppose.

I use Whisper on Colab and locally. When you have even a regular consumer Nvidia graphics card installed in your machine, the speedup over what you've described is incredible even with a large model. I only use Colab when I'm traveling with my laptop and need something transcribed. Colab Pro is $10 per month and allows you about 20 hours of compute time with the fastest available hardware, and up to several hundred hours with the lowest tier of "pro" hardware that surpasses what's available with a free instance.

I'm not a real software developer, but I'd like to be one day. If there were a web service for creating transcripts oriented towards language learning, what would you like it to be able to do?

Carl · Postby **Carl** » Wed Feb 28, 2024 1:40 am

Axon wrote:When you have even a regular consumer Nvidia graphics card installed in your machine, the speedup over what you've described is incredible even with a large model.

Yep. I pulled a 5:34 segment from the same Juego de Mentiras video and ran it on the Large model on MacWhisper, specifying that the language is Spanish. It took about a minute and 40 seconds to transcribe. This was on a 2020 Mac mini (using the Apple M1 chip), with 16 GB RAM.
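
For a rough comparison of the timings reported in this thread, one can compute a real-time factor: processing time divided by audio length, where a value below 1 means faster than real time. This is only a back-of-the-envelope sketch. The clips differ in length and content, the Mac mini figure is approximate, and the labels assume MorkTheFiddle's runs were on a free Colab instance.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a minutes/seconds pair to total seconds."""
    return minutes * 60 + seconds


def real_time_factor(processing_s: float, audio_s: float) -> float:
    """Processing time divided by audio duration (lower is faster)."""
    return processing_s / audio_s


# (processing time, audio length) as reported in the posts above.
runs = {
    "Colab, large model": (to_seconds(29, 57), to_seconds(3, 52)),
    "Colab, default small model": (to_seconds(5, 31), to_seconds(3, 52)),
    "M1 Mac mini, Large model": (to_seconds(1, 40), to_seconds(5, 34)),
}

for label, (processing_s, audio_s) in runs.items():
    print(f"{label}: {real_time_factor(processing_s, audio_s):.2f}x audio length")
```

By this measure the large model on Colab took roughly 7.7 times the clip length, the small model about 1.4 times, and the M1 run about 0.3 times.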

Amandine · Postby **Amandine** » Sun Mar 03, 2024 9:05 am

Carl wrote:I really like Whisper, which I use through MacWhisper (https://goodsnooze.gumroad.com/l/macwhisper). After I first fooled around with a couple clumsy ways to install Whisper, I discovered an early version of MacWhisper. Since then, I've used nothing else. I've been delighted both by the program and the attention the programmer pays to continually improving it. It seems like he updates it every week or so, and he has a roadmap showing future improvements he plans to make.

Thanks for this, Carl. Since I use a MacBook, I bought the Pro version of MacWhisper. It's appealing to pay once for the Pro version and get unlimited transcriptions, rather than the per-hour model of Maestra.

Based on a few comparisons, the quality of the transcriptions from MacWhisper and Maestra, which I had been using, is basically equivalent. They both make little mistakes - sometimes the same mistakes, sometimes different ones - but they're so small and/or understandable (like unusual anglicisms a youtuber made up on the spot) that it doesn't matter.

What I like better about MacWhisper:
- It breaks up the transcript into much more logical sections, whereas Maestra produces weird chunks that end the paragraph in the middle of a sentence and add a random new Speaker title. This is easy to edit in Maestra, but MW is still significantly superior here.
- The "Segment" option, which displays the transcript sentence by sentence.

On the other hand, with Maestra I can jump to anywhere in the transcript and the audio follows. So I can click back to a certain phrase or word if I want extra practice hearing it or shadowing it multiple times. As far as I can work out, there's no way to do this on MW - it's from the start or nothing. Also, I can edit the transcript in Maestra but don't seem to be able to do so in MW, within the app. I can download it into a .txt file, but I'd prefer to be able to do it within the app so I can have the audio following along. These two are a bit of a big deal for me with how I'm using it for languages.

A smaller thing I prefer about Maestra is that when you import directly from YouTube (which MW can also do) the video itself is embedded in the transcript screen and follows along.

Still, I like MW and I think I'll be going back and forth between the two depending on exactly what the audio is. Until I can find something that does everything!

Carl · Postby **Carl** » Thu Mar 07, 2024 3:26 am

Amandine wrote:On the other hand, with Maestra I can jump to anywhere in the transcript and the audio follows. So I can click back to a certain phrase or word if I want extra practice hearing it or shadowing it multiple times. As far as I can work out, there's no way to do this on MW - it's from the start or nothing. Also, I can edit the transcript in Maestra but don't seem to be able to do so in MW, within the app. I can download it into a .txt file, but I'd prefer to be able to do it within the app so I can have the audio following along. These two are a bit of a big deal for me with how I'm using it for languages.

Thanks for the transcription accuracy comparison and other comments, Amandine. I think I can help you with both of these issues with MacWhisper. I haven't used Maestra, so I can't compare them.

I would also like MacWhisper to have the option of clicking on a phrase and having the audio play from there. There are some workarounds:
1) You don't have to start audio playback from the beginning; there's a scrubbing control at the bottom of the window. As you scrub through the audio (in either the Transcript or Segments tab), the current chunk of text appears above the scrubbing control. So if you know the text reasonably well, you can speed-read while scrubbing to find the phrase or word, and then scrub back to it as many times as you want.
2) You could export the transcript in a format with timestamps, e.g., .srt, and then search for the phrase or word in question. Then you can scrub to where it appears and start playback from there. Scrub back to it to repeat; since the phrases appear over the scrubbing control, it's relatively easy to go back just one phrase.
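
Workaround 2 can be sketched in a few lines of Python, assuming the transcript was exported as a standard .srt file. The function names and the sample cues are invented for illustration; nothing here is part of MacWhisper.

```python
import re

# Matches an SRT timing line such as "00:01:05,250 --> 00:01:08,000".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2}),(\d{3})"
)


def parse_srt(srt_text):
    """Yield (start_seconds, end_seconds, text) for each cue in an SRT string."""
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, match.groups())
                start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
                end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
                # Everything after the timing line is the cue text.
                yield start, end, " ".join(lines[i + 1:]).strip()
                break


def find_phrase(srt_text, phrase):
    """Return (start_seconds, text) for every cue containing the phrase."""
    needle = phrase.casefold()
    return [
        (start, text)
        for start, _end, text in parse_srt(srt_text)
        if needle in text.casefold()
    ]


# Made-up sample cues, standing in for a real export.
SAMPLE_SRT = """1
00:00:01,000 --> 00:00:03,500
Buenos días, ¿cómo estás?

2
00:01:05,250 --> 00:01:08,000
No me digas mentiras.
"""

if __name__ == "__main__":
    for start, text in find_phrase(SAMPLE_SRT, "mentiras"):
        print(f"{int(start // 60)}:{start % 60:06.3f}  {text}")
```

The printed time tells you where to move the scrubbing control.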

But when I want to do a lot of repeat listening to a phrase, or shadowing, I export the transcript as subtitles in .srt and put them plus the .mp3 into WorkAudioBook (only available for Windows or Android, unfortunately), which is another pay-once program. If you want to stay on the Mac, you can do it on Audacity, too, for free. There's a spreadsheet tool available called Audacity Labels (TXT) Subtitles (SRT, SBV) Convertor that you might find helpful to view the transcript in Audacity.
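
If you'd rather skip the spreadsheet, a conversion along these lines is also possible in a short script. This is a sketch, not the converter mentioned above: it assumes a standard .srt export and Audacity's label format of tab-separated start time, end time, and label text, with times in seconds. The function name is hypothetical.

```python
import re

# One SRT timing line: "HH:MM:SS,mmm --> HH:MM:SS,mmm".
SRT_TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2}),(\d{3})"
)


def srt_to_audacity_labels(srt_text: str) -> str:
    """Convert SRT cues to Audacity label lines: start<TAB>end<TAB>text."""
    rows = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = SRT_TIMING.search(line)
            if match is None:
                continue
            h1, m1, s1, ms1, h2, m2, s2, ms2 = (int(g) for g in match.groups())
            start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
            end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
            # Join multi-line cues so each label stays on one line.
            label = " ".join(lines[i + 1:]).strip()
            rows.append(f"{start:.3f}\t{end:.3f}\t{label}")
            break
    return "\n".join(rows)
```

Save the output as a .txt file and bring it into Audacity with File > Import > Labels.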

As for editing the transcript, that's certainly possible within MacWhisper. Make sure you're viewing the transcript in the Segments view, not the Transcript view (in the Transcript/Segments/ChatGPT tabs at the top of the window).

Amandine · Postby **Amandine** » Sat Mar 09, 2024 11:23 pm

Carl wrote:I think I can help you with both of these issues with MacWhisper. I haven't used Maestra, so I can't compare them.

I think you must have manifested the editing ability into existence with this comment, because I am sure I couldn't do it when I tried multiple times but now I can. Don't know what I was doing there ... :oops:

Thanks also for the workarounds. I personally am not going to do any of them on account of being very lazy, but it's nice to know they are there.

I'll try scrubbing within the app and see if I can get that to work.

A language learners’ forum
