Open AI - Whisper Translation Tool

General discussion about learning languages
elAmericanoTranquilo
White Belt
Posts: 39
Joined: Sun Sep 18, 2022 5:11 pm
Languages: English (N), Spanish (B1)
Language Log: https://forum.language-learners.org/vie ... 15&t=18495
x 123

Open AI - Whisper Translation Tool

Postby elAmericanoTranquilo » Tue Oct 04, 2022 2:56 am

Whisper, a new free translation and transcription tool from OpenAI, came out last month:
https://openai.com/blog/whisper/
https://www.theverge.com/2022/9/23/2336 ... pen-source

Using it right now requires a computer and some comfort with command line tools, but I was able to get it working on my computer relatively quickly. Once installed, it enables you to transcribe an audio file using a single command. I tried it on a YouTube video of a Spanish song that I had downloaded (using pytube) and the transcription worked very well - I just had to make one correction.
Last edited by elAmericanoTranquilo on Sun Oct 09, 2022 7:06 am, edited 1 time in total.
4 x

ryanheise
Green Belt
Posts: 459
Joined: Tue Jun 04, 2019 3:13 pm
Location: Australia
Languages: English (N), Japanese (beginner)
x 1681

Re: Open AI - Whisper Translation Tool

Postby ryanheise » Tue Oct 04, 2022 1:36 pm

Yes, it's quite an impressive system. It is capable not only of transcribing audio but also of translating speech from other languages into English, and it does this audio-to-English step directly, without first transcribing into the source language and then translating the transcription.

However, there are some notable limitations:

1. The algorithm can sometimes get into a loop where it starts repeating the same output over and over again without tracking what is actually being said.
2. It will sometimes make up text that was never actually spoken in the audio, based on what it predicts people would normally say in these situations (more on this below).
3. Unlike other algorithms, it does not give you individual word timestamps, only segment timestamps.
4. You will need quite a powerful computer in order to be able to run the most accurate model, and in particular, a powerful NVIDIA GPU.
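Limitation 1 is fairly easy to spot after the fact, because the looping shows up as the same segment text repeated back to back. Here is a minimal sketch of such a check. It is not part of Whisper; the function name and the `min_repeats` threshold are hypothetical, and it assumes you already have the segment texts as a list of strings.

```python
def find_repeated_runs(segments, min_repeats=3):
    """Return (start_index, run_length) for runs of identical consecutive segments.

    Comparison ignores case and surrounding whitespace. This is a simple
    heuristic for spotting looping output, not anything built into Whisper.
    """
    normalized = [s.strip().lower() for s in segments]
    runs = []
    start = 0
    for i in range(1, len(normalized) + 1):
        # A run ends at the end of the list or when the text changes.
        if i == len(normalized) or normalized[i] != normalized[start]:
            length = i - start
            if length >= min_repeats:
                runs.append((start, length))
            start = i
    return runs


segments = ["Hola a todos.", "Gracias.", "Gracias.", "gracias. ", "Hasta luego."]
print(find_repeated_runs(segments))  # [(1, 3)]
```

Any run it reports is only a hint that a stretch of audio is worth re-checking by ear, since songs and chants can legitimately repeat a line.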

Interestingly, point 2 is believed to be caused by training data that wasn't sufficiently cleaned. To train the AI, they used a huge compilation of multiple audio databases for which they had both the audio and a transcript, and this process assumes that the transcripts were accurate in the first place. The preprocessing step cleans the data by removing parenthesised words, which are typically unspoken. Sometimes the heuristic is wrong and removes things that were actually spoken, such as people's names. The AI then goes through training hearing a word but seeing an empty hole in the transcript, and so it ends up just making stuff up.
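To make the failure mode concrete, here is a small sketch of the kind of heuristic described above. This is an illustration only, not OpenAI's actual preprocessing code; the regex, the function name and the sample sentences are all made up for the example.

```python
import re

# Matches a parenthesised or square-bracketed span plus any whitespace before it.
BRACKETED = re.compile(r"\s*[\(\[][^\)\]]*[\)\]]")


def strip_bracketed(transcript):
    """Remove bracketed spans, on the assumption that they were not spoken."""
    return BRACKETED.sub("", transcript).strip()


# Works as intended: the stage direction was never spoken.
print(strip_bracketed("(applause) Thank you all."))
# Thank you all.

# Goes wrong: the name was spoken, but the transcript now has a hole.
print(strip_bracketed("Next up is (Maria Lopez) from Madrid."))
# Next up is from Madrid.
```

In the second case a model trained on that pair would hear a name in the audio while the target text skips straight past it, which is the mismatch the post describes.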

The algorithm is also less reliable for languages where there is less training data. One of the interesting findings was that the AI showed a misleadingly high accuracy for one rare language, and it turned out on inspection of the training data that most of the transcripts in the database for that language were actually English and were mis-tagged as another language. This could be because people are lazy and don't bother labeling things sometimes, and in those cases, English would have been chosen as the default, or in many cases because the transcript was actually a subtitles file that was already translated into English rather than being a genuine transcript. This noise would be present in the data for all languages but in more popular languages this noise would be eclipsed by the valid data and would only cause a problem for the rare languages. Something to be aware of if using this tool to transcribe audio in one of those rarer languages.

If you have managed to run the tool, I am interested to know, what GPU do you have if any and what is the computation time?
Note that if you don't have a GPU, running the large model could take up to a couple of days just to transcribe 15 minutes. So it's not a tool that will be accessible to everyone.
9 x


Re: Open AI - Whisper Translation Tool

Postby elAmericanoTranquilo » Sun Oct 09, 2022 12:10 am

Thanks for the additional info about whisper - it's very interesting. Today I did a test using whisper to transcribe a 48 minute Spanish audio book that I downloaded from YouTube (De como San Nicolás llego a Simpson's Bar - Bret Harte - Leído por Victor Villarraza).

I have a 2021 MacBook Air M1 - the entry level model. It has a 7 core GPU, but there are definitely faster machines available. It took whisper 2 hours and 6 minutes to transcribe the video. From my limited use of it so far, it seems to create pretty high quality transcriptions, at least with the Spanish content I've given it.

The initial way I'm thinking of using it is to transcribe YouTube videos of songs that have custom captions built into imagery already. In that case the captions will be the canonical source of truth for the transcriptions, but it will save me some time to let whisper do a first pass at the transcription. Then I'll make manual corrections while I'm studying the videos. I can report back afterwards on the level of accuracy, in case people are interested.

As far as the accuracy goes for the transcription of the 48 minute audio book reading, I don't have an easy way to measure it comprehensively, but here's a comparison of two randomly selected paragraphs (about 10 minutes into the track):

Transcribed by whisper:
¡Viejo! ¿Y cómo sigue tu niño Juanito? Se me figuró algo enfermizo la última vez
que lo vi en el camino tirando piedras a los chinos y no parecía interesarle eso en gran manera.
Ayer pasó por aquí una tropa de ellos ahogados en el río y pensé en Juanito.
¡Oh! ¿Cómo los echaría de menos? ¡Tal vez estorbaremos si está enfermo!

Visiblemente afectado, no sólo por este cuadro patético de la privación de Juanito,
sino también por tan circunspecta delicadeza, se apresuró el padre a asegurarle que Juanito
estaba mejor y que un poco de broma quizá le mejoraría a algún tanto.

From the epub for the book that is being read in the video:
Viejo, ¿y cómo sigue tu niño Juanito? Se me figuró algo enfermizo la última vez
que lo vi en el camino tirando piedras a los chinos, y no parecía interesarle eso en gran manera.
Ayer pasó por aquí una tropa de ellos, ahogados en el río, y pensé en Juanito.
¡Oh! ¡cómo los echaría de menos! ¿Tal vez estorbaremos si está enfermo?

Visiblemente afectado, no sólo por este cuadro patético de la privación de Juanito,
sino también por tan circunspecta delicadeza, se apresuró el padre a asegurarle que Juanito
estaba mejor y que un poco de broma quizá le mejoraría algún tanto.
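
One rough way to put a number on a comparison like this is a word error rate: the word-level edit distance between the two texts divided by the number of words in the reference. The sketch below is a plain-Python version under some assumptions of my own: punctuation and capitalisation are ignored, and the function names are made up for the example. Because punctuation is ignored, it only picks up word differences such as "a algún" vs "algún", not the swapped question and exclamation marks.

```python
import re
import unicodedata


def words(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    text = unicodedata.normalize("NFC", text).lower()
    return re.findall(r"\w+", text)


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference word count."""
    ref, hyp = words(reference), words(hypothesis)
    if not ref:
        raise ValueError("reference text contains no words")
    # prev[j] is the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, substitution))
        prev = cur
    return prev[-1] / len(ref)


epub = "estaba mejor y que un poco de broma quizá le mejoraría algún tanto."
whisper_text = "estaba mejor y que un poco de broma quizá le mejoraría a algún tanto."
print(round(word_error_rate(epub, whisper_text), 3))  # 0.077 (1 extra word / 13)
```

Running it over the full transcript and the full epub text would give a single figure for the whole recording, provided the two texts cover the same passage.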
Last edited by elAmericanoTranquilo on Sat Dec 03, 2022 5:17 am, edited 1 time in total.
1 x


Re: Open AI - Whisper Translation Tool

Postby ryanheise » Mon Oct 10, 2022 2:40 am

elAmericanoTranquilo wrote:Today I did a test using whisper to transcribe a 48 minute YouTube Spanish audio book that I downloaded from YouTube (De como San Nicolás llego a Simpson's Bar - Bret Harte - Leído por Victor Villarraza).

I have a 2021 MacBook Air M1 - the entry level model. It has a 7 core GPU, but there are definitely faster machines available. It took whisper 2 hours and 6 minutes to transcribe the video.


Unfortunately, unless you're using a GPU with "CUDA" support, that GPU won't be used. So in your case, the 7 core GPU will likely be ignored and it will just use your CPU.

I also assume you were testing the default "small" model which is not the most accurate model? Have you tried the "medium" or "large" models? Note that these require 5GB and 10GB of VRAM respectively and significantly more compute time.
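
As a rough guide for picking a model size from the VRAM figures above, here is a small sketch. The 5GB and 10GB figures are the ones quoted in this post; the values for tiny, base and small (about 1GB, 1GB and 2GB) are approximate figures I am assuming here, and the function name is made up.

```python
# Approximate VRAM needed per model, in GB, ordered smallest to largest.
# medium/large come from the post above; tiny/base/small are assumed values.
VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}


def largest_model_that_fits(available_vram_gb):
    """Return the largest model whose approximate VRAM need fits, or None."""
    fitting = [name for name, need in VRAM_GB.items() if need <= available_vram_gb]
    # The dict is ordered by size, so the last fitting entry is the largest.
    return fitting[-1] if fitting else None


print(largest_model_that_fits(6))   # medium
print(largest_model_that_fits(12))  # large
```

These are only ballpark thresholds; actual memory use also depends on what else is running on the card.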
3 x


Re: Open AI - Whisper Translation Tool

Postby elAmericanoTranquilo » Mon Oct 10, 2022 3:35 am

ryanheise wrote:
elAmericanoTranquilo wrote:Today I did a test using whisper to transcribe a 48 minute YouTube Spanish audio book that I downloaded from YouTube (De como San Nicolás llego a Simpson's Bar - Bret Harte - Leído por Victor Villarraza).

I have a 2021 MacBook Air M1 - the entry level model. It has a 7 core GPU, but there are definitely faster machines available. It took whisper 2 hours and 6 minutes to transcribe the video.


Unfortunately, unless you're using a GPU with "CUDA" support, that GPU won't be used. So in your case, the 7 core GPU will likely be ignored and it will just use your CPU.

I also assume you were testing the default "small" model which is not the most accurate model? Have you tried the "medium" or "large" models? Note that these require 5GB and 10GB of VRAM respectively and significantly more compute time.

Ah I see, yes I just used the defaults. Running the small model didn't use a noticeable amount of CPU for me - I was blissfully multitasking on other stuff while it was running; when I used top to check CPU usage it showed a system load of around 3... For what I'm trying to do with it so far, the small model seems adequate, but it's good to know that the larger model sizes are available.
2 x

JævligFaen
Posts: 5
Joined: Tue Oct 11, 2022 9:24 am
Languages: English (N), Portuguese-PT (B1)
x 12

Re: Open AI - Whisper Translation Tool

Postby JævligFaen » Thu Oct 13, 2022 8:53 pm

I tried it for Portuguese.

The tiny model detected it as Polish, so that didn't work.
Large model detected it correctly as Portuguese, but ended with the message "killed"
Finally I tried medium, and it worked. It took a while but it was accurate.
Using an Nvidia GTX 1060.
3 x


Re: Open AI - Whisper Translation Tool

Postby ryanheise » Fri Oct 14, 2022 8:06 am

I am actually quite grateful that OpenAI has shared this technology with the world, and I'm excited about what this means for the next generation of language learning apps. The tiny and base models are small enough to be embedded into mobile apps, and the medium and large models are now cost effective enough to build server apps out of.

By the way, an A100 chip ($10,000 USD) would be able to transcribe using the large model at about 100x real time. On the other end of the spectrum, the RTX 3060 ($400 USD) would run the large model at around 2x real time and the medium model at around 10x real time.
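
To turn those multiples into wall-clock time, divide the audio length by the speed. A quick sketch, using the 48 minute audio book from earlier in the thread as the example input (the function names are made up, and the speed figures are just the rough ones quoted above):

```python
def speed_vs_real_time(audio_minutes, elapsed_minutes):
    """How many minutes of audio get processed per minute of wall-clock time."""
    if elapsed_minutes <= 0:
        raise ValueError("elapsed_minutes must be positive")
    return audio_minutes / elapsed_minutes


def estimated_minutes(audio_minutes, speed):
    """Wall-clock minutes needed at a given multiple of real time."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    return audio_minutes / speed


# 48 minutes of audio at the rough speeds quoted above.
print(estimated_minutes(48, 100))  # 0.48 minutes (A100, large model)
print(estimated_minutes(48, 2))    # 24.0 minutes (RTX 3060, large model)
print(estimated_minutes(48, 10))   # 4.8 minutes (RTX 3060, medium model)

# For comparison, the M1 run reported earlier: 48 minutes of audio in 126 minutes.
print(round(speed_vs_real_time(48, 126), 2))  # 0.38, i.e. slower than real time
```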

JævligFaen wrote:I tried it for Portuguese.

The tiny model detected it as Polish, so that didn't work.
Large model detected it correctly as Portuguese, but ended with the message "killed"
Finally I tried medium, and it worked. It took a while but it was accurate.
Using an Nvidia GTX 1060.


Although not the most accurate, you can still try the tiny model by forcing it to use Portuguese:


whisper audio.mp3 --model tiny --language pt --task transcribe
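
If you want to run the same command over several files or settings, one option is to build the argument list in Python and hand it to `subprocess`. This is a sketch, not an official wrapper: the function name is made up, the list of model names is limited to the sizes mentioned in this thread, and it only uses the `--model`, `--language` and `--task` options shown above.

```python
import shlex

# Model sizes mentioned in this thread.
MODELS = ("tiny", "base", "small", "medium", "large")
TASKS = ("transcribe", "translate")


def whisper_command(audio_path, model="small", language=None, task="transcribe"):
    """Build the argument list for the whisper command line tool."""
    if model not in MODELS:
        raise ValueError(f"unknown model: {model}")
    if task not in TASKS:
        raise ValueError(f"unknown task: {task}")
    cmd = ["whisper", audio_path, "--model", model, "--task", task]
    if language:
        # Forcing the language skips automatic language detection.
        cmd += ["--language", language]
    return cmd


cmd = whisper_command("audio.mp3", model="tiny", language="pt")
print(shlex.join(cmd))
# whisper audio.mp3 --model tiny --task transcribe --language pt
```

With whisper installed, `subprocess.run(cmd, check=True)` would then run it.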


Comparing with Google, it's amazing that Google was able to compress one of its high quality models down to a comparably small size, which is what they use in their Pixel phones to do speech recognition on device. The appeal of that is that you can talk to your phone without any chance that Google is spying on what you're saying. On non-Pixel devices, when you dictate to your phone, your voice is usually sent to Google's servers to do the transcription on their highest quality model. You can read about their tiny model here. That said, Whisper's large model actually far outperforms Google's best large model in terms of word accuracy.
4 x

TopDog_IK
Yellow Belt
Posts: 80
Joined: Thu Aug 04, 2022 4:21 am
Languages: English (N), German (B2/C1)
x 79

Re: Open AI - Whisper Translation Tool

Postby TopDog_IK » Tue Oct 18, 2022 11:26 pm

We've been testing Whisper with some short German anime dub and reality TV audio clips. The results from the Large model are very impressive, even with extremely fast speech, dominant background music, etc. In some cases, there are zero mistakes. But other content fails miserably.
2 x


Re: Open AI - Whisper Translation Tool

Postby TopDog_IK » Sat Oct 22, 2022 5:33 pm

If anyone is interested, you can try out Whisper on youtube videos here: https://huggingface.co/spaces/jeffistyp ... -Whisperer
0 x

tommus
Blue Belt
Posts: 957
Joined: Sat Jul 04, 2015 3:59 pm
Location: Kingston, ON, Canada
Languages: English (N), French (B2), Dutch (B2)
x 1937

Re: Open AI - Whisper Translation Tool

Postby tommus » Sun Oct 23, 2022 7:08 pm

I am totally amazed how well Whisper works for Dutch speech-to-text. It works much, much better than Google Translate, for example. It works great even with lots of music in the background, and it works great switching between different speakers in a conversation, which is something GT does poorly. Most impressively, it puts in great punctuation: periods at the end of sentences, capital letters at the start of sentences and on proper nouns, and correct letters and numbers. It handles colloquial speech and very fast speech accurately. And it produces excellent subtitles that not only are correct but are timed perfectly. Before seeing Whisper in action, I would not have thought this level of performance was possible.

I am running Whisper on a Windows 11 desktop that has a solid state drive and an Nvidia CUDA GPU. It does speech-to-text at about twice talking speed. I had some trouble installing it; what worked for me was to install Anaconda first, run the Anaconda command prompt using "Run as Administrator", and then use Python and add PyTorch and ffmpeg, as described on this web page:

How to Run OpenAI’s Whisper Speech Recognition Model

Thanks so much to the developers of Whisper. This is going to provide a very significant boost to my improvement in Dutch, particularly conversational and colloquial Dutch where it is difficult to find good quality audio with matching text. Again, from what I have seen from Whisper so far, it is quite amazing.
5 x
Dutch: 01 September -> 31 December 2020
Watch 1000 Dutch TV Series Videos : 40 / 1000

