Introducing Whisper

OpenAI Blog

Summary

OpenAI introduces Whisper, an end-to-end encoder-decoder Transformer trained on a large, diverse audio dataset for robust multilingual speech recognition, language identification, and speech-to-English translation. Evaluated zero-shot across diverse datasets, Whisper makes 50% fewer errors than specialized models, and it outperforms the supervised state of the art on CoVoST2 speech-to-English translation despite not being fine-tuned to any specific dataset.

# Introducing Whisper

Source: [https://openai.com/index/whisper/](https://openai.com/index/whisper/)

The Whisper architecture is a simple end-to-end approach, implemented as an encoder-decoder Transformer. Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.

Other existing approaches frequently use smaller, more closely paired audio-text training datasets,[1](https://openai.com/index/whisper/#citation-bottom-1),[2](https://openai.com/index/whisper/#citation-bottom-2),[3](https://openai.com/index/whisper/#citation-bottom-3) or use broad but unsupervised audio pretraining.[4](https://openai.com/index/whisper/#citation-bottom-4),[5](https://openai.com/index/whisper/#citation-bottom-5),[6](https://openai.com/index/whisper/#citation-bottom-6) Because Whisper was trained on a large and diverse dataset and was not fine-tuned to any specific one, it does not beat models that specialize in LibriSpeech performance, a famously competitive benchmark in speech recognition. However, when we measure Whisper's zero-shot performance across many diverse datasets, we find it is much more robust and makes 50% fewer errors than those models.

About a third of Whisper's audio dataset is non-English, and it is alternately given the task of transcribing in the original language or translating to English. We find this approach is particularly effective at learning speech-to-text translation, and it outperforms the supervised SOTA on CoVoST2 to-English translation zero-shot.
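
The pipeline described above maps directly onto OpenAI's open-source `whisper` Python package ([https://github.com/openai/whisper](https://github.com/openai/whisper)). A minimal sketch, assuming the package is installed and that a local file `speech.mp3` exists (the path is a placeholder):

```python
import whisper

# Load one of the released checkpoints ("base" here; larger sizes trade speed for accuracy).
model = whisper.load_model("base")

# Load the waveform and pad/trim it to the 30-second window the encoder expects.
audio = whisper.load_audio("speech.mp3")
audio = whisper.pad_or_trim(audio)

# Convert the waveform into the log-Mel spectrogram fed to the encoder.
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# The special-token interface at work: first identify the spoken language...
_, probs = model.detect_language(mel)
print("Detected language:", max(probs, key=probs.get))

# ...then decode. task="transcribe" keeps the original language;
# task="translate" produces English output instead.
options = whisper.DecodingOptions(task="transcribe", fp16=False)
result = whisper.decode(model, mel, options)
print(result.text)
```

The one-call wrapper `model.transcribe("speech.mp3")` handles chunking and timestamps automatically; the step-by-step form above is shown only to make the encoder input and the task tokens visible.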

Similar Articles

vaibhavs10/incredibly-fast-whisper

Replicate Explore

A highly optimized version of OpenAI's Whisper Large v3 using Transformers, Optimum, and Flash Attention 2, capable of transcribing 150 minutes of audio in under 2 minutes on Replicate.
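
The setup this blurb describes corresponds to a standard recipe in the Hugging Face `transformers` library; a rough sketch, assuming a CUDA GPU with the `flash-attn` package installed (the file path and batch size are illustrative):

```python
import torch
from transformers import pipeline

# Whisper Large v3 in half precision with Flash Attention 2 enabled.
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0",
    model_kwargs={"attn_implementation": "flash_attention_2"},
)

# Chunked, batched inference is what makes long recordings fast.
out = pipe(
    "long_recording.mp3",
    chunk_length_s=30,
    batch_size=24,
    return_timestamps=True,
)
print(out["text"])
```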

Introducing ChatGPT and Whisper APIs

OpenAI Blog

OpenAI released the ChatGPT (GPT-3.5 Turbo) and Whisper APIs for developers, featuring a 90% cost reduction since December and enabling integration into third-party applications. The announcement includes early-adopter examples from Snap, Quizlet, Instacart, Shop, and Speak.
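
For reference, a transcription call against the hosted Whisper API with the current `openai` Python SDK looks roughly like this (the file name is a placeholder):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Send a local audio file to the hosted Whisper model.
with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )
print(transcript.text)
```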

Advancing voice intelligence with new models in the API

OpenAI Blog

OpenAI has announced three new voice models in its API: GPT-Realtime-2 with advanced reasoning, GPT-Realtime-Translate for live multilingual translation, and GPT-Realtime-Whisper for streaming transcription, aiming to enable more natural and action-oriented voice applications.

Introducing next-generation audio models in the API

OpenAI Blog

OpenAI introduced next-generation audio models for the API, including improved speech-to-text (gpt-4o-transcribe, gpt-4o-mini-transcribe) and customizable text-to-speech models that enable developers to build more intelligent and expressive voice agents with enhanced accuracy across challenging scenarios.
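
A brief sketch of the newer models with the same SDK, using the `gpt-4o-mini-transcribe` and `gpt-4o-mini-tts` model names from that announcement (file names and voice are illustrative):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Speech-to-text with the newer transcription model.
with open("call.wav", "rb") as f:
    text = client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe",
        file=f,
    ).text

# Customizable text-to-speech: `instructions` steers the delivery style.
speech = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="coral",
    input=text,
    instructions="Speak calmly and clearly.",
)
speech.write_to_file("reply.mp3")
```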