Gemini 3.1 Flash TTS

Simon Willison's Blog Models

Summary

Google released Gemini 3.1 Flash TTS, a new text-to-speech model accessible via the Gemini API that supports advanced prompt-based control for detailed voice direction, accents, and speaking styles. The model enables sophisticated audio generation including multi-speaker conversations and character-specific vocal performances.

Cached at: 04/20/26, 08:27 AM

# Gemini 3.1 Flash TTS

Source: https://simonwillison.net/2026/Apr/15/gemini-31-flash-tts/

15th April 2026 - Link Blog

**[Gemini 3.1 Flash TTS](https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-tts/)**. Google released Gemini 3.1 Flash TTS today, a new text-to-speech model that can be directed using prompts.

It's presented via the standard Gemini API using `gemini-3.1-flash-tts-preview` as the model ID, but can only output audio files.

The [prompting guide](https://ai.google.dev/gemini-api/docs/speech-generation#transcript-tags) is surprising, to say the least. Here's their example prompt to generate just a few short sentences of audio:

```
# AUDIO PROFILE: Jaz R.
## "The Morning Hype"

## THE SCENE: The London Studio
It is 10:00 PM in a glass-walled studio overlooking the moonlit London skyline, but inside, it is blindingly bright. The red "ON AIR" tally light is blazing. Jaz is standing up, not sitting, bouncing on the balls of their heels to the rhythm of a thumping backing track. Their hands fly across the faders on a massive mixing desk. It is a chaotic, caffeine-fueled cockpit designed to wake up an entire nation.

### DIRECTOR'S NOTES
Style:
* The "Vocal Smile": You must hear the grin in the audio. The soft palate is always raised to keep the tone bright, sunny, and explicitly inviting.
* Dynamics: High projection without shouting. Punchy consonants and elongated vowels on excitement words (e.g., "Beauuutiful morning").

Pace: Speaks at an energetic pace, keeping up with the fast music. Speaks with A "bouncing" cadence. High-speed delivery with fluid transitions — no dead air, no gaps.

Accent: Jaz is from Brixton, London

### SAMPLE CONTEXT
Jaz is the industry standard for Top 40 radio, high-octane event promos, or any script that requires a charismatic Estuary accent and 11/10 infectious energy.

#### TRANSCRIPT
[excitedly] Yes, massive vibes in the studio!
You are locked in and it is absolutely popping off in London right now. If you're stuck on the tube, or just sat there pretending to work... stop it. Seriously, I see you. [shouting] Turn this up! We've got the project roadmap landing in three, two... let's go!
```

Here's what I got using that example prompt:

[audio clip]

Then I modified it to say "Jaz is from Newcastle" and "... requires a charismatic Newcastle accent" and got this result:

[audio clip]

Here's Exeter, Devon for good measure:

[audio clip]

I [had Gemini 3.1 Pro](https://gemini.google.com/share/dd0fba5a83c4) vibe code this UI for trying it out:

![Screenshot of a "Gemini 3.1 Flash TTS" web application interface. At the top is an "API Key" field with a masked password. Below is a "TTS Mode" section with a dropdown set to "Multi-Speaker (Conversation)". "Speaker 1 Name" is set to "Joe" with "Speaker 1 Voice" set to "Puck (Upbeat)". "Speaker 2 Name" is set to "Jane" with "Speaker 2 Voice" set to "Kore (Firm)". Under "Script / Prompt" is a tip reading "Tip: Format your text as a script using the Exact Speaker Names defined above." The script text area contains "TTS the following conversation between Joe and Jane:\n\nJoe: How's it going today Jane?\nJane: [yawn] Not too bad, how about you?" A blue "Generate Audio" button is below. At the bottom is a "Success!" message with an audio player showing 00:00 / 00:06 and a "Download WAV" link.](https://static.simonwillison.net/static/2026/gemini-flash-tts.jpg)
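
As a rough illustration of what a UI like the one in the screenshot might send, here is a minimal sketch that builds a JSON request body for the multi-speaker case. It is not taken from the post: the helper name `build_tts_request` is hypothetical, and the field names (`responseModalities`, `speechConfig`, `multiSpeakerVoiceConfig`, `prebuiltVoiceConfig`) are assumed to match the Gemini API's `generateContent` request shape documented for earlier Gemini TTS models. Only the model ID, the speaker names, the voice names and the script come from the text above. The sketch makes no network call.

```python
import json

# Model ID quoted in the post.
MODEL_ID = "gemini-3.1-flash-tts-preview"

# Script text as shown in the screenshot of the UI.
SCRIPT = (
    "TTS the following conversation between Joe and Jane:\n\n"
    "Joe: How's it going today Jane?\n"
    "Jane: [yawn] Not too bad, how about you?"
)


def build_tts_request(prompt, speakers):
    """Build a generateContent-style body that asks for audio output.

    `speakers` maps a speaker name used in the script to a prebuilt
    voice name. The field names are an ASSUMPTION carried over from
    the request shape documented for earlier Gemini TTS models.
    """
    if not speakers:
        raise ValueError("at least one speaker/voice pair is required")

    def voice(voice_name):
        return {"prebuiltVoiceConfig": {"voiceName": voice_name}}

    if len(speakers) == 1:
        # Single-speaker mode: one voice for the whole prompt.
        (voice_name,) = speakers.values()
        speech_config = {"voiceConfig": voice(voice_name)}
    else:
        # Multi-speaker mode: one voice per named speaker in the script.
        speech_config = {
            "multiSpeakerVoiceConfig": {
                "speakerVoiceConfigs": [
                    {"speaker": name, "voiceConfig": voice(voice_name)}
                    for name, voice_name in speakers.items()
                ]
            }
        }

    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": speech_config,
        },
    }


if __name__ == "__main__":
    body = build_tts_request(SCRIPT, {"Joe": "Puck", "Jane": "Kore"})
    print(json.dumps(body, indent=2))
```

Under the same assumptions, this body would be POSTed to the `generateContent` endpoint for `MODEL_ID` along with an API key. Swapping the accent, as in the Newcastle and Exeter experiments, only changes the `prompt` string; the voice configuration stays the same.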
