
Google's New TTS Can Act, But Can It Handle 5 Minutes?

Apr 15, 2026 · Dishant Sharma · 6 min read


A Reddit user on r/Bard put it plainly: "i generate 4-5 minute audio files and the new 3.1 Flash TTS gets extremely distorted after a minute or two." He compared it directly to 2.5 Flash TTS. Same problem. Tinny, echo-y, and sped up towards the end.

The older 2.5 Pro TTS, according to him, handles the same content without any of these issues.

Google announced Gemini 3.1 Flash TTS on April 15, 2026. A new text-to-speech model. 70+ languages. Natural-language audio tags to control style, pacing, and delivery. Multi-speaker dialogue support. SynthID watermarking baked in.

On paper, it sounds like a serious step forward. In practice, people who actually push TTS to its limits have a different story.


What Gemini 3.1 Flash TTS Actually Does

Here's what Google shipped:

  • Audio tags that let you direct vocal style, tone, pace, and accent using plain text instructions
  • 70+ languages with accent and dialect control built in
  • Multi-speaker dialogue handled natively, no separate API calls per voice
  • SynthID watermarking on all generated audio, imperceptible to listeners

The model scored 1,211 on the Artificial Analysis TTS Elo leaderboard. That is Google's best number yet. It is available in preview through the Gemini API, Google AI Studio, and Vertex AI. Workspace users get it through Google Vids.

The big selling point is the audio tags. Instead of fiddling with static configs or slider bars, you write things like "speak softly, then build intensity" and the model follows along. Think of it like stage directions for an AI narrator. Scene-level control. Speaker-level control.

Google's own documentation shows examples like directing different characters in a podcast to have different delivery styles, all within a single generation call. No stitching files together. No mismatched pacing between speakers.
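To make the single-call idea concrete, here is a minimal sketch of how a multi-speaker script with inline stage directions might be assembled into one prompt. The direction phrasing and speaker layout below are illustrative assumptions in the spirit of Google's examples, not the documented tag spec:

```python
# Sketch: build one multi-speaker TTS prompt carrying scene-level and
# speaker-level delivery directions as plain text. The formatting here
# is an illustrative assumption, not Google's exact documented syntax.

def build_dialogue_prompt(scene_direction, turns):
    """turns: list of (speaker, style, line) tuples."""
    lines = [f"TTS the following conversation. {scene_direction}"]
    for speaker, style, text in turns:
        # Per-speaker delivery notes ride along inline with each turn.
        lines.append(f"{speaker} ({style}): {text}")
    return "\n".join(lines)

prompt = build_dialogue_prompt(
    "Two podcast hosts, casual pacing.",
    [
        ("Ana", "speak softly, then build intensity", "Welcome back to the show."),
        ("Raj", "upbeat, quick pace", "We have a lot to cover today."),
    ],
)
print(prompt)
```

The point is that everything, scene direction, speakers, and per-speaker style, travels in a single generation request, which is why no post-hoc stitching or pacing alignment is needed.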


The Hype vs The Reality

Google's blog post calls this "the next generation of expressive AI speech." The MarkTechPost headline went with "A New Benchmark in Expressive and Controllable AI Voice." Robert Scoble posted about SynthID watermarking to his 5.7 million followers. HyperAI covered it as "expressive AI speech delivery."

On Reddit, the reception was more muted.

The top post on r/Bard about the launch got 26 upvotes and 7 comments. One commenter said: "i seriously dont see difference in this since 2.5 flash live, the quality is the same and they dont give other options like cloning."

That word, "cloning," keeps coming up. Google is not offering voice cloning with this model. Other TTS providers like ElevenLabs and PlayHT have had voice cloning for a while now. For people building products that need a specific voice, not having clone support is a real gap.

"i generate 4-5 minute audio files and the new 3.1 Flash TTS gets extremely distorted after a minute or two."

The distortion problem is the most concrete complaint. If your use case is short clips, social media content, or voice agent responses, this might not matter. But if you need anything over two minutes, the model apparently starts falling apart.

Here is what I find interesting, though. The same person said 2.5 Pro TTS does not have this issue. So Google knows how to build stable long-form TTS. The Flash tier, designed for speed and cost, seems to be the one cutting corners on output quality at longer durations.

That is a pattern. Google's Flash models are fast and cheap but always leave something on the table compared to Pro. Flash TTS is no different.

Flash is the cheap option. Pro is the expensive one. Developers naturally start with Flash, test it, and hit these walls. Then they have to decide whether to eat the cost of Pro or switch to a completely different provider. That friction adds up.


The Naming Problem Nobody Talks About

Someone on r/GoogleGeminiAI posted about how hard it is to keep track of Google's model lineup now. 3.1, 2.5, Flash, Flash-Lite, Live, TTS, image models. The naming alone is exhausting.

The post got 3 upvotes and 2 comments. One person wrote: "Gemini makes more sense when you stop viewing it as one clean ladder and start viewing it as a broader family of tools."

That is polite. What they really mean is: Google has too many models and the names do not help you understand what each one does.

"Flash" used to mean "fast and cheap." Now there is Flash, Flash Live, and Flash TTS. Three different things with overlapping names. "Live" is a speech-to-speech model for real-time conversations. "TTS" is text-to-speech. "Flash" by itself is the base model. All three exist at version 3.1 right now.

If you are a developer picking a model for the first time, good luck figuring out which one you need from the name alone. You will spend 20 minutes reading docs just to understand the difference.

Compare that to ElevenLabs. They have "Multilingual v2" and "Turbo v2." Two models. Clear names. You know immediately what you are getting.

I tried explaining this to a friend who works at a startup. He just wanted to add TTS to his app. After 30 minutes of reading Google's model docs, he went with ElevenLabs instead. Not because Google was worse, but because he could not figure out which Google model to use.


Who Should Actually Care

If you are building a voice agent that responds in short bursts, Flash TTS is worth trying. The audio tags are genuinely useful for that. Telling an AI customer service bot to "sound empathetic" or "speak calmly" through a text tag is simpler than what most teams are doing today.

Companies like Verizon and Home Depot are already testing Gemini voice models in their workflows, according to Google's own blog post from the Flash Live launch. The use case is clear: enterprise voice agents that need to handle short interactions at scale.

If you are generating podcasts, audiobooks, or any content over two minutes, wait. The distortion issues need to get fixed first. And honestly, 2.5 Pro TTS is reportedly more stable for long-form content right now.

If you need voice cloning, this is not the tool. Go to ElevenLabs or PlayHT. That capability gap is real and Google shows no sign of closing it with this release.


The Bigger Picture

Google shipped something technically impressive with Gemini 3.1 Flash TTS. The benchmark numbers are real. The audio tag system is a good idea that makes TTS more controllable. Multi-speaker dialogue is genuinely useful for anyone building conversational audio products.

SynthID watermarking is the quiet feature that actually matters most at scale. As AI-generated audio gets harder to distinguish from real speech, having imperceptible watermarks is not optional. It is a requirement. Google gets that right.

But the person on r/Bard generating 5-minute audio files is the one who matters here. Not the leaderboard score. Not the blog post title. The person who tried to use it for real work and found out it distorts after a minute.

Google will probably fix that. They usually do. But right now, on day one, the gap between the announcement and the user experience is exactly what you would expect from a preview.

And the voice cloning thing is not going away. Every week that passes without it is a week where ElevenLabs and the open-source TTS community pull further ahead on the feature that creators actually care about.

Audio tags are nice. But sounding like the specific voice you need matters more.
