Google rolls out Gemini 3.1 Flash TTS with better speech quality and developer controls

Google has introduced Gemini 3.1 Flash TTS, a new text-to-speech model focused on improving speech quality, controllability, and scalability. The model is part of the Gemini 3.1 Flash Audio family and is designed for developers, enterprises, and general users building AI-based speech applications.

Gemini 3.1 Flash Audio belongs to the Gemini series of multimodal models, supporting audio alongside other modalities such as text, images, and video.

Gemini 3.1 Flash TTS

Gemini 3.1 Flash TTS is based on Gemini 3 Pro and is designed specifically for generating speech from text inputs. The two variants in the Flash Audio family have the following input and output limits:

  • Gemini 3.1 Flash TTS
    • Input: Text up to 16K tokens
    • Output: Audio up to 32K tokens
  • Gemini 3.1 Flash Live
    • Inputs: Audio, images, video, and text up to 128K tokens
    • Outputs: Audio and text up to 64K tokens

These configurations enable both standalone text-to-speech generation and multimodal interactions through the Flash Live variant.
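For developers, the 16K-token input cap on the TTS variant means long scripts may need to be split before synthesis. The following is a minimal sketch of one way to do that, not Google's method: it assumes "16K" means 16,000 tokens and uses a rough four-characters-per-token estimate, and the names estimate_tokens and split_for_tts are hypothetical helpers. A real tokenizer will count differently.

```python
from typing import List

# Limit as reported in the article; "16K" is ASSUMED to mean 16,000 tokens.
TTS_INPUT_TOKEN_LIMIT = 16_000

# ASSUMPTION: roughly 4 characters per token, a common rule of thumb for
# English text. This is not the model's actual tokenizer.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Return a rough token estimate (ceiling of len / 4, minimum 1)."""
    return max(1, -(-len(text) // CHARS_PER_TOKEN))


def split_for_tts(text: str, limit: int = TTS_INPUT_TOKEN_LIMIT) -> List[str]:
    """Group paragraphs into chunks whose estimated size fits the limit.

    A single paragraph larger than the limit is kept as its own oversized
    chunk; splitting inside a paragraph is out of scope for this sketch.
    """
    chunks: List[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and estimate_tokens(candidate) > limit:
            # Adding this paragraph would exceed the budget: close the chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk could then be sent as a separate text-to-speech request. Splitting on paragraph boundaries keeps sentences intact, which should help the generated audio sound continuous when the clips are joined.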

Key features
  • Improved speech quality: Generates more natural and expressive speech output; achieved an Elo score of 1,211 on the Artificial Analysis TTS leaderboard
  • Cost and performance balance: Positioned in Artificial Analysis’ “most attractive quadrant,” indicating a balance between speech quality and cost
  • Audio tags for control: Allows users to guide tone, pacing, and delivery using natural language instructions embedded within text
  • Multi-speaker support: Supports dialogue between multiple speakers with distinct voice characteristics
  • Scene direction: Enables definition of context and interaction style to maintain consistent character behavior
  • Speaker-level controls: Supports assignment of audio profiles and adjustment of tone, accent, and pace
  • Inline expression changes: Allows voice style adjustments within a sentence using embedded tags
  • Developer control tools: Provides configurable controls in Google AI Studio, enabling detailed direction over speech output
  • Seamless API export: Allows configured voice settings to be exported as Gemini API code for reuse across applications
  • Multilingual support: Supports speech generation across more than 70 languages with localized control
  • Global scalability: Designed for deployment across different regions and use cases
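To illustrate how scene direction, multi-speaker dialogue, and inline audio tags might combine in a single text input, here is a minimal sketch. The square-bracket tag syntax, the "Scene:" header, and the "Speaker: text" layout are assumptions made for illustration, and Line and render_script are hypothetical names; the article does not specify the exact format the model expects.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Line:
    """One line of dialogue. The tag is a free-form delivery hint."""
    speaker: str
    text: str
    tag: Optional[str] = None  # e.g. "whispers"; tag names are hypothetical


def render_script(scene: str, lines: List[Line]) -> str:
    """Render a scene description and dialogue lines as one text prompt.

    ASSUMED format: a "Scene:" header, a blank line, then one
    "Speaker: [tag] text" entry per line of dialogue.
    """
    parts = [f"Scene: {scene}", ""]
    for line in lines:
        prefix = f"[{line.tag}] " if line.tag else ""
        parts.append(f"{line.speaker}: {prefix}{line.text}")
    return "\n".join(parts)


if __name__ == "__main__":
    script = render_script(
        "A late-night radio show",
        [
            Line("Host", "Welcome back.", tag="warm"),
            Line("Guest", "Glad to be here."),
        ],
    )
    print(script)
```

The rendered string would be passed as the text input to the model. Keeping the script as structured data rather than a hand-written string makes it easier to change one speaker's delivery without touching the rest of the dialogue.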

Early developers and enterprise users report improved controllability and more expressive output, particularly when using audio tags.

Safety

Audio generated using Gemini 3.1 Flash TTS includes SynthID watermarking, an embedded identifier that enables detection of AI-generated content and supports measures against misuse.

Availability

Gemini 3.1 Flash TTS is rolling out in preview:

  • Developers: Available via Gemini API and Google AI Studio
  • Enterprises: Available on Vertex AI
  • Workspace users: Available through Google Vids

