Google expands Gemini 2.5 with native voice and TTS tools

At its I/O event, Google showcased new native audio dialogue and generation capabilities for its Gemini 2.5 AI model. The additions are intended to make voice interaction feel natural across Google's products and in more than 24 languages.

Google has already built Gemini 2.5 into products such as NotebookLM’s Audio Overviews and Project Astra. The model is designed for real-time audio conversation: it interprets and produces speech with natural tone, style, and awareness of context.

Gemini 2.5 Native Audio Dialogue Features

Fluent and Natural Interaction: Provides low-latency voice exchanges with natural rhythm and appropriate emotional expression.
Speech Customization: Users can steer speech delivery with natural language prompts, adjusting accent and tone or even asking for whispered output.
External Tool Integration: Pulls real-time information into the conversation from tools such as Google Search or from custom tools built by developers.
Environmental Filtering: Distinguishes relevant speech from background noise or irrelevant audio, responding only when appropriate.
Multimedia Comprehension: Analyzes and discusses content from live video feeds or shared screens.
Language Flexibility: Supports more than 24 languages and can mix several languages within a single interaction.
Emotion-Responsive Dialogue: Adapts responses based on the user’s vocal tone, recognizing nuances in speech delivery.
Enhanced Reasoning: Draws on the model’s improved reasoning to keep conversations coherent, especially on complex tasks.

Text-to-Speech (TTS) Customization

Gemini 2.5 also gives users fine-grained control over generated speech:

Expressive narration for poetry, broadcasts, or stories, with a choice of emotions and accents.
Adjustable speaking pace and pronunciation for clearer audio.
Two-speaker dialogues, such as conversational summaries.
Audio output in more than 24 languages for multilingual content.
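
To make the two-speaker option more concrete, here is a minimal sketch of what a request body for such a dialogue might look like. The field names (multiSpeakerVoiceConfig, speakerVoiceConfigs, prebuiltVoiceConfig) and the voice names "Kore" and "Puck" are assumptions modelled on the public Gemini API documentation and should be checked against the current reference; build_dialogue_request is a hypothetical helper, not part of any SDK. The style instruction on the first line of the script illustrates steering delivery in plain language.

```python
import json


def build_dialogue_request(script, voices):
    """Build a JSON-serialisable body for a two-speaker text-to-speech call.

    `voices` maps each speaker name used in the script to a prebuilt voice.
    Field names are assumptions based on the public REST documentation.
    """
    speaker_configs = [
        {
            "speaker": speaker,
            "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": voice}},
        }
        for speaker, voice in voices.items()
    ]
    return {
        "contents": [{"parts": [{"text": script}]}],
        "generationConfig": {
            # Ask for audio rather than text in the response.
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "multiSpeakerVoiceConfig": {"speakerVoiceConfigs": speaker_configs}
            },
        },
    }


# The first line is a natural-language style instruction; the rest is the script.
script = (
    "Read this as a relaxed podcast recap.\n"
    "Host: So what changed in this release?\n"
    "Guest: Mostly the audio features."
)
body = build_dialogue_request(script, {"Host": "Kore", "Guest": "Puck"})
print(json.dumps(body, indent=2))
```

The sketch only builds the payload; nothing is sent over the network, so it can be inspected or adapted before being wired into a real client.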

Developer Options

Google offers two Gemini 2.5 models in preview for audio development:

Gemini 2.5 Pro Preview: Aimed at the highest audio quality, suited to complex and demanding projects.
Gemini 2.5 Flash Preview: Aimed at fast, cost-efficient audio generation for everyday applications.

Either model can generate audio for applications such as podcasts, video games, and public announcements.
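
As a rough illustration of choosing between the two previews, the sketch below prepares (but does not send) an HTTP request for single-speaker speech generation. The model identifiers, the v1beta endpoint path, the x-goog-api-key header, and the voice name "Kore" are assumptions drawn from the public Gemini API documentation; preview identifiers in particular tend to change, so verify them before use. build_tts_request and the "quality"/"speed" labels are hypothetical names introduced here.

```python
import json
import urllib.request

# Assumed preview model identifiers; check the current model list before relying on them.
TTS_MODELS = {
    "quality": "gemini-2.5-pro-preview-tts",
    "speed": "gemini-2.5-flash-preview-tts",
}
BASE_URL = "https://generativelanguage.googleapis.com/v1beta/models"


def build_tts_request(text, api_key, priority="speed", voice="Kore"):
    """Return an unsent urllib Request for a single-speaker speech call."""
    if priority not in TTS_MODELS:
        raise ValueError(f"priority must be one of {sorted(TTS_MODELS)}")
    body = {
        "contents": [{"parts": [{"text": text}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": voice}}
            },
        },
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/{TTS_MODELS[priority]}:generateContent",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-goog-api-key": api_key},
        method="POST",
    )


# Building the request performs no network I/O; pass it to urllib.request.urlopen to send it.
request = build_tts_request("Say warmly: welcome aboard.", api_key="YOUR_API_KEY")
print(request.full_url)
```

Keeping the model choice behind a single parameter makes it easy to prototype on the cheaper model and switch to the higher-quality one for final output.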

Safety and Transparency

Google says it assessed potential risks throughout the development of Gemini 2.5’s audio features. Safety measures were refined through internal and external testing, including red teaming. All audio generated by the models is embedded with SynthID, Google’s watermarking technology, so that it can be identified as AI-generated.

Access for Developers

Developers can use Gemini 2.5’s audio capabilities through the Gemini API, in Google AI Studio or Vertex AI.

Interactive Voice Testing: Developers can experiment with real-time audio conversations using Gemini 2.5 Flash within the stream tab of Google AI Studio.
Speech Creation Tools: Gemini 2.5 Pro and Flash both support audio generation, which is available in the media generation tab of Google AI Studio.
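
For developers calling the API directly rather than using AI Studio, generated speech typically has to be unpacked and saved before it can be played. The sketch below assumes the response carries base64-encoded raw PCM audio (16-bit, mono, 24 kHz) under candidates → content → parts → inlineData; both the response shape and the audio format are assumptions based on the public documentation and should be confirmed against a real response. A stand-in response containing silence is used so the example runs without a network call; extract_audio and save_pcm_as_wav are hypothetical helper names.

```python
import base64
import wave


def extract_audio(response_json):
    """Decode base64 audio from a generateContent-style response dict (assumed shape)."""
    part = response_json["candidates"][0]["content"]["parts"][0]
    return base64.b64decode(part["inlineData"]["data"])


def save_pcm_as_wav(path, pcm_bytes, sample_rate=24000, channels=1, sample_width=2):
    """Wrap raw PCM samples in a WAV container so ordinary players can open them."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)  # bytes per sample; 2 means 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm_bytes)


# Stand-in response: 12,000 silent 16-bit samples (half a second at 24 kHz).
fake_response = {
    "candidates": [
        {
            "content": {
                "parts": [
                    {
                        "inlineData": {
                            "data": base64.b64encode(b"\x00\x00" * 12000).decode("ascii")
                        }
                    }
                ]
            }
        }
    ]
}

save_pcm_as_wav("speech.wav", extract_audio(fake_response))
```

If the real audio format differs from the assumed one, only the default arguments of save_pcm_as_wav need to change.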