OpenAI has introduced three new audio models in its API that enable developers to build a new class of voice applications. These models are designed to make voice interactions more natural, context-aware, and capable of taking action in real time.
The three models—GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper—move voice systems beyond simple call-and-response into continuous, agent-like interactions that can listen, reason, translate, transcribe, and act during conversations.
New realtime audio models
GPT-Realtime-2
GPT-Realtime-2 is OpenAI’s first voice model with GPT-5-class reasoning, designed for live conversational use cases. It supports complex interactions in which the model can think, respond, and use tools while the conversation continues.
It is built for situations where responses, actions, and reasoning must happen together without interrupting the flow of speech.
Key capabilities
- Handles complex, multi-step voice requests in real time
- Maintains continuous conversational flow with contextual reasoning
- Uses tools during live conversations without breaking interaction
- Supports spoken preambles such as “let me check that” or “one moment while I look into it”
- Executes parallel tool calls with audible transparency (e.g., “checking your calendar”)
- Improves recovery behavior with natural fallback responses instead of silent failure
- Expands the context window from 32K to 128K tokens for longer sessions
- Better handling of domain-specific vocabulary, proper nouns, and technical terms
- Supports adjustable tone (calm, empathetic, or upbeat based on context)
- Adjustable reasoning levels (minimal, low, medium, high, xhigh), with low as the default; see the configuration sketch after this list
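The post does not publish the wire format for these controls, so the following is a minimal sketch that assumes GPT-Realtime-2 is reachable through the openai Python SDK's existing Realtime interface. The model id string, the check_calendar tool, and the "reasoning" session field are illustrative placeholders rather than confirmed parameter names.

```python
import asyncio

from openai import AsyncOpenAI


async def main() -> None:
    client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
    async with client.beta.realtime.connect(model="gpt-realtime-2") as conn:  # hypothetical id
        await conn.session.update(session={
            "modalities": ["audio", "text"],
            "instructions": "Say a short preamble like 'let me check that' before using a tool.",
            "tools": [{
                "type": "function",
                "name": "check_calendar",  # hypothetical tool
                "description": "Look up the user's calendar for a given day.",
                "parameters": {
                    "type": "object",
                    "properties": {"date": {"type": "string"}},
                    "required": ["date"],
                },
            }],
            "reasoning": {"effort": "low"},  # assumed field name; levels per the post, low is default
        })
        # ...stream microphone audio and consume server events here...


if __name__ == "__main__":
    asyncio.run(main())
```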
Performance improvements
- +15.2% on Big Bench Audio (audio intelligence) compared with GPT-Realtime-1.5, at the high reasoning setting
- +13.8% on Audio MultiChallenge (instruction following) at the xhigh reasoning setting
GPT-Realtime-Translate
GPT-Realtime-Translate enables real-time multilingual voice communication where speech is translated instantly while preserving meaning and pacing. It also supports live transcription alongside translation.
It is designed to maintain accuracy even in natural speech conditions such as interruptions, accent variations, or context switching.
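The announcement does not document the parameter names for translation sessions, so the snippet below is only a sketch of how a session might be configured in the Realtime API's session.update style; the "translation" block, the language code, and the model ids are assumptions.

```python
# Illustrative only: field names below are placeholders, not published parameters.
translate_session = {
    "modalities": ["audio", "text"],           # translated speech plus a live transcript
    "translation": {"target_language": "de"},  # assumed field; one of the 13 output languages
    "input_audio_transcription": {"model": "gpt-realtime-whisper"},  # hypothetical id, keeps a source transcript
}
# For example: await conn.session.update(session=translate_session) on a
# connection opened with model="gpt-realtime-translate" (also hypothetical).
```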
Key capabilities
- Supports 70+ input languages
- Outputs in 13 languages
- Real-time speech translation with preserved meaning and timing
- Live transcription alongside translated output
- Handles accents, regional pronunciation, and domain-specific language
- Maintains fluency during natural or interrupted speech
Use cases
- Customer support across languages
- Education and classrooms
- Cross-border communication and sales
- Media, events, and creator platforms
For example, Deutsche Telekom is testing real-time multilingual voice interactions in which users speak different languages and the system translates the conversation instantly, with low latency.
GPT-Realtime-Whisper
GPT-Realtime-Whisper is a streaming speech-to-text model designed for low-latency transcription. It converts spoken audio into text as it is being spoken, enabling real-time understanding and interaction.
It supports continuous transcription, making voice data usable immediately in workflows.
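If GPT-Realtime-Whisper reuses the input-audio transcription events that the Realtime API already defines, consuming captions could look like the sketch below; whether this model emits these exact event names is an assumption.

```python
# Sketch: render live captions from a streaming transcription session.
# "conn" is an async-iterable Realtime connection (e.g. from the openai SDK).
async def print_captions(conn) -> None:
    async for event in conn:
        if event.type == "conversation.item.input_audio_transcription.delta":
            print(event.delta, end="", flush=True)  # low-latency partial text
        elif event.type == "conversation.item.input_audio_transcription.completed":
            print()                                 # finalize the caption line
```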
Key capabilities
- Real-time transcription during speech
- Low-latency streaming captions
- Continuous understanding of live conversations
- Designed for responsive voice applications
Use cases
- Meeting captions and notes
- Classrooms and education tools
- Live broadcasts and events
- Customer support workflows
- Healthcare, recruiting, and sales systems
- Real-time summaries and follow-up generation during conversations
Voice as a software interface
OpenAI highlights voice as one of the most natural ways to interact with software. It allows users to complete tasks without typing, such as getting help while driving, changing travel plans on the move, or receiving support in their preferred language.
However, effective voice systems require more than fast responses. A capable voice agent must:
- Understand intent and maintain context
- Adapt to changing requests during conversation
- Use tools while continuing dialogue
- Recover smoothly from interruptions or failures
- Respond appropriately based on tone and situation
Together, these models move voice AI beyond simple back-and-forth interaction toward systems that complete tasks in real time while the conversation is still ongoing.
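As an illustration of the tool-use and recovery points above, the sketch below keeps the dialogue moving while a tool runs and falls back to a spoken explanation instead of silence when the tool fails. It follows today's Realtime API conventions (function_call_output items plus response.create); the run_tool dispatcher and the exact payloads are assumptions, not the published format for these models.

```python
import json


async def run_tool(name: str, args: dict) -> dict:
    """Hypothetical dispatcher for your own backend tools."""
    if name == "check_calendar":
        return {"events": []}  # placeholder result
    raise ValueError(f"unknown tool: {name}")


async def handle_tool_call(conn, call_id: str, name: str, arguments: str) -> None:
    try:
        output = json.dumps(await run_tool(name, json.loads(arguments)))
    except Exception as exc:
        # Recovery behavior: give the model something natural to say instead of failing silently.
        output = json.dumps({"error": str(exc), "say": "I couldn't reach that system just now."})

    await conn.conversation.item.create(item={
        "type": "function_call_output",
        "call_id": call_id,
        "output": output,
    })
    # Ask for a new spoken response that reports the result or the fallback.
    await conn.response.create()
```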
Emerging voice AI patterns
OpenAI identifies three key patterns shaping voice applications:
Voice-to-action
Users describe tasks, and the system executes them using reasoning and tools.
- Example: Zillow-style assistants that can find homes, apply constraints, and schedule tours.
Systems-to-voice
Applications turn real-time context into spoken guidance.
- Example: Travel systems that provide updates on flight delays, gate changes, fastest routes, and baggage status.
Voice-to-voice
AI enables real-time multilingual conversations across users and contexts.
- Example: Deutsche Telekom-style systems that translate speech live during conversations.
These patterns can also combine. Priceline is working toward full trip management through voice, including flight search, hotel changes, delay handling, TSA updates, and translation during travel.
Safety and safeguards
The Realtime API includes multiple layers of safety and compliance protections:
- Active classifiers monitor live sessions and can stop conversations that violate safety rules
- Developers can add additional safeguards using the Agents SDK (a simple custom check is sketched after this list)
- Policies prohibit spam, deception, and harmful redistribution of outputs
- Developers must disclose when users are interacting with AI unless it is already obvious from context
- Supports EU Data Residency for regional compliance needs
- Covered under enterprise privacy commitments
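Independent of the Agents SDK, one simple extra safeguard is to run the text transcript of each user turn through the Moderations endpoint before acting on it; this is an application-level pattern of our own, not a feature of the Realtime API itself.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def turn_is_allowed(transcript: str) -> bool:
    """Return False if a user turn should end or redirect the voice session."""
    result = client.moderations.create(
        model="omni-moderation-latest",
        input=transcript,
    )
    return not result.results[0].flagged
```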
Pricing and availability
All three models are available in the Realtime API (a quick cost sketch follows the price list):
GPT-Realtime-2
- $32 per 1M audio input tokens
- $0.40 per 1M cached input tokens
- $64 per 1M audio output tokens
GPT-Realtime-Translate
- $0.034 per minute
GPT-Realtime-Whisper
- $0.017 per minute
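Putting those prices into a back-of-the-envelope estimate for a 10-minute session: the per-minute models are straightforward, while GPT-Realtime-2 depends on audio token usage, so the tokens-per-minute figures below are made-up placeholders to be replaced with measured values.

```python
MINUTES = 10

translate_cost = MINUTES * 0.034   # GPT-Realtime-Translate: $0.34
whisper_cost = MINUTES * 0.017     # GPT-Realtime-Whisper: $0.17

INPUT_TOKENS_PER_MIN = 800         # hypothetical; measure your own sessions
OUTPUT_TOKENS_PER_MIN = 600        # hypothetical
realtime2_cost = MINUTES * (
    INPUT_TOKENS_PER_MIN / 1_000_000 * 32     # $32 per 1M audio input tokens
    + OUTPUT_TOKENS_PER_MIN / 1_000_000 * 64  # $64 per 1M audio output tokens
)

print(f"translate=${translate_cost:.2f} whisper=${whisper_cost:.2f} realtime2=${realtime2_cost:.2f}")
```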
Getting started
Developers can test the models in the OpenAI Playground. They can also integrate GPT-Realtime-2 into applications using Codex or start building new realtime voice applications from scratch.
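A minimal first exchange, assuming the new models are exposed through the same SDK surface as today's Realtime API (beta.realtime.connect, conversation.item.create, response.create), could look like this; the model id remains a placeholder.

```python
import asyncio

from openai import AsyncOpenAI


async def main() -> None:
    client = AsyncOpenAI()
    async with client.beta.realtime.connect(model="gpt-realtime-2") as conn:  # hypothetical id
        await conn.conversation.item.create(item={
            "type": "message",
            "role": "user",
            "content": [{"type": "input_text", "text": "Introduce yourself in one sentence."}],
        })
        await conn.response.create(response={"modalities": ["audio", "text"]})
        async for event in conn:
            if event.type == "response.done":
                break  # a real app would also play the streamed audio deltas


if __name__ == "__main__":
    asyncio.run(main())
```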