The Live API enables low-latency, real-time voice and video interactions with Gemini. It processes continuous streams of audio, video, or text to deliver immediate, human-like spoken responses. This creates a natural conversational experience for your users.
Key features
The Live API offers a comprehensive set of features for building robust voice agents:
- Native audio: Provides natural, realistic-sounding speech and improved multilingual performance.
- Multilingual support: Converse in 24 supported languages.
- Voice activity detection (VAD): Automatically handles interruptions and turn-taking.
- Affective dialog: Adapts response style and tone to match the user's input expression.
- Proactive audio: Lets you control when the model responds and in what contexts.
- Thinking: Uses hidden reasoning tokens to "think" before speaking for complex queries.
- Tool use: Integrates tools like function calling and Google Search for dynamic interactions.
- Audio transcriptions: Provides text transcripts of both user input and model output.
- Speech-to-speech translation: Optimized for low-latency translation between languages.
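Most of these features are enabled through the session configuration you pass when opening a connection. The following is a minimal sketch assuming the google-genai Python SDK's LiveConnectConfig; the exact field names (for example, the transcription and voice settings shown) can vary between SDK versions, so treat them as assumptions and check the SDK reference.

```python
# Minimal sketch, assuming the google-genai Python SDK. Field names such as
# input_audio_transcription and output_audio_transcription are assumptions
# that may differ between SDK versions.
from google.genai import types

live_config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],  # spoken replies (native audio)
    speech_config=types.SpeechConfig(
        voice_config=types.VoiceConfig(
            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
        )
    ),
    tools=[types.Tool(google_search=types.GoogleSearch())],       # tool use
    input_audio_transcription=types.AudioTranscriptionConfig(),   # transcribe user audio
    output_audio_transcription=types.AudioTranscriptionConfig(),  # transcribe model audio
)
```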
Technical specifications
The following table outlines the technical specifications for the Live API:
| Category | Details |
|---|---|
| Input modalities | Audio (16 kHz PCM), video (1 FPS), text |
| Output modalities | Audio (24 kHz PCM), text |
| Protocol | Stateful WebSocket connection (WSS) |
| Latency | Real-time streaming for immediate feedback |
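As a sketch of what the stateful WebSocket protocol looks like on the wire, the following snippet opens a connection and sends the initial setup message. It assumes the Python websockets package and the v1beta BidiGenerateContent endpoint; the URL and message schema shown here are illustrative assumptions, so follow the WebSocket tutorial below for the authoritative flow.

```python
# Minimal sketch of the stateful WebSocket (WSS) handshake. The endpoint URI
# and message fields are assumptions and may change.
import asyncio
import json
import os

import websockets

URI = (
    "wss://generativelanguage.googleapis.com/ws/"
    "google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent"
    f"?key={os.environ['GEMINI_API_KEY']}"
)

async def main() -> None:
    async with websockets.connect(URI) as ws:
        # The first message on the connection configures the session.
        await ws.send(json.dumps({
            "setup": {
                "model": "models/gemini-live-2.5-flash-preview-native-audio-09-2025",
                "generationConfig": {"responseModalities": ["AUDIO"]},
            }
        }))
        # The server acknowledges setup before audio streaming begins
        # (16 kHz PCM in, 24 kHz PCM out).
        print(await ws.recv())

asyncio.run(main())
```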
Supported models
The following models support the Live API. Select the appropriate model based on your interaction requirements.
| Model ID | Availability | Use case | Key features |
|---|---|---|---|
| gemini-live-2.5-flash-preview-native-audio-09-2025 | Public preview | Cost-efficiency in real-time voice agents. | Native audio, Audio transcriptions, Voice activity detection, Affective dialog, Proactive audio, Tool use |
| gemini-2.5-flash-s2st-exp-11-2025 | Public experimental | Speech-to-Speech Translation (experimental). Optimized for translation tasks. | Native audio, Audio transcriptions, Tool use, Speech-to-speech translation |
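If you switch between the two models programmatically, a small hypothetical helper like the following keeps the choice in one place; the model IDs are the preview and experimental versions listed above and are expected to change as new releases ship.

```python
# Hypothetical helper for choosing a Live API model ID based on the use case.
def pick_live_model(translation: bool) -> str:
    if translation:
        return "gemini-2.5-flash-s2st-exp-11-2025"  # speech-to-speech translation
    return "gemini-live-2.5-flash-preview-native-audio-09-2025"  # general voice agent
```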
Architecture and integration
There are two primary ways to integrate the Live API into your application: server-to-server and client-to-server. Choose the one that fits your security and platform requirements.
Server-to-server
Server-to-server architecture is recommended for production deployments such as mobile apps, secure enterprise tools, and telephony integrations. Your client application streams audio to your secure backend server, which then manages the WebSocket connection to Google.
This method keeps your API keys secure and lets you modify audio or add logic before sending it to Gemini. However, it adds a small amount of network latency.
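A minimal relay along these lines, assuming the Python websockets package, might look like the following; the Gemini endpoint URI, port, and message handling are simplified assumptions, and a production server would add authentication, session setup, reconnection, and error handling.

```python
# Sketch of a server-to-server relay: your backend holds the API key and
# forwards traffic between the client app and the Live API. Assumes a recent
# version of the websockets package; details are simplified.
import asyncio
import os

import websockets

GEMINI_URI = (
    "wss://generativelanguage.googleapis.com/ws/"
    "google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent"
    f"?key={os.environ['GEMINI_API_KEY']}"
)

async def handle_client(client_ws) -> None:
    # One upstream Live API connection per client session.
    async with websockets.connect(GEMINI_URI) as gemini_ws:
        async def upstream():
            async for msg in client_ws:    # audio/text from your app
                await gemini_ws.send(msg)  # inspect or modify here if needed
        async def downstream():
            async for msg in gemini_ws:    # audio/text from Gemini
                await client_ws.send(msg)
        await asyncio.gather(upstream(), downstream())

async def main() -> None:
    async with websockets.serve(handle_client, "0.0.0.0", 8765):
        await asyncio.Future()  # run until cancelled

asyncio.run(main())
```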
Client-to-server
Client-to-server architecture is suitable for web apps, quick demos, and internal tools. The web browser connects directly to the Live API using WebSockets.
This method provides the lowest possible latency and a simpler architecture for demos. Be aware that this approach exposes API keys to the frontend user, which creates a security risk. For production, you must use careful proxying or ephemeral token management.
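One common pattern is a small backend endpoint that mints a short-lived credential for the browser, so the long-lived API key never leaves your server. The sketch below assumes Flask; create_ephemeral_token() is a hypothetical placeholder for whatever ephemeral-token mechanism you use.

```python
# Sketch of a token endpoint for the client-to-server pattern, assuming Flask.
# create_ephemeral_token() is a hypothetical placeholder.
from flask import Flask, jsonify

app = Flask(__name__)

def create_ephemeral_token() -> str:
    # Hypothetical placeholder: mint a short-lived, single-use credential here,
    # server-side, using your real API key. Never return the API key itself.
    raise NotImplementedError("wire up your token provider here")

@app.post("/live-token")
def live_token():
    # The browser calls this endpoint, then opens its own WebSocket to the
    # Live API using the returned short-lived token instead of an API key.
    return jsonify({"token": create_ephemeral_token()})
```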
Get started
Select the guide that matches your development environment:
Gen AI SDK tutorial
Connect to the Live API using the Gen AI SDK, send an audio file to Gemini, and receive audio in response.
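As a rough sketch of that flow, assuming the google-genai Python package, the following sends a PCM audio file and collects the audio reply; method and field names may differ slightly between SDK versions, so follow the tutorial for the exact API.

```python
# Sketch of the Gen AI SDK flow: send an audio file, receive audio back.
# Assumes the google-genai Python package; names may vary by SDK version.
import asyncio
from pathlib import Path

from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment
MODEL = "gemini-live-2.5-flash-preview-native-audio-09-2025"
CONFIG = types.LiveConnectConfig(response_modalities=["AUDIO"])

async def main() -> None:
    # Assumes input.pcm holds 16-bit, 16 kHz, mono PCM audio.
    pcm_bytes = Path("input.pcm").read_bytes()
    async with client.aio.live.connect(model=MODEL, config=CONFIG) as session:
        await session.send_realtime_input(
            audio=types.Blob(data=pcm_bytes, mime_type="audio/pcm;rate=16000")
        )
        audio_out = bytearray()
        async for message in session.receive():  # messages for one model turn
            if message.data:                      # 24 kHz PCM from the model
                audio_out.extend(message.data)
        Path("output.pcm").write_bytes(bytes(audio_out))

asyncio.run(main())
```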
WebSocket tutorial
Connect to the Live API using WebSockets, send an audio file to Gemini, and receive audio in response.
ADK tutorial
Create an agent and use Agent Development Kit (ADK) streaming to enable voice and video communication.
Run a demo web app
Set up and run a web application that enables you to use your voice and camera to talk to Gemini through the Live API.
Partner integrations
If you prefer a simpler development process, you can use Daily, LiveKit, or Voximplant. These third-party partner platforms have already integrated the Gemini Live API over the WebRTC protocol, streamlining the development of real-time audio and video applications.
