Configure language and voice

This document describes how to configure synthesized speech responses and voice activity detection in the Live API. You can configure responses in a variety of HD voices and languages, and configure voice activity detection (VAD) settings so that users can interrupt the model.

Set the language and voice

To set the response language and voice, configure as follows:

Console

  1. Open Vertex AI Studio > Stream realtime.
  2. In the Outputs expander, select a voice from the Voice drop-down.
  3. In the same expander, select a language from the Language drop-down.
  4. Click Start session to start the session.

Python

from google.genai.types import LiveConnectConfig, SpeechConfig, VoiceConfig, PrebuiltVoiceConfig

voice_name = "Aoede"  # any of the supported voice names listed below

config = LiveConnectConfig(
    response_modalities=["AUDIO"],
    speech_config=SpeechConfig(
        voice_config=VoiceConfig(
            prebuilt_voice_config=PrebuiltVoiceConfig(
                voice_name=voice_name,
            )
        ),
        language_code="en-US",
    ),
)
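
The Python client also generally accepts configuration as a plain dict whose keys mirror the typed fields above. A minimal sketch of that dict form (the nested key names follow the types shown above, and "Aoede" is just one pick from the supported voice list):

```python
# Plain-dict form of the same configuration; key names mirror the
# LiveConnectConfig fields above. "Aoede" is one of the supported voices.
config = {
    "response_modalities": ["AUDIO"],
    "speech_config": {
        "voice_config": {
            "prebuilt_voice_config": {"voice_name": "Aoede"},
        },
        "language_code": "en-US",
    },
}
```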
      

Voices supported

The Live API supports the following 30 voice options in the voice_name field:

Zephyr -- Bright
Kore -- Firm
Orus -- Firm
Autonoe -- Bright
Umbriel -- Easy-going
Erinome -- Clear
Laomedeia -- Upbeat
Schedar -- Even
Achird -- Friendly
Sadachbia -- Lively
Puck -- Upbeat
Fenrir -- Excitable
Aoede -- Breezy
Enceladus -- Breathy
Algieba -- Smooth
Algenib -- Gravelly
Achernar -- Soft
Gacrux -- Mature
Zubenelgenubi -- Casual
Sadaltager -- Knowledgeable
Charon -- Informative
Leda -- Youthful
Callirrhoe -- Easy-going
Iapetus -- Clear
Despina -- Smooth
Rasalgethi -- Informative
Alnilam -- Firm
Pulcherrima -- Forward
Vindemiatrix -- Gentle
Sulafat -- Warm
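
Because an unrecognized voice name is rejected at request time, it can be useful to validate input before building the config. A small illustrative guard (the helper and its fallback choice are not part of the API, just a sketch over the list above):

```python
# The 30 voice names supported by the Live API, as listed above.
SUPPORTED_VOICES = {
    "Zephyr", "Kore", "Orus", "Autonoe", "Umbriel", "Erinome",
    "Laomedeia", "Schedar", "Achird", "Sadachbia", "Puck", "Fenrir",
    "Aoede", "Enceladus", "Algieba", "Algenib", "Achernar", "Gacrux",
    "Zubenelgenubi", "Sadaltager", "Charon", "Leda", "Callirrhoe",
    "Iapetus", "Despina", "Rasalgethi", "Alnilam", "Pulcherrima",
    "Vindemiatrix", "Sulafat",
}

def resolve_voice(name: str, default: str = "Puck") -> str:
    """Return name if it is a supported voice; otherwise fall back to default."""
    return name if name in SUPPORTED_VOICES else default
```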

Languages supported

The Live API supports the following 24 languages:

Language -- BCP-47 code

Arabic (Egyptian) -- ar-EG
German (Germany) -- de-DE
English (US) -- en-US
Spanish (US) -- es-US
French (France) -- fr-FR
Hindi (India) -- hi-IN
Indonesian (Indonesia) -- id-ID
Italian (Italy) -- it-IT
Japanese (Japan) -- ja-JP
Korean (Korea) -- ko-KR
Portuguese (Brazil) -- pt-BR
Russian (Russia) -- ru-RU
Dutch (Netherlands) -- nl-NL
Polish (Poland) -- pl-PL
Thai (Thailand) -- th-TH
Turkish (Turkey) -- tr-TR
Vietnamese (Vietnam) -- vi-VN
Romanian (Romania) -- ro-RO
Ukrainian (Ukraine) -- uk-UA
Bengali (Bangladesh) -- bn-BD
English (India) -- en-IN & hi-IN bundle
Marathi (India) -- mr-IN
Tamil (India) -- ta-IN
Telugu (India) -- te-IN
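
If you select the language programmatically, the table can be mirrored as a small mapping from display name to BCP-47 code. An illustrative subset (this mapping is not an SDK constant; extend it with the remaining rows as needed):

```python
# Subset of the supported-language table above, keyed by display name.
LANGUAGE_CODES = {
    "English (US)": "en-US",
    "German (Germany)": "de-DE",
    "Japanese (Japan)": "ja-JP",
    "Portuguese (Brazil)": "pt-BR",
}

def code_for(language: str) -> str:
    """Look up the BCP-47 code for a display name; raise on unknown input."""
    try:
        return LANGUAGE_CODES[language]
    except KeyError:
        raise ValueError(f"Unsupported language: {language}") from None
```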

Configure voice activity detection

Voice activity detection (VAD) allows the model to recognize when a person is speaking. This is essential for creating natural conversations, because it allows a user to interrupt the model at any time.

When VAD detects an interruption, the ongoing generation is canceled and discarded; only the content already sent to the client is retained in the session history. The server reports the interruption with a BidiGenerateContentServerContent message, then discards any pending function calls and sends another BidiGenerateContentServerContent message containing the IDs of the canceled calls.
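
On the client side, the interruption signal is the cue to stop playing any audio that has been received but not yet played. A minimal sketch of that bookkeeping, using simplified dict messages rather than the actual protocol types:

```python
def handle_server_messages(messages):
    """Buffer audio chunks from server messages, flushing unplayed audio
    when the server flags an interruption. The dict shape here is a
    simplified stand-in for BidiGenerateContentServerContent."""
    playback_buffer = []
    for msg in messages:
        if msg.get("interrupted"):
            playback_buffer.clear()  # discard canceled, unplayed audio
        elif "audio" in msg:
            playback_buffer.append(msg["audio"])
    return playback_buffer
```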

Python

from google.genai import types

config = {
    "response_modalities": ["AUDIO"],
    "realtime_input_config": {
        "automatic_activity_detection": {
            "disabled": False,  # default: automatic VAD is enabled
            "start_of_speech_sensitivity": types.StartSensitivity.START_SENSITIVITY_LOW,
            "end_of_speech_sensitivity": types.EndSensitivity.END_SENSITIVITY_LOW,
            "prefix_padding_ms": 20,     # audio retained before the detected start of speech
            "silence_duration_ms": 100,  # silence required to mark the end of speech
        }
    }
}
      

What's next