Configure a Session
Every connection starts with `session.created`. Immediately follow with `session.update` to configure your session. Here you can set:
- `modelId` — LLM provider and model (e.g. `openai/gpt-4.1-nano`) or router (e.g. `inworld/latency-optimizer-ab-test`)
- `instructions`
- `output_modalities` — `["audio", "text"]`, `["audio"]`, or `["text"]`
- Audio input and output configuration — voice, TTS model, PCM format, speed
- `max_output_tokens` (`"inf"` or a numeric ceiling)
- `tools` (function definitions) and `tool_choice` settings
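Putting the fields above together, a full `session.update` payload might look like the following sketch. The `session` wrapper object and the example instruction text are assumptions; the field names inside follow this guide.

```python
import json

# Sketch of a session.update sent right after session.created.
# The top-level "session" wrapper is an assumption; field names
# follow this guide.
session_update = {
    "type": "session.update",
    "session": {
        "modelId": "openai/gpt-4.1-nano",
        "instructions": "You are a concise, friendly assistant.",
        "output_modalities": ["audio", "text"],
        "audio": {
            "output": {
                "voice": "Dennis",
                "model": "inworld-tts-1.5-mini",
            }
        },
        "max_output_tokens": "inf",
    },
}

message = json.dumps(session_update)  # send this string over the WebSocket
```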
Choose a Router or LLM
Set `modelId` in `session.update` to select which SmartRouter or LLM handles the conversation. The format is `provider/modelName` or `inworld/routerId`.
If you don't set `modelId`, the default model (`google-ai-studio/gemini-2.5-flash`) is used. You can change the model mid-session with a partial update — the new model takes effect on the next response.
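For example, a partial update that switches to a router mid-session only needs the field being changed (the `session` wrapper is an assumption):

```python
import json

# Partial session.update: send only the field you want to change.
# The new model takes effect on the next response.
switch_model = {
    "type": "session.update",
    "session": {"modelId": "inworld/latency-optimizer-ab-test"},
}
payload = json.dumps(switch_model)
```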
Choose a Voice
Set `audio.output.voice` to control the agent's speaking voice.
The default voice is `Dennis`. Browse available voices in the TTS Playground or list them programmatically with the List Voices API.
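A minimal voice change, assuming a `session`-wrapped `session.update` envelope as elsewhere in this guide:

```python
import json

# Change only the speaking voice; other session settings are untouched.
set_voice = {
    "type": "session.update",
    "session": {"audio": {"output": {"voice": "Dennis"}}},
}
payload = json.dumps(set_voice)
```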
Choose a TTS Model
Set `audio.output.model` to select the text-to-speech model:
| Model | Size | Notes |
|---|---|---|
| `inworld-tts-1.5-mini` | 1B | Faster inference, lower latency (default) |
| `inworld-tts-1.5-max` | 8B | Higher quality audio |
If you don't set a TTS model, the default is `inworld-tts-1.5-mini`. You can change the TTS model mid-session, either alongside the voice or independently.
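As a sketch, upgrading to the larger TTS model mid-session (the voice field can be included in the same update or omitted; the `session` wrapper is an assumption):

```python
import json

# Switch to the higher-quality TTS model; voice can change in the
# same update or be left as-is.
upgrade_tts = {
    "type": "session.update",
    "session": {
        "audio": {
            "output": {
                "model": "inworld-tts-1.5-max",
                "voice": "Dennis",
            }
        }
    },
}
payload = json.dumps(upgrade_tts)
```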
Send Input
Audio
There are two ways to send audio input.

Method 1: Streaming Audio (Real-time)

Use `input_audio_buffer.*` events for streaming real-time audio from a microphone:
- Convert microphone data to PCM16, 24 kHz, mono.
- Send chunks via `input_audio_buffer.append`.
- VAD automatically detects speech boundaries and commits the buffer.
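The steps above can be sketched as follows. The `audio` field name and base64 encoding of the PCM bytes are assumptions common to JSON audio APIs:

```python
import base64
import json

def append_chunk(pcm16_bytes: bytes) -> str:
    """Wrap one chunk of PCM16 (24 kHz, mono) microphone audio as an
    input_audio_buffer.append event. The raw bytes are base64-encoded
    so they can travel inside JSON."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm16_bytes).decode("ascii"),
    })

# With VAD enabled there is no explicit commit step: the server detects
# speech boundaries and commits the buffer automatically.
```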
Method 2: Conversation Items (Pre-recorded)

Use `conversation.item.create` with the `input_audio` content type for pre-recorded audio chunks:
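A sketch of a pre-recorded audio turn as an explicit conversation item. The `item`/`content` field layout here is an assumption based on this guide's event names:

```python
import base64
import json

recording = b"\x00\x01\x02\x03"  # placeholder bytes; use real PCM16 audio

# Pre-recorded audio sent as an explicit user message item.
audio_item = {
    "type": "conversation.item.create",
    "item": {
        "type": "message",
        "role": "user",
        "content": [{
            "type": "input_audio",
            "audio": base64.b64encode(recording).decode("ascii"),
        }],
    },
}
payload = json.dumps(audio_item)
```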
Text
Create explicit conversation items for text turns.

Function Calling
The Realtime API supports function calling so your agent can fetch live data or trigger actions mid-conversation. Define functions in `session.tools`, then handle calls as they arrive.
1. Register a tool
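As a sketch, registering a horoscope tool (the tool schema layout is an assumption modeled on JSON-Schema-style function definitions; the `session` wrapper is assumed as elsewhere):

```python
import json

# Register one function tool in session.tools. The schema shape is a
# sketch; get_horoscope matches the example this guide walks through.
register_tools = {
    "type": "session.update",
    "session": {
        "tools": [{
            "type": "function",
            "name": "get_horoscope",
            "description": "Get today's horoscope for an astrological sign.",
            "parameters": {
                "type": "object",
                "properties": {
                    "sign": {
                        "type": "string",
                        "description": "Astrological sign, e.g. Aquarius",
                    },
                },
                "required": ["sign"],
            },
        }],
        "tool_choice": "auto",
    },
}
payload = json.dumps(register_tools)
```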
2. Handle the function call
When the model decides to call a function, you receive a `response.function_call_arguments.done` event with the `call_id`, function name, and serialized arguments. Execute your logic, then return the result:
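A minimal handler sketch: parse the arguments, run your own logic (the hard-coded horoscope here is a stand-in), then send back a `function_call_output` item followed by `response.create`. The `function_call_output` item shape is an assumption:

```python
import json

def handle_function_call(event: dict) -> list[str]:
    """Given a response.function_call_arguments.done event, run the
    function and return the two messages to send back: the output item
    and a new response.create."""
    args = json.loads(event["arguments"])
    # Stand-in for real logic (lookup, API call, database query, ...).
    result = {"horoscope": f"{args['sign']}: a fine day for conversation."}
    output_item = {
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            "call_id": event["call_id"],  # must echo the incoming call_id
            "output": json.dumps(result),
        },
    }
    # response.create asks the model to continue with the result in context.
    return [json.dumps(output_item), json.dumps({"type": "response.create"})]
```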
3. What happens next
After `response.create`, the model incorporates the function output and continues the conversation — speaking the horoscope aloud (if `output_modalities` includes audio) or streaming text deltas. The user hears the answer without any gap in the conversation flow.
You can register multiple tools and the model will call them as needed. Each call arrives as a separate `response.function_call_arguments.done` event with its own `call_id`.
Manage Conversation State
Use conversation events to keep context lean:

- `conversation.item.retrieve`: pull any prior item by ID.
- `conversation.item.delete`: remove items that should not remain in context.
- `conversation.item.truncate`: drop everything before a cutoff to manage token usage.
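The three state-management events as minimal payload sketches. The `item_id` field name is an assumption, and `item_123` is a placeholder for a real item ID you received earlier:

```python
import json

# Placeholder ID; in practice use an ID from a prior conversation event.
item_id = "item_123"

retrieve = {"type": "conversation.item.retrieve", "item_id": item_id}
delete = {"type": "conversation.item.delete", "item_id": item_id}
# Drops everything before this item to keep token usage in check.
truncate = {"type": "conversation.item.truncate", "item_id": item_id}

messages = [json.dumps(e) for e in (retrieve, delete, truncate)]
```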
Combine `max_output_tokens` and `response.cancel` to control overall cost (see the conversation management guide).
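For instance, capping response length and cancelling an in-flight response you no longer need (the `session` wrapper is an assumption; `response.cancel` takes no extra fields in this sketch):

```python
import json

# Cap how long any single response can run.
cap_tokens = {"type": "session.update", "session": {"max_output_tokens": 512}}

# Stop an in-flight response (e.g. the user changed their mind),
# saving both tokens and audio time.
cancel = {"type": "response.cancel"}

cap_msg, cancel_msg = json.dumps(cap_tokens), json.dumps(cancel)
```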
Monitor Errors
Handle `error` events (with `type`, `code`, and `param`) and implement a reconnection/backoff strategy for transient failures. See the API reference for error event schemas.
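A common reconnection pattern is exponential backoff with jitter; the helper below is a generic sketch, not part of the API:

```python
import random

def backoff_delays(retries: int, base: float = 0.5, cap: float = 30.0) -> list[float]:
    """Delays (in seconds) to wait between reconnection attempts after a
    transient error or dropped connection: exponential growth, capped,
    with full jitter so many clients don't reconnect in lockstep."""
    return [min(cap, base * 2 ** attempt) * random.random()
            for attempt in range(retries)]

# Sleep for each delay in turn before retrying the WebSocket connection;
# give up (or alert) once the list is exhausted.
```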