Configure a session
For WebSocket, the connection starts with asession.created event. For WebRTC, send session.update as soon as the data channel opens. In both cases, use session.update to configure your session. Here you can set:
model— LLM provider and model (e.g.openai/gpt-4.1-nano) or router (e.g.inworld/latency-optimizer-ab-test)instructionsoutput_modalities(["audio", "text"],["audio"], or["text"])- Audio input and output configuration — voice, TTS model, PCM format, speed
max_output_tokens("inf"or a numeric ceiling)tools(function definitions) andtool_choicesettingsproviderData— Inworld extensions for STT, TTS, memory, back-channel, and responsiveness (see Inworld Realtime API Extensions)
STT (Speech-to-Text)
Choose an STT model
Setaudio.input.transcription.model to select the speech-to-text model used to transcribe user audio. inworld/inworld-stt-1 is the recommended default for most realtime voice agents; pick a third-party model when its specific strength (sub-300ms latency, semantic end-of-turn, etc.) matters for your use case.
| Model | Best for |
|---|---|
inworld/inworld-stt-1 | Inworld’s first-party STT with configurable turn-taking controls. Recommended default. |
assemblyai/u3-rt-pro | High-accuracy, sub-300ms latency, multilingual streaming (English, Spanish, French, German, Italian, Portuguese) |
assemblyai/universal-streaming-multilingual | Multilingual streaming across the same six languages |
assemblyai/universal-streaming-english | English-optimized streaming |
soniox/stt-rt-v4 | High-accuracy real-time streaming with semantic end-of-turn detection and multilingual support |
error event (type: "invalid_request_error", code: "invalid_value", param: "session.audio.input.transcription.model") and the rest of the session.update is not applied. See STT Introduction for the full model catalogue and comparison.
Transcription hints
Guide the STT decoder with a prompt (vocabulary, domain context, formatting preferences). This is the OpenAI-standardaudio.input.transcription.prompt field and is portable across OpenAI-compatible SDKs:
| Field | Type | Description |
|---|---|---|
model | string | STT model ID. See Choose an STT model. |
prompt | string | Transcription guidance: vocabulary hints, domain context, formatting preferences. |
language | string | BCP-47 language code (e.g. "en", "es"). Optional; the model auto-detects when omitted. |
Tune turn detection
Turn detection — when the server decides a user has finished speaking — is controlled by the OpenAI-standardaudio.input.turn_detection object. The Realtime API supports both VAD types and is wire-compatible with the OpenAI SDK.
semantic_vad
Model-based end-of-turn detection backed by the STT stream. eagerness is the primary tuning knob.
| Field | Type | Description |
|---|---|---|
type | string | "semantic_vad" (default) |
eagerness | string | How aggressively to end turns: "low", "medium", "high", "auto". Lower eagerness requires stronger end-of-turn confidence; higher eagerness commits to end-of-turn sooner. |
create_response | boolean | Auto-create a response on turn end (default true) |
interrupt_response | boolean | Interrupt the active response when the user speaks (default true) |
eagerness maps to a full set of four STT turn-detection parameters — confidence threshold, VAD threshold, minimum end-of-turn silence, and maximum within-turn silence. Lower thresholds and shorter silences mean the STT model commits to end-of-turn sooner (more eager).
eagerness | end_of_turn_confidence_threshold | vad_threshold | min_end_of_turn_silence (ms) | max_turn_silence (ms) |
|---|---|---|---|---|
low | 0.85 | 0.5 | 400 | 3000 |
medium | 0.70 | 0.5 | 160 | 2400 |
auto | 0.70 | 0.5 | 160 | 2400 |
high | 0.55 | 0.3 | 80 | 1200 |
auto mirrors medium until router-side adaptive logic exists. Any explicit field under providerData.stt (see STT extensions below) overrides the eagerness-derived default for that field — fields you do not set keep the eagerness mapping.
server_vad
Inworld-hosted Silero VAD + Smart Turn detector. Tunable fields match OpenAI’s server_vad shape and can be changed mid-session via partial session.update.
| Field | Type | Description |
|---|---|---|
type | string | "server_vad" |
threshold | number | Silero VAD speech cutoff, 0.0–1.0. Default 0.5. |
prefix_padding_ms | integer | Pre-speech audio retained before an utterance, in ms. Default 200. |
silence_duration_ms | integer | Trailing silence required to finalize the turn, in ms. Default 1000. |
idle_timeout_ms | integer | null | When set, the server emits input_audio_buffer.timeout_triggered after this many ms with no detected speech. null or 0 disables. |
create_response | boolean | Auto-create a response on turn end (default true) |
interrupt_response | boolean | Interrupt the active response when the user speaks (default true) |
session.update — omit a field to keep its current value. Changes take effect on the next audio chunk processed.
See Voice Activity Detection (VAD) for the VAD event lifecycle.
Audio input formats
Set the wire format for client → server audio underaudio.input.format. Four formats are supported; pick based on your source. The same catalog applies to audio.output.format (covered under TTS below).
type | Encoding | Sample rate | When to use |
|---|---|---|---|
audio/pcm | Signed 16-bit little-endian PCM | rate (default 24000) | Default for browser, mobile, and most server-side sources. Send mono. |
audio/pcmu | G.711 μ-law | Fixed 8000 Hz (ignore rate) | Telephony (Twilio Media Streams, SIP trunks in North America/Japan). |
audio/pcma | G.711 A-law | Fixed 8000 Hz (ignore rate) | Telephony (SIP trunks in Europe and most of the rest of the world). |
audio/float32 | 32-bit float PCM, little-endian | rate (default 24000) | Pipelines that natively produce float32 samples (some audio frameworks). |
input_audio_buffer.append).
format as a bare string — "pcm16", "g711_ulaw", "g711_alaw", or "float32" — and the server expands it to the object form above.
The server resamples to 16 kHz internally for STT, so PCM input rate doesn’t need to match the STT model’s native rate. Send any rate that’s convenient; 24000 and 8000 (G.711) are the common choices.
See Telephony with Twilio for a worked example of the G.711 path.
Send audio input
There are two ways to send audio input: Method 1: Streaming Audio (Real-time) Useinput_audio_buffer.* events for streaming real-time audio from a microphone:
- Encode microphone data in your chosen input format (PCM16 at 24 kHz is the default).
- Send chunks via
input_audio_buffer.append. - VAD automatically detects speech boundaries and commits the buffer.
conversation.item.create with input_audio content type for pre-recorded audio chunks:
STT extensions
Inworld extensions for STT live underproviderData.stt — voice profile signals, language hints (Soniox), and explicit overrides for the four turn-detection parameters that semantic_vad.eagerness controls implicitly. Full field reference and the voice-profile payload shape are in providerData.stt.
LLM
Choose a router or LLM
Setmodel in session.update to select which Router or LLM handles the conversation. The format is provider/modelName or inworld/routerId:
model, the default model (google-ai-studio/gemini-2.5-flash) is used. You can change the model mid-session with a partial update — the new model takes effect on the next response.
Send text input
Create explicit conversation items for text turns:Function calling
The Realtime API supports function calling so your agent can fetch live data or trigger actions mid-conversation. Define functions insession.tools, then handle calls as they arrive.
1. Register a tool
2. Handle the function call
When the model decides to call a function, you receive aresponse.function_call_arguments.done event with the call_id, function name, and serialized arguments. Execute your logic, then return the result:
3. What happens next
Afterresponse.create, the model incorporates the function output and continues the conversation — speaking the horoscope aloud (if output_modalities includes audio) or streaming text deltas. The user hears the answer without any gap in the conversation flow.
You can register multiple tools and the model will call them as needed. Each call arrives as a separate response.function_call_arguments.done event with its own call_id.
Memory
Inworld’s automatic conversation memory layer extracts durable facts and a rolling summary, prepends them to the system prompt, and trims older transcript items so context stays bounded. Configured underproviderData.memory. See providerData.memory for the field reference, and Long-term Memory for the cross-session persistence pattern.
TTS (Text-to-Speech)
Choose a TTS model
Setaudio.output.model to select the text-to-speech model:
| Model | Size | Notes |
|---|---|---|
inworld-tts-2 | 8B | Higher quality audio. Recommended for most agents. Required for providerData.tts.conversational and the CREATIVE delivery mode. |
inworld-tts-1.5-mini | 1B | Faster inference, lower latency. Server default when audio.output.model is omitted. |
inworld-tts-2 for quality; switch to inworld-tts-1.5-mini if you’re optimizing for raw latency or running at high concurrency. You can change the TTS model mid-session alongside voice or independently.
Choose a voice
Setaudio.output.voice to control the agent’s speaking voice:
Dennis. Browse available voices in the TTS Playground or list them programmatically with the List Voices API.
Audio output format
Set the wire format for server → client audio underaudio.output.format. The catalog is identical to Audio input formats above — audio/pcm, audio/pcmu, audio/pcma, or audio/float32. Default is PCM16 at 24 kHz.
audio.output.format.rate you request, so any reasonable rate is accepted.
TTS extensions
Inworld extensions for TTS live underproviderData.tts — segmentation strategy, steering handling, synthesis language, the TTS-2 delivery preset, and (for TTS-2) conversational mode that preserves a shared upstream context across turns. Full field reference, segmenter strategy table, and conversational-mode details are in providerData.tts.
Managing the session
Conversation state
Use conversation events to keep context lean:conversation.item.retrieve: pull any prior item by ID.conversation.item.delete: remove items that should not remain in context.
max_output_tokens and response.cancel to control overall cost (conversation management guide).
Observing usage
response.done carries a response.usage block on every response — including cancelled responses (barge-in, supersede). The base fields (total_tokens, input_tokens, output_tokens, plus input_token_details / output_token_details) cover LLM accounting, and three optional sub-objects attribute usage per modality:
| Field | Type | Description |
|---|---|---|
usage.llm.model | string | Effective upstream LLM after router resolution. Useful when you sent inworld/auto and want to see which model the router picked. |
usage.tts.model | string | TTS model used (e.g. inworld-tts-2). |
usage.tts.characters | integer | Characters synthesized across all TTS segments of this response. |
usage.tts.audio_seconds | number | Assistant audio duration emitted by TTS, in seconds. The canonical TTS billing signal. |
usage.stt.model | string | STT model used (e.g. soniox/stt-rt-v4). |
usage.stt.audio_seconds | number | User audio duration transcribed for this turn, in seconds. Drained per response.done from a rolling per-session counter — each response sees only the user audio that arrived since the previous response.done. |
stt).
input_token_details / output_token_details breakdowns, see the response.done event.
Monitor errors
Handleerror events (with type, code, and param) and implement a reconnection/backoff strategy for transient failures. See the API reference for error event schemas.