Inworld Realtime API Extensions - Inworld AI Documentation

The Inworld Realtime API is wire-compatible with the OpenAI Realtime spec — clients written against OpenAI’s session, audio, and response events work against Inworld unchanged. On top of that baseline, Inworld layers production-grade extensions that improve quality, latency, and conversational naturalness:

STT tuning — voice profile signals, language hints, explicit end-of-turn and VAD overrides
TTS segmentation, steering, and alignment — pick how the LLM token stream is chunked into TTS calls, the synthesis language, the TTS-2 delivery preset, (for TTS-2) a shared multi-turn context, and opt into word/character-level timestamp alignment for lip-sync or captions
Automatic conversation memory — periodic summarization and fact extraction that keep long sessions inside the context window
Back-channel — short interjections ("uh-huh", "I see") emitted while the user is still speaking, so the agent feels like an active listener
Responsiveness fillers — short filler audio ("let me think") spoken in the gap after a user turn if the main LLM is slow to produce its first delta

Everything Inworld adds beyond the OpenAI spec is exposed through a single field on the session object: providerData. Send it inside any session.update and the server merges it with the current state. Most fields hot-swap mid-session and take effect on the next audio chunk or turn; the exceptions, which are locked at session open, are listed in the hot-swap reference at the bottom of this page.

This page is the field-by-field reference for the full providerData surface. For task-driven walkthroughs (language switching, conversation management, etc.) and for the event-handling client code that pairs with back-channel and responsiveness, see the linked guides under each branch.

Branch overview

providerData is a flat object with five branches. Each branch is independent — send only the ones you want to configure.

Branch	Purpose	Hot-swap
`stt`	STT tuning (voice profile, language hints, end-of-turn thresholds, VAD overrides)	Yes — STT stream is restarted on the next audio chunk
`tts`	TTS segmentation, language, delivery preset, conversational context, timestamp alignment	Mostly — `conversational` and `user_turn_mode` are locked at session open
`memory`	Automatic conversation memory and summarization	Yes
`backchannel`	Short interjections while the user is speaking	Yes
`responsiveness`	Filler audio while the main LLM warms up after a user turn	Yes

ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    providerData: {
      stt: { /* ... */ },
      tts: { /* ... */ },
      memory: { /* ... */ },
      backchannel: { /* ... */ },
      responsiveness: { /* ... */ }
    }
  }
}));

Partial updates are supported on every branch: omit a field to keep its current value. Beyond the five branches, providerData also accepts a top-level behavior flag, auto_tool_response, the user_id and metadata fields, and the caching and text_generation_config objects. These aren’t configuration branches. See Tool continuation, Text generation config, Prompt caching, and Session metadata below.
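Because partial updates keep omitted fields, a client that wants to know the effective configuration has to track it locally. The sketch below is client-side bookkeeping only: `applyProviderPatch` and `sessionUpdate` are hypothetical helpers, not part of the API, and the field-by-field merge is an assumption based on the description above. It does not model special cases such as the empty-object reset documented for `backchannel`.

```javascript
// Hypothetical client-side mirror of the merge described above:
// object-valued entries merge field by field, scalars are replaced.
function applyProviderPatch(current, patch) {
  const next = { ...current };
  for (const [key, value] of Object.entries(patch)) {
    const isObject = value !== null && typeof value === 'object' && !Array.isArray(value);
    next[key] = isObject ? { ...(current[key] || {}), ...value } : value;
  }
  return next;
}

// Wrap a providerData patch in a session.update event.
function sessionUpdate(patch) {
  return JSON.stringify({ type: 'session.update', session: { providerData: patch } });
}

let local = { stt: { vad_threshold: 0.5, max_turn_silence: 5000 } };
const patch = { stt: { vad_threshold: 0.6 }, auto_tool_response: false };
local = applyProviderPatch(local, patch);
// ws.send(sessionUpdate(patch));  // ws is your open WebSocket
console.log(local);
```

After the patch, the local copy still holds `max_turn_silence: 5000`, because the patch did not mention it.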

Tool continuation

providerData.auto_tool_response controls who starts the next response after the client adds a function_call_output item. Setting it to false is useful if you are migrating from OpenAI, or you want consecutive tool calls followed by only one response.

Value	Behavior
`true` (default)	Inworld automatically starts a follow-up response. Do not send `response.create` after sending function call output.
`false`	OpenAI-compatible behavior: the client must send `response.create` after the tool output.

We recommend keeping the default, because the follow-up response then starts as soon as the tool output arrives. Set it to false only when necessary (e.g., when a third-party integration requires OpenAI-compatible behavior).

ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    providerData: {
      auto_tool_response: false
    }
  }
}));

The field is hot-swappable. Omitting it from a later partial update preserves its current value.
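The two modes differ only in whether the client follows the tool output with `response.create`. A minimal sketch, assuming the OpenAI Realtime `conversation.item.create` event with a `function_call_output` item; `toolResultEvents` and the `call_123` identifier are hypothetical:

```javascript
// Build the events to send after a tool finishes.
// autoToolResponse mirrors the providerData.auto_tool_response value you configured.
function toolResultEvents(callId, result, autoToolResponse = true) {
  const events = [{
    type: 'conversation.item.create',
    item: {
      type: 'function_call_output',
      call_id: callId,
      output: JSON.stringify(result)
    }
  }];
  // Only the OpenAI-compatible mode needs an explicit response.create.
  if (!autoToolResponse) events.push({ type: 'response.create' });
  return events;
}

// Usage (ws is your open WebSocket):
// for (const e of toolResultEvents('call_123', { temp: 21 }, false)) ws.send(JSON.stringify(e));
```

When several tool calls should be followed by a single response, set the flag to false, send each output with `toolResultEvents(id, result, true)` so that no `response.create` is appended, and send one `response.create` yourself at the end.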

STT (`providerData.stt`)

Inworld extensions to the OpenAI-standard STT config. Every field here is hot-swappable; the STT stream is restarted automatically so the next chunk of audio uses the new value.

ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    audio: {
      input: { transcription: { model: 'inworld/inworld-stt-1' } }
    },
    providerData: {
      stt: {
        prompt: 'Medical dictation. Vocabulary: angioplasty.',
        voice_profile: true,
        language_hints: ['en-US', 'es-MX'],
        end_of_turn_confidence_threshold: 0.7,
        min_end_of_turn_silence: 200,
        max_turn_silence: 5000,
        vad_threshold: 0.5
      }
    }
  }
}));

Field	Type	Description
`prompt`	string	Transcription guidance (vocabulary hints, domain context, formatting preferences). Equivalent to `audio.input.transcription.prompt`.
`voice_profile`	boolean	When `true`, attach voice-profile signals (age, gender, emotion, vocal style, accent) to transcription events under `providerData.voiceProfile`. See Voice profile payload below for the returned shape.
`language_hints`	string[]	BCP-47-style hints that bias recognition without committing to a single language. Soniox-specific (`soniox/stt-rt-v4`); ignored by other models.
`end_of_turn_confidence_threshold`	number	STT end-of-turn confidence cutoff (`0.0`–`1.0`). Explicit override of the `semantic_vad.eagerness` mapping.
`vad_threshold`	number	Speech/silence VAD cutoff (`0.0`–`1.0`). Explicit override of the eagerness mapping.
`min_end_of_turn_silence`	integer	Minimum trailing silence (ms) before STT considers a turn finished. Explicit override of the eagerness mapping.
`max_turn_silence`	integer	Hard ceiling (ms) on within-turn silence before STT force-closes the turn. Explicit override of the eagerness mapping.

For the eagerness preset that these fields override, see semantic_vad.
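The numeric overrides have documented ranges, so it can help to check them before sending. The sketch below is a hypothetical client-side validator (`validateSttOverrides` is not part of the API). The `0.0` to `1.0` range comes from the table above; rejecting negative millisecond values is an assumption.

```javascript
const UNIT_RANGE_FIELDS = ['end_of_turn_confidence_threshold', 'vad_threshold'];
const MILLISECOND_FIELDS = ['min_end_of_turn_silence', 'max_turn_silence'];

// Return a list of human-readable problems; an empty list means the overrides look sane.
function validateSttOverrides(stt) {
  const problems = [];
  for (const key of UNIT_RANGE_FIELDS) {
    const v = stt[key];
    if (v !== undefined && !(typeof v === 'number' && v >= 0 && v <= 1)) {
      problems.push(`${key} must be a number between 0.0 and 1.0`);
    }
  }
  for (const key of MILLISECOND_FIELDS) {
    const v = stt[key];
    // Non-negative is an assumption; the reference only says "integer".
    if (v !== undefined && !(Number.isInteger(v) && v >= 0)) {
      problems.push(`${key} must be a non-negative integer (ms)`);
    }
  }
  return problems;
}

console.log(validateSttOverrides({ vad_threshold: 1.5, min_end_of_turn_silence: 200 }));
```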

Voice profile payload

When providerData.stt.voice_profile is true, every conversation.item.input_audio_transcription.delta and conversation.item.input_audio_transcription.completed event carries a providerData.voiceProfile object alongside the transcript text:

{
  "type": "conversation.item.input_audio_transcription.completed",
  "event_id": "evt_5f7d2",
  "item_id": "item_aud_01HF…",
  "content_index": 0,
  "transcript": "Hello, how are you?",
  "providerData": {
    "voiceProfile": {
      "age":         [{ "label": "adult",          "confidence": 0.78 }],
      "gender":      [{ "label": "female",         "confidence": 0.91 }],
      "emotion":     [{ "label": "neutral",        "confidence": 0.65 }],
      "vocal_style": [{ "label": "conversational", "confidence": 0.82 }],
      "accent":      [{ "label": "en-US",          "confidence": 0.88 }]
    }
  }
}

Each top-level key is an array of { label, confidence } objects sorted by descending confidence. Keys are omitted when the STT backend does not produce labels for that category, so always null-check before reading. Confidence values are in [0.0, 1.0].

Category	Notes
`age`	Estimated age band of the speaker.
`gender`	Estimated gender of the speaker.
`emotion`	Detected emotional tone in the current segment. Can shift across deltas within a single turn.
`vocal_style`	Speaking style (e.g. `conversational`, `narration`, `whisper`, `monotone`).
`accent`	Regional accent or dialect as a BCP-47-like locale code (e.g. `en-US`, `en-GB`).

Voice profile is computed by the realtime service regardless of the STT backend, so voice_profile: true works across all supported STT models.
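Since any category may be missing, reads should be defensive. A small sketch of that null-check; `topVoiceLabel` and its `minConfidence` cutoff are hypothetical helpers, not part of the API:

```javascript
// Return the highest-confidence label for a category, or null when the
// category (or the whole voiceProfile object) is absent.
function topVoiceLabel(event, category, minConfidence = 0) {
  const entries = event?.providerData?.voiceProfile?.[category];
  if (!Array.isArray(entries) || entries.length === 0) return null;
  const best = entries[0]; // arrays are sorted by descending confidence
  return best.confidence >= minConfidence ? best.label : null;
}

const sample = {
  type: 'conversation.item.input_audio_transcription.completed',
  transcript: 'Hello, how are you?',
  providerData: {
    voiceProfile: { emotion: [{ label: 'neutral', confidence: 0.65 }] }
  }
};
console.log(topVoiceLabel(sample, 'emotion'));      // neutral
console.log(topVoiceLabel(sample, 'accent'));       // null (key omitted)
console.log(topVoiceLabel(sample, 'emotion', 0.8)); // null (below cutoff)
```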

TTS (`providerData.tts`)

Controls how the LLM text stream is segmented and forwarded to the TTS backend, the language and delivery preset used for synthesis, (for TTS-2) whether a shared upstream context is preserved across turns, and opt-in timestamp alignment for lip-sync or captions. Available on inworld-tts-1.5-mini and inworld-tts-2.

ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    audio: { output: { model: 'inworld-tts-2', voice: 'Olivia' } },
    providerData: {
      tts: {
        segmenter_strategy: 'sentence',
        steering_handling: 'emit_once',
        language: 'en-US',
        delivery_mode: 'CREATIVE',
        conversational: false,
        user_turn_mode: 'both',
        timestamp_type: 'WORD',
        timestamp_transport_strategy: 'SYNC'
      }
    }
  }
}));

Field	Type	Description
`segmenter_strategy`	string	How the LLM token stream is chunked before being forwarded to TTS. One of `auto`, `balanced`, `sentence`, `full_turn`, `fast_start`, `per_segment_context`. Empty string inherits the server default. Hot-swappable. See Segmenter strategies.
`steering_handling`	string	How to handle a leading `[steering]` tag captured from the LLM turn. `repeat_each_chunk` re-prepends it to every TTS request (default). `emit_once` prepends it only to the first request — recommended for `inworld-tts-2`. Hot-swappable.
`language`	string	BCP-47 tag (e.g. `"en-US"`, `"pt-BR"`) forwarded to TTS as the synthesis language. Independent from `audio.input.transcription.language` — STT and TTS can use different languages. Empty string lets the TTS backend infer. Hot-swappable.
`delivery_mode`	string	TTS-2 generation preset trading off stability vs. expressiveness. One of `STABLE`, `BALANCED`, `CREATIVE` (case-insensitive). Empty or unrecognised values are treated as unspecified. No-op for non-TTS-2 models. Hot-swappable.
`conversational`	boolean	TTS-2 only. When `true`, opens a single shared upstream TTS context for the entire WebSocket session. Locked at session open; mid-session toggles are ignored. See Conversational TTS.
`user_turn_mode`	string	Conversational-mode only. Which channels of the user turn are forwarded to TTS before each assistant generation. One of `both` (default), `audio_only`, `text_only`, or `none`. No-op outside conversational mode. Locked at session open.
`timestamp_type`	string	Opt into TTS alignment. `WORD` returns word-level timing with phoneme/viseme detail; `CHARACTER` returns per-character timing. Case-insensitive. Empty or unset = no alignment (default). Adds latency — opt in only when needed. Hot-swappable; send empty string to opt out mid-session. See TTS timestamps and alignment.
`timestamp_transport_strategy`	string	Controls how alignment data is delivered when `timestamp_type` is set. `SYNC` returns `timestamp_info` on the same `response.output_audio.delta` chunk as the audio — use for real-time lip-sync or word highlighting. `ASYNC` sends audio-only deltas first, then trailing deltas with empty `delta` and populated `timestamp_info` — lower time-to-first-audio but no strict matching. Empty = backend decides. No-op when `timestamp_type` is unset. Hot-swappable.

Conversational TTS

Setting providerData.tts.conversational = true opts TTS-2 into a multi-turn shared context: the upstream TurnContext sees every user and assistant turn for the lifetime of the WebSocket. This lets the model condition its delivery on the audio history of the conversation. The trade-off is a longer-lived state on the TTS backend and slightly higher per-turn cost. In conversational mode, segmenter_strategy is internally locked to full_turn semantics. Per-sentence and per-segment-context strategies are coerced (with a server-side WARN) because they would either fragment the upstream history or open a fresh context per segment, both of which defeat the multi-turn TurnContext.

Conversational mode is off by default. Enable it only when potentially more natural multi-turn delivery is worth the higher per-turn cost.
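Because the coercion happens on the server and surfaces only as a server-side WARN, a client may want to flag the mismatch before opening the session. The sketch below is a hypothetical pre-flight check (`conversationalWarnings` is not part of the API) that encodes only the rules stated in this section and in the field table.

```javascript
// Strategies this section says are coerced in conversational mode.
const COERCED_IN_CONVERSATIONAL = new Set(['sentence', 'per_segment_context']);

// Return warnings for a providerData.tts object before it is sent at session open.
function conversationalWarnings(tts) {
  const warnings = [];
  if (!tts.conversational) {
    if (tts.user_turn_mode !== undefined) {
      warnings.push('user_turn_mode is a no-op outside conversational mode');
    }
    return warnings;
  }
  if (COERCED_IN_CONVERSATIONAL.has(tts.segmenter_strategy)) {
    warnings.push(
      `segmenter_strategy "${tts.segmenter_strategy}" will be coerced to full_turn semantics`
    );
  }
  return warnings;
}

console.log(conversationalWarnings({ conversational: true, segmenter_strategy: 'sentence' }));
```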

Segmenter strategies

Strategy	Behaviour
`auto`	Default. `inworld-tts-2` uses sentence splits; older models use balanced splits.
`balanced`	Punctuation + conjunction splits. Tuned for `inworld-tts-1.5`.
`sentence`	Hard terminal-punctuation splits only. Tuned for `inworld-tts-2`.
`full_turn`	Buffer the entire LLM turn and emit it at turn end. Highest quality, highest latency.
`fast_start`	Strict sentence rules for the first emission, then a relaxed config (larger chunks, no idle-flush) for the rest of the turn. Optimizes time-to-first-audio.
`per_segment_context`	Each segment opens a fresh TTS context on the duplex stream. Per-segment handles are serialized so audio order is preserved.

TTS timestamps and alignment

Setting timestamp_type opts into timing data on response.output_audio.delta events. This is useful for lip-sync animation (viseme blending), word-level highlighting, or karaoke-style captions.

Choosing sync vs async

Strategy	Behaviour	Use when
`SYNC`	`timestamp_info` arrives on the same `response.output_audio.delta` as the audio bytes.	Real-time lip-sync or word highlighting — you need the timing before playback of that chunk.
`ASYNC`	Audio-only deltas stream first; alignment arrives in trailing deltas with an empty `delta` field and populated `timestamp_info`.	Low-latency playback — you don’t need timing until after the audio has played, or you post-process alignment offline.
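In practice the two strategies produce up to three kinds of delta that a client has to route differently. A minimal sketch, assuming the event fields described in this section; `classifyAudioDelta` is a hypothetical helper name:

```javascript
// Decide what a response.output_audio.delta event carries.
function classifyAudioDelta(event) {
  const hasAudio = typeof event.delta === 'string' && event.delta.length > 0;
  const hasTiming = event.timestamp_info != null;
  if (hasAudio && hasTiming) return 'audio+timing'; // typical SYNC chunk
  if (hasAudio) return 'audio';   // ASYNC audio-only delta, or alignment disabled
  if (hasTiming) return 'timing'; // ASYNC trailing delta with empty audio
  return 'empty';
}

console.log(classifyAudioDelta({ delta: 'AAAA', timestamp_info: { word_alignment: {} } }));
console.log(classifyAudioDelta({ delta: '', timestamp_info: { word_alignment: {} } }));
```

A player would typically enqueue audio for the first two kinds and forward the timing payload to the caption or lip-sync layer for the first and third.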

Output shape — `response.output_audio.delta`

When timestamps are enabled, the response.output_audio.delta event carries an optional timestamp_info field. Exactly one of word_alignment or character_alignment is populated, matching the requested timestamp_type.

{
  "type": "response.output_audio.delta",
  "event_id": "evt_abc123",
  "response_id": "resp_001",
  "item_id": "item_audio_01",
  "output_index": 0,
  "content_index": 0,
  "delta": "<base64 audio or empty for ASYNC trailing>",
  "timestamp_info": {
    "word_alignment": {
      "words": ["Hello", ",", " ", "world"],
      "word_start_time_seconds": [0.0, 0.32, 0.35, 0.38],
      "word_end_time_seconds": [0.32, 0.35, 0.38, 0.72],
      "phonetic_details": [
        {
          "word_index": 0,
          "phones": [
            { "phone_symbol": "HH", "start_time_seconds": 0.0, "duration_seconds": 0.08, "viseme_symbol": "chjsh" },
            { "phone_symbol": "AH", "start_time_seconds": 0.08, "duration_seconds": 0.10, "viseme_symbol": "aei" },
            { "phone_symbol": "L",  "start_time_seconds": 0.18, "duration_seconds": 0.06, "viseme_symbol": "l" },
            { "phone_symbol": "OW", "start_time_seconds": 0.24, "duration_seconds": 0.08, "viseme_symbol": "o" }
          ],
          "is_partial": false
        }
      ]
    }
  }
}

word_alignment (when timestamp_type = "WORD"):

Field	Type	Description
`words`	string[]	Tokens in the original text — words, punctuation, and whitespace — in order.
`word_start_time_seconds`	number[]	Start time of each token, relative to the beginning of the synthesized stream.
`word_end_time_seconds`	number[]	End time of each token.
`phonetic_details`	object[]	Per-word phoneme timing and viseme symbols (TTS 1.5 and TTS-2 only).

Each entry in phonetic_details:

Field	Type	Description
`word_index`	integer	Index into `words[]` this detail covers.
`phones`	object[]	Phoneme spans with `phone_symbol`, `start_time_seconds`, `duration_seconds`, and `viseme_symbol`.
`is_partial`	boolean	`true` when this is a partial update (SYNC mid-word boundary); `false` once the word is fully synthesized.

character_alignment (when timestamp_type = "CHARACTER"):

Field	Type	Description
`characters`	string[]	Individual characters/punctuation, in order.
`character_start_time_seconds`	number[]	Start time of each character.
`character_end_time_seconds`	number[]	End time of each character.

For the full viseme symbol table and per-language timestamp support, see TTS timestamps.
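For captions or word highlighting, the parallel arrays in `word_alignment` are usually zipped into per-word cues. A sketch of that step, using the sample values from the payload above; `wordCues` is a hypothetical helper, and dropping whitespace-only tokens is a design choice rather than anything the API requires:

```javascript
// Turn word_alignment's parallel arrays into [{ text, start, end }] cues.
function wordCues(timestampInfo) {
  const wa = timestampInfo?.word_alignment;
  if (!wa) return []; // CHARACTER mode or no alignment on this delta
  const cues = [];
  wa.words.forEach((text, i) => {
    if (text.trim() === '') return; // skip whitespace-only tokens
    cues.push({
      text,
      start: wa.word_start_time_seconds[i],
      end: wa.word_end_time_seconds[i]
    });
  });
  return cues;
}

const info = {
  word_alignment: {
    words: ['Hello', ',', ' ', 'world'],
    word_start_time_seconds: [0.0, 0.32, 0.35, 0.38],
    word_end_time_seconds: [0.32, 0.35, 0.38, 0.72]
  }
};
console.log(wordCues(info).map((c) => c.text).join('|')); // Hello|,|world
```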

WebRTC

Over WebRTC, audio travels on the RTP media track (not as base64). Alignment data is delivered on the data channel in the same response.output_audio.delta event shape, but the delta field is always an empty string (the audio is already on the media track).

Memory (`providerData.memory`)

Automatic conversation memory and summarization. When enabled, the server periodically asks the LLM to extract durable facts and a rolling summary, prepends them to the system prompt, and trims the transcript so context stays bounded.

ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    providerData: {
      memory: {
        enabled: true,
        turn_interval: 5,
        max_memory_length: 2000,
        max_transcript_items: 40,
        max_facts: 50,
        trim_after_summarize: true
      }
    }
  }
}));

Field	Type	Default	Description
`enabled`	boolean	`false`	Enable automatic memory generation.
`turn_interval`	integer	`5`	Generate memory every N completed turns.
`max_memory_length`	integer	`2000`	Maximum character length for the rolling summary.
`max_transcript_items`	integer	`40`	Maximum conversation items to keep after trimming.
`max_facts`	integer	`50`	Maximum facts retained in `state.facts`.
`trim_after_summarize`	boolean	`true`	Remove old transcript items after summarization.

After each generation cycle the server populates providerData.memory.state (read-only) and emits a session.updated event so clients can observe the rolling summary, fact list, and bookkeeping counters.
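A client can observe these refreshes by watching for `session.updated`. The sketch below assumes the event carries the session object under `session`, as in the OpenAI Realtime spec, and that `state.facts` is an array; only the `facts` key is named in this reference, so other keys of `state` are left untouched. `memoryStateFrom` and `onServerEvent` are hypothetical names.

```javascript
// Pull the read-only memory state out of a session.updated event, or return null.
function memoryStateFrom(event) {
  if (event.type !== 'session.updated') return null;
  return event.session?.providerData?.memory?.state ?? null;
}

// Example handler: log how many facts the server currently retains.
function onServerEvent(event) {
  const state = memoryStateFrom(event);
  if (!state) return;
  const facts = Array.isArray(state.facts) ? state.facts : []; // array shape is an assumption
  console.log(`memory refreshed: ${facts.length} fact(s)`);
}

// Usage with the browser WebSocket API:
// ws.addEventListener('message', (m) => onServerEvent(JSON.parse(m.data)));
```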

Back-channel (`providerData.backchannel`)

Short audio interjections — "uh-huh", "right", "I see" — emitted while the user is still speaking. Opt-in per session and gated by server prerequisites; contact your account team to confirm prerequisites for your deployment. For event handling (the response.backchannel.audio.delta / .done / .skipped events), client integration tips, and tuning guidance, see the dedicated Back-channel guide.

ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    providerData: {
      backchannel: {
        enabled: true,
        eval_interval_ms: 800,
        min_speech_ms: 800,
        min_gap_ms: 4000,
        max_per_turn: 3,
        hard_deadline_ms: 1500,
        history_tail_items: 4,
        temperature: 0.7,
        max_tokens: 6,
        volume_gain: 0.6,
        require_pause: false,
        decider_kind: 'llm'
      }
    }
  }
}));

Field	Type	Default	Description
`enabled`	boolean	`false`	Per-session opt-in. Sessions that don’t send this field never receive back-channels.
`small_model`	string	server default	Override the decider LLM model identifier. Empty string inherits the default.
`eval_interval_ms`	integer	`800`	How often the manager evaluates eligibility while the user is producing partial transcripts.
`min_speech_ms`	integer	`800`	Minimum time after speech onset before any back-channel can fire.
`min_gap_ms`	integer	`4000`	Minimum spacing between two back-channels in the same user turn.
`max_per_turn`	integer	`3`	Cap on back-channels emitted within a single user turn.
`hard_deadline_ms`	integer	`1500`	Combined small-LLM + TTS deadline per attempt. Misses are dropped.
`history_tail_items`	integer	`4`	Recent conversation items the small LLM sees as context.
`temperature`	number	`0.7`	Sampling temperature for the small LLM.
`max_tokens`	integer	`6`	Max tokens for the small LLM’s reply.
`volume_gain`	number	`0.6`	Linear gain multiplier applied to synthesized back-channel audio. `0.0` mutes; `1.0` keeps the synthesized volume; >1.0 amplifies.
`require_pause`	boolean	`false`	When `true`, only fire after a smart-turn pause signal (`input_audio_buffer.turn_suggestion`).
`allowed_phrases`	string[]	server default	Restrict the phrase bank. `null` / omitted inherits the default; an explicit empty array disables back-channel for the session; a populated array replaces the bank.
`prompt_template`	string	server default	Override the decider prompt. Supports Go `text/template` tokens `{{.PhrasesList}}`, `{{.History}}`, `{{.Partial}}`.
`decider_kind`	`"llm"` | `"rule"`	`"llm"`	`llm` uses a small LLM. `rule` picks phrases from the bank with per-tick probability `rule_fire_probability`.
`rule_fire_probability`	number	`1.0`	Per-tick fire probability for the rule decider (`0.0`–`1.0`). Ignored when `decider_kind != "rule"`.

Sending providerData.backchannel: {} (empty object) clears all overrides; the server falls back to its compiled-in defaults.
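The three pacing fields interact: `min_speech_ms` delays the first back-channel, `min_gap_ms` spaces the rest, and `max_per_turn` caps the total. As a rough illustration, the function below computes an upper bound on back-channels for a user turn of a given length. It is a back-of-the-envelope model (`maxBackchannels` is a hypothetical name) that ignores `eval_interval_ms` alignment, missed deadlines, and decider outcomes, so real counts can only be lower.

```javascript
// Upper bound on back-channels in one user turn, using the documented defaults.
function maxBackchannels(turnMs, cfg = {}) {
  const { min_speech_ms = 800, min_gap_ms = 4000, max_per_turn = 3 } = cfg;
  if (turnMs < min_speech_ms) return 0; // too short for any back-channel
  // One slot at min_speech_ms, then one more per full min_gap_ms after it.
  const slots = Math.floor((turnMs - min_speech_ms) / min_gap_ms) + 1;
  return Math.min(slots, max_per_turn);
}

console.log(maxBackchannels(500));   // 0
console.log(maxBackchannels(3000));  // 1
console.log(maxBackchannels(10000)); // 3
console.log(maxBackchannels(60000)); // 3 (capped by max_per_turn)
```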

Responsiveness (`providerData.responsiveness`)

Short filler audio ("let me think", "one moment") spoken after the user’s turn ends if the main LLM is slow to produce its first delta. Opt-in per session and gated by two server prerequisites (a small filler model and an Unleash flag); contact your account team to confirm both are in place. For how the filler races the main LLM, TTS pipeline details, and tuning guidance, see the dedicated Responsiveness guide.

ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    providerData: {
      responsiveness: {
        enabled: true,
        initial_wait_timeout_ms: 1200,
        hard_deadline_ms: 2000,
        history_tail_items: 4,
        temperature: 0.7,
        max_tokens: 12,
        min_filler_gap_ms: 8000,
        max_initial_per_turn: 1,
        enable_filler_on_first_assistant_reply: false,
        pause_text: ''
      }
    }
  }
}));

Field	Type	Default	Description
`enabled`	boolean	`false`	Per-session opt-in. A session that does not send this object never gets a filler.
`small_model`	string	server default	Override the filler LLM model identifier.
`initial_wait_timeout_ms`	integer	server default	The timeout T: how long to wait for the main LLM’s first delta before committing to the filler. Lower values fire fillers more aggressively.
`hard_deadline_ms`	integer	server default	Caps the small / filler LLM’s total streaming time so a slow filler model can’t become a latency tax.
`history_tail_items`	integer	server default	Recent conversation items the small LLM sees as context.
`temperature`	number	server default	Sampling temperature for the small LLM.
`max_tokens`	integer	server default	Caps the small LLM’s response length. Keep small — fillers should be brief.
`min_filler_gap_ms`	integer	server default	Minimum gap between any two fillers within a single user-turn chain.
`max_initial_per_turn`	integer	`1`	Caps initial fillers per user turn.
`max_buffer_deltas`	integer	server default	Bounds the in-memory buffer of main-LLM deltas held while the filler is being spoken.
`enable_filler_on_first_assistant_reply`	boolean	`false`	Allows responsiveness fillers on the very first assistant response in a session.
`prompt_template`	string	server default	Overrides the system prompt fed to the small filler LLM. Append a language directive here for multilingual sessions.
`pause_text`	string	server default	TTS-only hint injected between the filler and the main answer (e.g. a brief connector word). Empty string disables injection.
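As a mental model for tuning, the core decision can be reduced to comparing the main LLM's first-delta latency with `initial_wait_timeout_ms`. The sketch below is a deliberately simplified client-side model, not the server's logic: `expectFiller` and the `turn` object are hypothetical, and pacing limits such as `min_filler_gap_ms` and `max_initial_per_turn` are ignored.

```javascript
// Simplified model: would a filler be expected for this turn?
function expectFiller(cfg, turn) {
  if (!cfg.enabled) return false;
  if (turn.isFirstAssistantReply && !cfg.enable_filler_on_first_assistant_reply) {
    return false; // first reply is excluded unless explicitly enabled
  }
  // The filler commits only if the main LLM has not produced a delta within T.
  return turn.firstDeltaLatencyMs > cfg.initial_wait_timeout_ms;
}

const cfg = { enabled: true, initial_wait_timeout_ms: 1200 };
console.log(expectFiller(cfg, { firstDeltaLatencyMs: 1500, isFirstAssistantReply: false })); // true
console.log(expectFiller(cfg, { firstDeltaLatencyMs: 600, isFirstAssistantReply: false }));  // false
```

Feeding measured first-delta latencies through a model like this can help estimate how often a candidate timeout would trigger fillers before changing it in production.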

Text generation config (`text_generation_config`)

Fine-grained LLM generation parameters sent as a top-level field on the session object (alongside model, temperature, providerData, etc.). The same object is also accepted under providerData.text_generation_config for compatibility — both paths are merged into the same state.

ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    text_generation_config: {
      reasoning: {
        effort: 'HIGH',
        maxTokens: 1024,
        exclude: false
      }
    }
  }
}));

Reasoning

Controls chain-of-thought reasoning on models that support it. The server forwards this as extra_body.reasoning to the LLM Router.

Field	Type	Description
`effort`	string	Reasoning depth. One of `NONE`, `MINIMAL`, `LOW`, `MEDIUM`, `HIGH`, `XHIGH` (case-sensitive on the wire). `NONE` disables reasoning entirely; higher values allocate more thinking tokens.
`maxTokens`	integer	Cap on reasoning/thinking tokens the model may emit.
`exclude`	boolean	When `true`, reasoning tokens are generated but excluded from the response text — useful for latency-sensitive paths where you still want reasoning to influence quality.

Parameter support varies by model. Some models do not support reasoning at all, while others support only a subset of effort levels (e.g. gemini-3.1-pro does not support MINIMAL). If the upstream model rejects the requested effort, the LLM Router returns a 400 error.

When reasoning is omitted entirely, the server uses the model’s default reasoning behaviour. For reasoning-capable models this default may not be NONE — meaning reasoning tokens (and their latency) are added implicitly. If you need minimal latency on a reasoning-capable model, explicitly set effort: "NONE" to disable reasoning. Reasoning token usage is reported in response.done under usage.output_token_details.reasoning_tokens.
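To see whether implicit reasoning is costing tokens, read the counter from `response.done`. A minimal sketch: `reasoningUsage` is a hypothetical helper, and the usage object is looked up under `response.usage` (as in the OpenAI Realtime spec) with a fallback to a top-level `usage`, since this page does not spell out the nesting. The `output_tokens` field name is likewise taken from the OpenAI spec.

```javascript
// Summarize reasoning-token usage from a response.done event, or return null.
function reasoningUsage(event) {
  if (event.type !== 'response.done') return null;
  const usage = event.response?.usage ?? event.usage;
  if (!usage) return null;
  const reasoning = usage.output_token_details?.reasoning_tokens ?? 0;
  const output = usage.output_tokens ?? 0;
  return { reasoning, output, share: output > 0 ? reasoning / output : 0 };
}

const done = {
  type: 'response.done',
  response: { usage: { output_tokens: 400, output_token_details: { reasoning_tokens: 100 } } }
};
console.log(reasoningUsage(done)); // share is 0.25 for this sample
```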

Other fields

Field	Type	Description
`maxNewTokens`	integer	Max completion tokens. Equivalent to `max_output_tokens` on the session.
`temperature`	number	Sampling temperature override (takes precedence over session-level `temperature`).
`topP`	number	Nucleus sampling.
`frequencyPenalty`	number	Frequency penalty.
`presencePenalty`	number	Presence penalty.
`repetitionPenalty`	number	Repetition penalty (model-specific).
`stopSequences`	string[]	Custom stop sequences.
`seed`	integer	Deterministic sampling seed (model-specific).
`logitBias`	object[]	Per-token likelihood adjustments (`{ tokenId, biasValue }`).

All fields are optional and hot-swappable.

Prompt caching (`providerData.caching`)

Opt a session into explicit prompt caching for the stable, every-turn-resent blocks — the system instructions and the tool definitions. When enabled, the server attaches an ephemeral cache_control breakpoint to those blocks so providers that support explicit caching (Anthropic, Google) can serve them from cache, cutting input-token cost on every turn after the first. Providers with implicit caching (OpenAI, Gemini 2.5, DeepSeek) cache automatically and ignore the breakpoint, so leaving this off does not disable their caching.

ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    providerData: {
      caching: {
        enabled: true,
        ttl: '5m',
        cache_instructions: true,
        cache_tools: true
      }
    }
  }
}));

Field	Type	Default	Description
`enabled`	boolean	`false`	Master switch. When off (or absent), no `cache_control` breakpoints are sent.
`ttl`	string	provider default	Cache lifetime for the breakpoint, e.g. `"5m"`, `"1h"`. Empty string lets the provider default apply.
`cache_instructions`	boolean	`true` when enabled	Attach a breakpoint to the system instructions block.
`cache_tools`	boolean	`true` when enabled	Attach a breakpoint to the last tool definition (caches the whole tool block).

Explicit caching only pays off for large blocks — providers apply a minimum cacheable size (around 1024 tokens). Below that, Anthropic silently skips the cache, while Google may reject the request. Enable caching only when the instructions and/or tool definitions are substantial. For the cache_control protocol, TTL prolongation, and the full list of supported providers, see Prompt caching.

Cache usage is reported on response.done under usage.input_token_details: cached_tokens (tokens served from a cache hit) and cache_write_tokens (tokens written when establishing a new cache entry). All fields are hot-swappable.
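These two counters make it possible to check whether caching is actually paying off. The sketch below computes a per-response hit ratio. `cacheStats` is a hypothetical helper; the `input_tokens` field name and the `response.usage` nesting are assumptions taken from the OpenAI Realtime spec, with a fallback to a top-level `usage`.

```javascript
// Summarize prompt-cache usage from a response.done event, or return null.
function cacheStats(event) {
  const usage = event.response?.usage ?? event.usage;
  const details = usage?.input_token_details;
  if (!details) return null;
  const cached = details.cached_tokens ?? 0;
  const written = details.cache_write_tokens ?? 0;
  const input = usage.input_tokens ?? 0;
  return { cached, written, hitRatio: input > 0 ? cached / input : 0 };
}

const doneEvent = {
  type: 'response.done',
  response: {
    usage: {
      input_tokens: 2000,
      input_token_details: { cached_tokens: 1500, cache_write_tokens: 0 }
    }
  }
};
console.log(cacheStats(doneEvent)); // hitRatio is 0.75 for this sample
```

A first turn typically shows a nonzero `written` with little or nothing in `cached`; later turns should show the reverse if the breakpoint is being honored.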

Session metadata

Two optional fields sit alongside the five branches at the top of providerData. They don’t configure STT, TTS, or memory — they tag the session so it can be traced, correlated, and routed downstream.

ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    providerData: {
      user_id: 'user_abc123',
      metadata: {
        tenant: 'acme-corp',
        experiment: 'voice-preset-A'
      }
    }
  }
}));

Field	Type	Description
`user_id`	string	Stable per-user identifier surfaced in tracing, logs, and downstream service requests. Useful for cross-session memory keying and incident debugging.
`metadata`	object (string → string)	Arbitrary key-value pairs forwarded to the LLM router as `extra_body.metadata`. Use for downstream-routing hints or A/B-test bucketing. This is separate from `response.metadata`, which echoes client metadata supplied on `response.create`.

Both fields are optional and hot-swappable.

Hot-swap reference

Most providerData fields take effect on the next audio chunk or turn after the session.update is acknowledged. The exceptions — locked once at session open and ignored afterwards — are:

providerData.tts.conversational
providerData.tts.user_turn_mode

If you need to change either of these, open a new WebSocket session.
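A client can guard against silently ignored updates by checking a patch for locked fields before sending it mid-session. A small sketch; `lockedFieldsIn` is a hypothetical helper that encodes only the two exceptions listed above:

```javascript
const LOCKED_TTS_FIELDS = ['conversational', 'user_turn_mode'];

// Return the locked field paths present in a providerData patch.
function lockedFieldsIn(patch) {
  const tts = patch.tts || {};
  return LOCKED_TTS_FIELDS
    .filter((field) => field in tts)
    .map((field) => `providerData.tts.${field}`);
}

const midSessionPatch = { tts: { language: 'pt-BR', conversational: true } };
const locked = lockedFieldsIn(midSessionPatch);
if (locked.length > 0) {
  console.warn(`Ignored mid-session, open a new session instead: ${locked.join(', ')}`);
}
```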
