# Inworld Realtime API Extensions

> The Inworld Realtime API is wire-compatible with the OpenAI Realtime spec, and adds extensions — STT tuning, TTS segmentation, automatic memory, back-channel, and responsiveness fillers — through a single providerData field on the session object.

The Inworld Realtime API is **wire-compatible with the OpenAI Realtime spec** — clients written against OpenAI's session, audio, and response events work against Inworld unchanged. On top of that baseline, Inworld layers production-grade extensions that improve quality, latency, and conversational naturalness:

* **STT tuning** — voice profile signals, language hints, explicit end-of-turn and VAD overrides
* **TTS segmentation, steering, and alignment** — choose how the LLM token stream is chunked into TTS calls, set the synthesis language and the TTS-2 delivery preset, enable a shared multi-turn context (TTS-2 only), and opt into word- or character-level timestamp alignment for lip-sync or captions
* **Automatic conversation memory** — periodic summarization and fact extraction that keep long sessions inside the context window
* **Back-channel** — short interjections (`"uh-huh"`, `"I see"`) emitted *while the user is still speaking*, so the agent feels like an active listener
* **Responsiveness fillers** — short filler audio (`"let me think"`) spoken in the gap *after a user turn* if the main LLM is slow to produce its first delta

Everything Inworld adds beyond the OpenAI spec is exposed through a **single field** on the session object: **`providerData`**. Send it inside any `session.update` and the server merges it with current state. Most fields hot-swap mid-session and take effect on the next audio chunk or turn; the locked-at-session-open exceptions are called out in the [hot-swap reference](#hot-swap-reference) at the bottom of this page.

This page is the field-by-field reference for the full `providerData` surface. For task-driven walkthroughs (language switching, conversation management, etc.) and for the event-handling client code that pairs with back-channel and responsiveness, see the linked guides under each branch.

## Branch overview

`providerData` is a single object with five top-level branches. Each branch is independent, so send only the ones you want to configure.

| Branch                                                          | Purpose                                                                                  | Hot-swap                                                                  |
| --------------------------------------------------------------- | ---------------------------------------------------------------------------------------- | ------------------------------------------------------------------------- |
| [`stt`](#stt-providerdata-stt)                                  | STT tuning (voice profile, language hints, end-of-turn thresholds, VAD overrides)        | Yes — STT stream is restarted on the next audio chunk                     |
| [`tts`](#tts-providerdata-tts)                                  | TTS segmentation, language, delivery preset, conversational context, timestamp alignment | Mostly — `conversational` and `user_turn_mode` are locked at session open |
| [`memory`](#memory-providerdata-memory)                         | Automatic conversation memory and summarization                                          | Yes                                                                       |
| [`backchannel`](#back-channel-providerdata-backchannel)         | Short interjections while the user is speaking                                           | Yes                                                                       |
| [`responsiveness`](#responsiveness-providerdata-responsiveness) | Filler audio while the main LLM warms up after a user turn                               | Yes                                                                       |

```javascript theme={"system"}
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    providerData: {
      stt: { /* ... */ },
      tts: { /* ... */ },
      memory: { /* ... */ },
      backchannel: { /* ... */ },
      responsiveness: { /* ... */ }
    }
  }
}));
```

Partial updates are supported on every branch — omit a field to keep its current value.
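
One way to keep partial updates tidy is to build the event from only the branches you intend to change. The sketch below is a hypothetical client-side helper (`buildSessionUpdate` and `PROVIDER_DATA_BRANCHES` are illustrative names, not part of any SDK). It drops branches left `undefined` and rejects names outside the five branches listed above.

```javascript
// Hypothetical helper, not part of an Inworld SDK.
const PROVIDER_DATA_BRANCHES = ['stt', 'tts', 'memory', 'backchannel', 'responsiveness'];

function buildSessionUpdate(branches) {
  const providerData = {};
  for (const [name, value] of Object.entries(branches)) {
    if (!PROVIDER_DATA_BRANCHES.includes(name)) {
      throw new Error(`Unknown providerData branch: ${name}`);
    }
    // Skip branches the caller left undefined so they are not sent at all.
    if (value !== undefined) providerData[name] = value;
  }
  return { type: 'session.update', session: { providerData } };
}

// Change only the memory interval; omitted fields keep their current values.
const event = buildSessionUpdate({ memory: { turn_interval: 3 } });
// ws.send(JSON.stringify(event));
```

Note that an empty branch object is not the same as an omitted branch, so the helper only filters `undefined` and passes `{}` through untouched.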

`providerData` also accepts two top-level metadata fields alongside the branches: `user_id` and `metadata`. They aren't configuration branches; they tag the session for tracing and downstream routing. See [Session metadata](#session-metadata) below.

## STT (`providerData.stt`)

Inworld extensions to the OpenAI-standard STT config. Every field here is hot-swappable; the STT stream is restarted automatically so the next chunk of audio uses the new value.

```javascript theme={"system"}
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    audio: {
      input: { transcription: { model: 'inworld/inworld-stt-1' } }
    },
    providerData: {
      stt: {
        prompt: 'Medical dictation. Vocabulary: angioplasty.',
        voice_profile: true,
        language_hints: ['en-US', 'es-MX'], // Soniox-specific; ignored by other STT models
        end_of_turn_confidence_threshold: 0.7,
        min_end_of_turn_silence: 200,
        max_turn_silence: 5000,
        vad_threshold: 0.5
      }
    }
  }
}));
```

| Field                              | Type      | Description                                                                                                                                                                                                                        |
| ---------------------------------- | --------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `prompt`                           | string    | Transcription guidance (vocabulary hints, domain context, formatting preferences). Equivalent to `audio.input.transcription.prompt`.                                                                                               |
| `voice_profile`                    | boolean   | When `true`, attach voice-profile signals (age, gender, emotion, vocal style, accent) to transcription events under `providerData.voiceProfile`. See [Voice profile payload](#voice-profile-payload) below for the returned shape. |
| `language_hints`                   | string\[] | BCP-47-style language hints that bias recognition without committing to a single language. Soniox-specific (`soniox/stt-rt-v4`); ignored by other models. |
| `end_of_turn_confidence_threshold` | number    | STT end-of-turn confidence cutoff (`0.0`–`1.0`). Explicit override of the `semantic_vad.eagerness` mapping.                                                                                                                        |
| `vad_threshold`                    | number    | Speech/silence VAD cutoff (`0.0`–`1.0`). Explicit override of the eagerness mapping.                                                                                                                                               |
| `min_end_of_turn_silence`          | integer   | Minimum trailing silence (ms) before STT considers a turn finished. Explicit override of the eagerness mapping.                                                                                                                    |
| `max_turn_silence`                 | integer   | Hard ceiling (ms) on within-turn silence before STT force-closes the turn. Explicit override of the eagerness mapping.                                                                                                             |

For the eagerness preset that these fields override, see [`semantic_vad`](/realtime/usage/using-realtime-models#semantic-vad).
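
Because these overrides bypass the eagerness preset, it can help to sanity-check them on the client before sending. The following is a minimal sketch: `validateSttOverrides` is a hypothetical helper, the ranges and integer types come from the table above, and the cross-field check is an assumption that this page does not state.

```javascript
// Hypothetical client-side check; ranges and types come from the STT field table.
function validateSttOverrides(stt) {
  const errors = [];
  for (const key of ['end_of_turn_confidence_threshold', 'vad_threshold']) {
    const v = stt[key];
    if (v !== undefined && !(typeof v === 'number' && v >= 0 && v <= 1)) {
      errors.push(`${key} must be a number in 0.0-1.0`);
    }
  }
  for (const key of ['min_end_of_turn_silence', 'max_turn_silence']) {
    const v = stt[key];
    if (v !== undefined && !(Number.isInteger(v) && v >= 0)) {
      errors.push(`${key} must be a non-negative integer (ms)`);
    }
  }
  // Assumption: a minimum silence above the hard ceiling is a mistake.
  // The reference does not document how the server treats this case.
  if (stt.min_end_of_turn_silence > stt.max_turn_silence) {
    errors.push('min_end_of_turn_silence exceeds max_turn_silence');
  }
  return errors;
}
```

An empty array means the overrides passed every check; omitted fields are skipped so partial updates validate cleanly.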

### Voice profile payload

When `providerData.stt.voice_profile` is `true`, every `conversation.item.input_audio_transcription.delta` and `conversation.item.input_audio_transcription.completed` event carries a `providerData.voiceProfile` object alongside the transcript text:

```json theme={"system"}
{
  "type": "conversation.item.input_audio_transcription.completed",
  "event_id": "evt_5f7d2",
  "item_id": "item_aud_01HF…",
  "content_index": 0,
  "transcript": "Hello, how are you?",
  "providerData": {
    "voiceProfile": {
      "age":         [{ "label": "adult",          "confidence": 0.78 }],
      "gender":      [{ "label": "female",         "confidence": 0.91 }],
      "emotion":     [{ "label": "neutral",        "confidence": 0.65 }],
      "vocal_style": [{ "label": "conversational", "confidence": 0.82 }],
      "accent":      [{ "label": "en-US",          "confidence": 0.88 }]
    }
  }
}
```

Each top-level key is an array of `{ label, confidence }` objects sorted by descending confidence. Keys are omitted when the STT backend does not produce labels for that category, so always null-check before reading. Confidence values are in `[0.0, 1.0]`.

| Category      | Notes                                                                                         |
| ------------- | --------------------------------------------------------------------------------------------- |
| `age`         | Estimated age band of the speaker.                                                            |
| `gender`      | Estimated gender of the speaker.                                                              |
| `emotion`     | Detected emotional tone in the current segment. Can shift across deltas within a single turn. |
| `vocal_style` | Speaking style (e.g. `conversational`, `narration`, `whisper`, `monotone`).                   |
| `accent`      | Regional accent or dialect as a BCP-47-like locale code (e.g. `en-US`, `en-GB`).              |

Voice profile is computed by the realtime service regardless of the STT backend, so `voice_profile: true` works across all supported STT models.
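
Since categories can be missing, reads of the payload are safest behind a guard. The sketch below shows one way to pull the top label for a category from a transcription event. `topVoiceLabel` and its `minConfidence` parameter are hypothetical names introduced for illustration; it relies only on the documented shape (arrays sorted by descending confidence).

```javascript
// Returns the highest-confidence label for a category, or null when the
// category is absent or falls below the caller's confidence floor.
function topVoiceLabel(event, category, minConfidence = 0) {
  const entries = event?.providerData?.voiceProfile?.[category];
  if (!Array.isArray(entries) || entries.length === 0) return null;
  const best = entries[0]; // arrays arrive sorted by descending confidence
  return best.confidence >= minConfidence ? best.label : null;
}

// Example with a trimmed-down event in the documented shape.
const sampleEvent = {
  type: 'conversation.item.input_audio_transcription.completed',
  transcript: 'Hello, how are you?',
  providerData: {
    voiceProfile: {
      emotion: [{ label: 'neutral', confidence: 0.65 }],
      gender: [{ label: 'female', confidence: 0.91 }]
    }
  }
};
const emotion = topVoiceLabel(sampleEvent, 'emotion');     // 'neutral'
const accent = topVoiceLabel(sampleEvent, 'accent');       // null, key omitted
```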

## TTS (`providerData.tts`)

Controls how the LLM text stream is segmented and forwarded to the TTS backend, the language and delivery preset used for synthesis, (for TTS-2) whether a shared upstream context is preserved across turns, and opt-in timestamp alignment for lip-sync or captions. Available on `inworld-tts-1.5-mini` and `inworld-tts-2`.

```javascript theme={"system"}
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    audio: { output: { model: 'inworld-tts-2', voice: 'Olivia' } },
    providerData: {
      tts: {
        segmenter_strategy: 'sentence',
        steering_handling: 'emit_once',
        language: 'en-US',
        delivery_mode: 'CREATIVE',
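        conversational: false,   // TTS-2 only; locked at session open
        user_turn_mode: 'both',  // no-op unless conversational is true; locked at session open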
        timestamp_type: 'WORD',
        timestamp_transport_strategy: 'SYNC'
      }
    }
  }
}));
```

| Field                          | Type    | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
| ------------------------------ | ------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `segmenter_strategy`           | string  | How the LLM token stream is chunked before being forwarded to TTS. One of `auto`, `balanced`, `sentence`, `full_turn`, `fast_start`, `per_segment_context`. Empty string inherits the server default. Hot-swappable. See [Segmenter strategies](#segmenter-strategies).                                                                                                                                                                                                |
| `steering_handling`            | string  | How to handle a leading `[steering] ` tag captured from the LLM turn. `repeat_each_chunk` re-prepends it to every TTS request (default). `emit_once` prepends it only to the first request — recommended for `inworld-tts-2`. Hot-swappable.                                                                                                                                                                                                                           |
| `language`                     | string  | BCP-47 tag (e.g. `"en-US"`, `"pt-BR"`) forwarded to TTS as the synthesis language. Independent from `audio.input.transcription.language` — STT and TTS can use different languages. Empty string lets the TTS backend infer. Hot-swappable.                                                                                                                                                                                                                            |
| `delivery_mode`                | string  | TTS-2 generation preset trading off stability vs. expressiveness. One of `STABLE`, `BALANCED`, `CREATIVE` (case-insensitive). Empty or unrecognised values are treated as unspecified. No-op for non-TTS-2 models. Hot-swappable.                                                                                                                                                                                                                                      |
| `conversational`               | boolean | TTS-2 only. When `true`, opens a single shared upstream TTS context for the entire WebSocket session. **Locked at session open**; mid-session toggles are ignored. See [Conversational TTS](#conversational-tts).                                                                                                                                                                                                                                                      |
| `user_turn_mode`               | string  | Conversational-mode only. Which channels of the user turn are forwarded to TTS before each assistant generation. One of `both` (default), `audio_only`, `text_only`, or `none`. No-op outside conversational mode. **Locked at session open**.                                                                                                                                                                                                                         |
| `timestamp_type`               | string  | Opt into TTS alignment. `WORD` returns word-level timing with phoneme/viseme detail; `CHARACTER` returns per-character timing. Case-insensitive. Empty or unset = no alignment (default). Adds latency — opt in only when needed. Hot-swappable; send empty string to opt out mid-session. See [TTS timestamps and alignment](#tts-timestamps-and-alignment).                                                                                                          |
| `timestamp_transport_strategy` | string  | Controls how alignment data is delivered when `timestamp_type` is set. `SYNC` returns `timestamp_info` on the same `response.output_audio.delta` chunk as the audio; use it for real-time lip-sync or word highlighting. `ASYNC` sends audio-only deltas first, then trailing deltas with an empty `delta` and populated `timestamp_info`; this lowers time-to-first-audio, but alignment does not arrive with the audio chunk it describes. Empty = backend decides. No-op when `timestamp_type` is unset. Hot-swappable. |

### Conversational TTS

Setting `providerData.tts.conversational = true` opts TTS-2 into a multi-turn shared context: the upstream TurnContext sees every user and assistant turn for the lifetime of the WebSocket. This lets the model condition its delivery on the audio history of the conversation. The trade-off is a longer-lived state on the TTS backend and slightly higher per-turn cost.

In conversational mode, `segmenter_strategy` is internally locked to `full_turn` semantics. Per-sentence and per-segment-context strategies are coerced to `full_turn` (the server logs a warning) because they would either fragment the upstream history or open a fresh context per segment, both of which defeat the multi-turn TurnContext.

<Note>
  With `conversational: true`, TTS conditions each response on the audio of previous turns — higher per-turn cost in exchange for potentially more natural output. Off by default.
</Note>
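
Because `conversational` and `user_turn_mode` are ignored after session open, a client can filter them out of later updates and flag strategies that will be coerced. The sketch below is illustrative only: `planTtsUpdate`, `atSessionOpen`, and the warning strings are hypothetical, and it assumes the caller knows which update opens the session.

```javascript
// Fields this page marks as locked at session open.
const LOCKED_TTS_FIELDS = ['conversational', 'user_turn_mode'];
// Strategies this page says are coerced in conversational mode.
const COERCED_STRATEGIES = ['sentence', 'per_segment_context'];

// Hypothetical helper. `atSessionOpen` is true for the update that opens the
// session; `conversational` is the value the session was opened with.
function planTtsUpdate(requested, { atSessionOpen, conversational = false }) {
  const update = { ...requested };
  const warnings = [];
  if (!atSessionOpen) {
    for (const field of LOCKED_TTS_FIELDS) {
      if (field in update) {
        delete update[field];
        warnings.push(`${field} is locked at session open; dropped from update`);
      }
    }
  }
  const inConversationalMode = atSessionOpen
    ? update.conversational === true
    : conversational;
  if (inConversationalMode && COERCED_STRATEGIES.includes(update.segmenter_strategy)) {
    warnings.push(`segmenter_strategy "${update.segmenter_strategy}" will be coerced to full_turn`);
  }
  return { update, warnings };
}
```

Surfacing the warnings in client logs makes a silently ignored toggle easier to spot than waiting for the server-side log line.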

### Segmenter strategies

| Strategy              | Behaviour                                                                                                                                                   |
| --------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `auto`                | Default. `inworld-tts-2` uses sentence splits; older models use balanced splits.                                                                            |
| `balanced`            | Punctuation + conjunction splits. Tuned for `inworld-tts-1.5`.                                                                                              |
| `sentence`            | Hard terminal-punctuation splits only. Tuned for `inworld-tts-2`.                                                                                           |
| `full_turn`           | Buffer the entire LLM turn and emit it at turn end. Highest quality, highest latency.                                                                       |
| `fast_start`          | Strict sentence rules for the first emission, then a relaxed config (larger chunks, no idle-flush) for the rest of the turn. Optimizes time-to-first-audio. |
| `per_segment_context` | Each segment opens a fresh TTS context on the duplex stream. Per-segment handles are serialized so audio order is preserved.                                |

### TTS timestamps and alignment

Setting `timestamp_type` opts into timing data on `response.output_audio.delta` events. This is useful for lip-sync animation (viseme blending), word-level highlighting, or karaoke-style captions.

#### Choosing sync vs async

| Strategy | Behaviour                                                                                                                            | Use when                                                                                                              |
| -------- | ------------------------------------------------------------------------------------------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------- |
| `SYNC`   | `timestamp_info` arrives on the **same** `response.output_audio.delta` as the audio bytes.                                           | Real-time lip-sync or word highlighting — you need the timing before playback of that chunk.                          |
| `ASYNC`  | Audio-only deltas stream first; alignment arrives in **trailing** deltas with an empty `delta` field and populated `timestamp_info`. | Low-latency playback — you don't need timing until after the audio has played, or you post-process alignment offline. |
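
On the client, both strategies can share one handler that treats audio and timing as independent parts of a delta. The following sketch assumes Node.js (`Buffer` for base64 decoding; a browser would need a different decoder) and uses a hypothetical helper name, `splitAudioDelta`.

```javascript
// Splits a response.output_audio.delta event into its audio and timing parts.
// Returns null for any other event type.
function splitAudioDelta(event) {
  if (event.type !== 'response.output_audio.delta') return null;
  const hasAudio = typeof event.delta === 'string' && event.delta.length > 0;
  return {
    // ASYNC trailing deltas carry an empty `delta`, so audio is null for them.
    audio: hasAudio ? Buffer.from(event.delta, 'base64') : null,
    // Absent when alignment is off, or on ASYNC audio-only deltas.
    timing: event.timestamp_info ?? null
  };
}
```

With `SYNC`, a single call yields both `audio` and `timing`; with `ASYNC`, audio-only results come first and timing-only results follow, so the caller needs to buffer or post-process alignment separately.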

#### Output shape — `response.output_audio.delta`

When timestamps are enabled, the `response.output_audio.delta` event carries an optional `timestamp_info` field. Exactly one of `word_alignment` or `character_alignment` is populated, matching the requested `timestamp_type`.

```json theme={"system"}
{
  "type": "response.output_audio.delta",
  "event_id": "evt_abc123",
  "response_id": "resp_001",
  "item_id": "item_audio_01",
  "output_index": 0,
  "content_index": 0,
  "delta": "<base64 audio or empty for ASYNC trailing>",
  "timestamp_info": {
    "word_alignment": {
      "words": ["Hello", ",", " ", "world"],
      "word_start_time_seconds": [0.0, 0.32, 0.35, 0.38],
      "word_end_time_seconds": [0.32, 0.35, 0.38, 0.72],
      "phonetic_details": [
        {
          "word_index": 0,
          "phones": [
            { "phone_symbol": "HH", "start_time_seconds": 0.0, "duration_seconds": 0.08, "viseme_symbol": "chjsh" },
            { "phone_symbol": "AH", "start_time_seconds": 0.08, "duration_seconds": 0.10, "viseme_symbol": "aei" },
            { "phone_symbol": "L",  "start_time_seconds": 0.18, "duration_seconds": 0.06, "viseme_symbol": "l" },
            { "phone_symbol": "OW", "start_time_seconds": 0.24, "duration_seconds": 0.08, "viseme_symbol": "o" }
          ],
          "is_partial": false
        }
      ]
    }
  }
}
```

**`word_alignment`** (when `timestamp_type = "WORD"`):

| Field                     | Type      | Description                                                                    |
| ------------------------- | --------- | ------------------------------------------------------------------------------ |
| `words`                   | string\[] | Tokens in the original text — words, punctuation, and whitespace — in order.   |
| `word_start_time_seconds` | number\[] | Start time of each token, relative to the beginning of the synthesized stream. |
| `word_end_time_seconds`   | number\[] | End time of each token.                                                        |
| `phonetic_details`        | object\[] | Per-word phoneme timing and viseme symbols (TTS 1.5 and TTS-2 only).           |

Each entry in `phonetic_details`:

| Field        | Type      | Description                                                                                                |
| ------------ | --------- | ---------------------------------------------------------------------------------------------------------- |
| `word_index` | integer   | Index into `words[]` this detail covers.                                                                   |
| `phones`     | object\[] | Phoneme spans with `phone_symbol`, `start_time_seconds`, `duration_seconds`, and `viseme_symbol`.          |
| `is_partial` | boolean   | `true` when this is a partial update (SYNC mid-word boundary); `false` once the word is fully synthesized. |
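
For lip-sync, the per-word phone spans can be flattened into a single timeline of viseme keyframes. The sketch below is one possible shape for that step; `visemeKeyframes` and the keyframe object layout are hypothetical, and the times are passed through exactly as delivered.

```javascript
// Flattens word_alignment.phonetic_details into a time-ordered viseme list.
function visemeKeyframes(wordAlignment) {
  const frames = [];
  for (const detail of wordAlignment.phonetic_details ?? []) {
    for (const phone of detail.phones ?? []) {
      frames.push({
        word: wordAlignment.words[detail.word_index],
        viseme: phone.viseme_symbol,
        start: phone.start_time_seconds,
        end: phone.start_time_seconds + phone.duration_seconds,
        // Partial entries may be revised once the word is fully synthesized.
        partial: detail.is_partial === true
      });
    }
  }
  return frames.sort((a, b) => a.start - b.start);
}
```

A renderer would typically replace any keyframes marked `partial` when a later delta delivers the completed word.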

**`character_alignment`** (when `timestamp_type = "CHARACTER"`):

| Field                          | Type      | Description                                  |
| ------------------------------ | --------- | -------------------------------------------- |
| `characters`                   | string\[] | Individual characters/punctuation, in order. |
| `character_start_time_seconds` | number\[] | Start time of each character.                |
| `character_end_time_seconds`   | number\[] | End time of each character.                  |
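
For karaoke-style captions, the three parallel arrays can be queried by playback time. The sketch below is a minimal linear scan; `activeCharacterIndex` is a hypothetical helper and the sample values are invented for illustration, not taken from real TTS output.

```javascript
// Returns the index of the character being spoken at `playbackSeconds`,
// or -1 when the time falls outside every character span.
function activeCharacterIndex(alignment, playbackSeconds) {
  const starts = alignment.character_start_time_seconds;
  const ends = alignment.character_end_time_seconds;
  for (let i = 0; i < starts.length; i++) {
    if (playbackSeconds >= starts[i] && playbackSeconds < ends[i]) return i;
  }
  return -1;
}

// Invented sample in the documented shape.
const sampleAlignment = {
  characters: ['H', 'i'],
  character_start_time_seconds: [0.0, 0.1],
  character_end_time_seconds: [0.1, 0.25]
};
const highlighted = sampleAlignment.characters[activeCharacterIndex(sampleAlignment, 0.12)]; // 'i'
```

For long utterances a binary search over the start times would scale better than the linear scan shown here.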

<Note>
  For the full viseme symbol table and per-language timestamp support, see [TTS timestamps](/tts/capabilities/timestamps).
</Note>

#### WebRTC

Over WebRTC, audio travels on the RTP media track (not as base64). Alignment data is delivered on the **data channel** in the same `response.output_audio.delta` event shape, but the `delta` field is always an empty string (the audio is already on the media track).
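
A data-channel handler therefore only needs the timing part of each event. The sketch below assumes data-channel messages arrive as JSON strings, which this page does not state explicitly; `alignmentFromDataChannel` is a hypothetical name.

```javascript
// Extracts timestamp_info from a data-channel message, ignoring `delta`
// because the audio itself travels on the RTP media track.
function alignmentFromDataChannel(rawMessage) {
  let event;
  try {
    event = JSON.parse(rawMessage); // assumption: messages are JSON text
  } catch {
    return null; // not JSON; nothing to extract
  }
  if (event?.type !== 'response.output_audio.delta') return null;
  return event.timestamp_info ?? null;
}

// Usage with a standard RTCDataChannel:
// channel.onmessage = (msg) => {
//   const timing = alignmentFromDataChannel(msg.data);
//   if (timing) { /* schedule captions or visemes */ }
// };
```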

## Memory (`providerData.memory`)

Automatic conversation memory and summarization. When enabled, the server periodically asks the LLM to extract durable facts and a rolling summary, prepends them to the system prompt, and trims the transcript so context stays bounded.

```javascript theme={"system"}
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    providerData: {
      memory: {
        enabled: true,
        turn_interval: 5,
        max_memory_length: 2000,
        max_transcript_items: 40,
        max_facts: 50,
        trim_after_summarize: true
      }
    }
  }
}));
```

| Field                  | Type    | Default | Description                                        |
| ---------------------- | ------- | ------- | -------------------------------------------------- |
| `enabled`              | boolean | `false` | Enable automatic memory generation.                |
| `turn_interval`        | integer | `5`     | Generate memory every N completed turns.           |
| `max_memory_length`    | integer | `2000`  | Maximum character length for the rolling summary.  |
| `max_transcript_items` | integer | `40`    | Maximum conversation items to keep after trimming. |
| `max_facts`            | integer | `50`    | Maximum facts retained in `state.facts`.           |
| `trim_after_summarize` | boolean | `true`  | Remove old transcript items after summarization.   |

After each generation cycle the server populates `providerData.memory.state` (read-only) and emits a `session.updated` event so clients can observe the rolling summary, fact list, and bookkeeping counters.
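
A client that wants to watch memory evolve can read the state off each `session.updated` event. The sketch below makes two assumptions that this page does not spell out: that the event carries the session under `event.session` (as in the OpenAI Realtime spec), and that `state.facts` is an array. Both helper names are hypothetical.

```javascript
// Returns the read-only memory state from a session.updated event, or null.
// Assumption: the state lives at event.session.providerData.memory.state.
function memoryStateFrom(event) {
  if (event?.type !== 'session.updated') return null;
  return event.session?.providerData?.memory?.state ?? null;
}

// Counts retained facts. Assumption: state.facts is an array.
function factCount(state) {
  return Array.isArray(state?.facts) ? state.facts.length : 0;
}
```

Treat the state as observational only; it is marked read-only, so changes belong in the configuration fields above rather than in `state`.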

## Back-channel (`providerData.backchannel`)

Short audio interjections — `"uh-huh"`, `"right"`, `"I see"` — emitted **while the user is still speaking**. Opt-in per session and gated by server prerequisites; contact your account team to confirm prerequisites for your deployment.

For event handling (the `response.backchannel.audio.delta` / `.done` / `.skipped` events), client integration tips, and tuning guidance, see the dedicated [Back-channel](/realtime/usage/back-channel) guide.

```javascript theme={"system"}
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    providerData: {
      backchannel: {
        enabled: true,
        eval_interval_ms: 800,
        min_speech_ms: 800,
        min_gap_ms: 4000,
        max_per_turn: 3,
        hard_deadline_ms: 1500,
        history_tail_items: 4,
        temperature: 0.7,
        max_tokens: 6,
        volume_gain: 0.6,
        require_pause: false,
        decider_kind: 'llm'
      }
    }
  }
}));
```

| Field                   | Type                | Default        | Description                                                                                                                                                          |
| ----------------------- | ------------------- | -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `enabled`               | boolean             | `false`        | Per-session opt-in. Sessions that don't send this field never receive back-channels.                                                                                 |
| `small_model`           | string              | server default | Override the decider LLM model identifier. Empty string inherits the default.                                                                                        |
| `eval_interval_ms`      | integer             | `800`          | How often the manager evaluates eligibility while the user is producing partial transcripts.                                                                         |
| `min_speech_ms`         | integer             | `800`          | Minimum time after speech onset before any back-channel can fire.                                                                                                    |
| `min_gap_ms`            | integer             | `4000`         | Minimum spacing between two back-channels in the same user turn.                                                                                                     |
| `max_per_turn`          | integer             | `3`            | Cap on back-channels emitted within a single user turn.                                                                                                              |
| `hard_deadline_ms`      | integer             | `1500`         | Combined small-LLM + TTS deadline per attempt. Misses are dropped.                                                                                                   |
| `history_tail_items`    | integer             | `4`            | Recent conversation items the small LLM sees as context.                                                                                                             |
| `temperature`           | number              | `0.7`          | Sampling temperature for the small LLM.                                                                                                                              |
| `max_tokens`            | integer             | `6`            | Max tokens for the small LLM's reply.                                                                                                                                |
| `volume_gain`           | number              | `0.6`          | Linear gain multiplier applied to synthesized back-channel audio. `0.0` mutes; `1.0` keeps the synthesized volume; values above `1.0` amplify. |
| `require_pause`         | boolean             | `false`        | When `true`, only fire after a smart-turn pause signal (`input_audio_buffer.turn_suggestion`).                                                                       |
| `allowed_phrases`       | string\[]           | server default | Restrict the phrase bank. `null` / omitted inherits the default; an explicit empty array disables back-channel for the session; a populated array replaces the bank. |
| `prompt_template`       | string              | server default | Override the decider prompt. Supports Go `text/template` tokens `{{.PhrasesList}}`, `{{.History}}`, `{{.Partial}}`.                                                  |
| `decider_kind`          | `"llm"` \| `"rule"` | `"llm"`        | `llm` uses a small LLM. `rule` picks phrases from the bank with per-tick probability `rule_fire_probability`.                                                        |
| `rule_fire_probability` | number              | `1.0`          | Per-tick fire probability for the rule decider (`0.0`–`1.0`). Ignored when `decider_kind != "rule"`.                                                                 |
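
When tuning, it can help to model how the three timing gates (`min_speech_ms`, `min_gap_ms`, `max_per_turn`) combine. The sketch below is a client-side reading of the table, not the server's implementation; `passesTimingGates` and the `turn` bookkeeping shape are hypothetical.

```javascript
// Defaults copied from the field table.
const BACKCHANNEL_DEFAULTS = { min_speech_ms: 800, min_gap_ms: 4000, max_per_turn: 3 };

// `turn` is caller-tracked state (hypothetical shape):
//   { speechStartMs: number, firedAtMs: number[] }
function passesTimingGates(nowMs, turn, cfg = {}) {
  const c = { ...BACKCHANNEL_DEFAULTS, ...cfg };
  // Gate 1: enough speech has elapsed since onset.
  if (nowMs - turn.speechStartMs < c.min_speech_ms) return false;
  // Gate 2: the per-turn cap has not been reached.
  if (turn.firedAtMs.length >= c.max_per_turn) return false;
  // Gate 3: enough time has passed since the previous back-channel.
  const last = turn.firedAtMs[turn.firedAtMs.length - 1];
  if (last !== undefined && nowMs - last < c.min_gap_ms) return false;
  return true;
}
```

With the defaults, a ten-second user turn can carry at most three interjections, the first no earlier than 800 ms in and each later one at least four seconds after the last. Passing the gates only makes a back-channel eligible; the decider still chooses whether to fire.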

Sending `providerData.backchannel: {}` (empty object) clears all overrides; the server falls back to its compiled-in defaults.
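
Because an empty object, an empty `allowed_phrases` array, and an omitted field all mean different things, it may be worth giving each intent its own builder. The sketch below is a hypothetical set of helpers (none of these names exist in an SDK) that makes the distinction explicit.

```javascript
// Wraps a backchannel branch in a session.update event.
const wrapBackchannel = (backchannel) => ({
  type: 'session.update',
  session: { providerData: { backchannel } }
});

// Empty object: clears all overrides, server falls back to its defaults.
const resetBackchannel = () => wrapBackchannel({});

// Explicit empty array: disables back-channel for the session.
const silenceBackchannel = () => wrapBackchannel({ allowed_phrases: [] });

// Populated array: replaces the phrase bank.
function replacePhraseBank(phrases) {
  if (!Array.isArray(phrases) || phrases.length === 0) {
    throw new Error('Empty bank disables back-channel; call silenceBackchannel() to do that on purpose');
  }
  return wrapBackchannel({ allowed_phrases: [...phrases] });
}
```

Keeping the three cases apart avoids accidentally silencing back-channel when a phrase list happens to be empty after filtering.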

## Responsiveness (`providerData.responsiveness`)

Short filler audio (`"let me think"`, `"one moment"`) spoken **after the user's turn ends** if the main LLM is slow to produce its first delta. Opt-in per session and gated by two server prerequisites (a small filler model and an Unleash flag); contact your account team to confirm both are in place.

For how the filler races the main LLM, TTS pipeline details, and tuning guidance, see the dedicated [Responsiveness](/realtime/usage/responsiveness) guide.

```javascript theme={"system"}
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    providerData: {
      responsiveness: {
        enabled: true,
        initial_wait_timeout_ms: 1200,
        hard_deadline_ms: 2000,
        history_tail_items: 4,
        temperature: 0.7,
        max_tokens: 12,
        min_filler_gap_ms: 8000,
        max_initial_per_turn: 1,
        enable_filler_on_first_assistant_reply: false,
        pause_text: ''
      }
    }
  }
}));
```
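
As a mental model for tuning `initial_wait_timeout_ms`, the decision reduces to comparing the main LLM's first-delta latency against the timeout. The sketch below is conceptual only: the race runs on the server, `fillerFires` is a hypothetical name, and the behaviour at exactly the timeout boundary is an assumption.

```javascript
// Conceptual model of the server-side race; not something the client executes.
// Requires an explicit initial_wait_timeout_ms, since the server default is
// not published on this page.
function fillerFires(firstDeltaLatencyMs, cfg) {
  if (!cfg.enabled) return false;
  if (typeof cfg.initial_wait_timeout_ms !== 'number') {
    throw new Error('initial_wait_timeout_ms must be set to model the race');
  }
  // Assumption: the filler is committed only once the timeout has fully elapsed.
  return firstDeltaLatencyMs > cfg.initial_wait_timeout_ms;
}

const cfg = { enabled: true, initial_wait_timeout_ms: 1200 };
const slowTurn = fillerFires(2000, cfg); // true: main LLM missed the window
const fastTurn = fillerFires(300, cfg);  // false: first delta arrived in time
```

Running recorded first-delta latencies through a model like this gives a rough estimate of how often a given timeout would trigger fillers before changing it in production.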

| Field                                    | Type    | Default        | Description                                                                                                                       |
| ---------------------------------------- | ------- | -------------- | --------------------------------------------------------------------------------------------------------------------------------- |
| `enabled`                                | boolean | `false`        | Per-session opt-in. A session that does not send this object never gets a filler.                                                 |
| `small_model`                            | string  | server default | Override the filler LLM model identifier.                                                                                         |
| `initial_wait_timeout_ms`                | integer | server default | How long to wait for the main LLM's first delta before committing to the filler. Lower values fire fillers more aggressively.     |
| `hard_deadline_ms`                       | integer | server default | Caps the small filler LLM's total streaming time so a slow filler model can't become a latency tax.                               |
| `history_tail_items`                     | integer | server default | Recent conversation items the small LLM sees as context.                                                                          |
| `temperature`                            | number  | server default | Sampling temperature for the small LLM.                                                                                           |
| `max_tokens`                             | integer | server default | Caps the small LLM's response length. Keep small — fillers should be brief.                                                       |
| `min_filler_gap_ms`                      | integer | server default | Minimum gap between any two fillers within a single user-turn chain.                                                              |
| `max_initial_per_turn`                   | integer | `1`            | Caps initial fillers per user turn.                                                                                               |
| `max_buffer_deltas`                      | integer | server default | Bounds the in-memory buffer of main-LLM deltas held while the filler is being spoken.                                             |
| `enable_filler_on_first_assistant_reply` | boolean | `false`        | Allows responsiveness fillers on the very first assistant response in a session.                                                  |
| `prompt_template`                        | string  | server default | Overrides the system prompt fed to the small filler LLM. Append a language directive here for multilingual sessions.              |
| `pause_text`                             | string  | server default | TTS-only hint injected between the filler and the main answer (e.g. a brief connector word). Empty string disables injection.     |
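
To make the timing fields concrete, here is a simplified client-side model of the decision described for `initial_wait_timeout_ms`: a filler is committed only when the main LLM's first delta has not arrived within the wait window. This is an illustrative sketch, not the server's implementation; `wouldCommitFiller` and its arguments are hypothetical names, and the other gates (`min_filler_gap_ms`, `max_initial_per_turn`, and so on) are left out.

```javascript
// Simplified model (assumption): the server commits to a filler when the main
// LLM's first delta has not arrived within initial_wait_timeout_ms.
function wouldCommitFiller(firstDeltaLatencyMs, config) {
  if (!config.enabled) return false; // sessions that never opt in get no filler
  return firstDeltaLatencyMs > config.initial_wait_timeout_ms;
}

const responsiveness = { enabled: true, initial_wait_timeout_ms: 1200 };

console.log(wouldCommitFiller(800, responsiveness));  // first delta arrives inside the window
console.log(wouldCommitFiller(1800, responsiveness)); // first delta arrives after the window
```

With a 1200 ms window, an 800 ms first delta gets no filler while an 1800 ms one does. Whether the exact boundary is inclusive is not specified on this page, so the `>` comparison is a guess.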

## Text generation config (`text_generation_config`)

Fine-grained LLM generation parameters sent as a **top-level** field on the session object (alongside `model`, `temperature`, `providerData`, etc.). The same object is also accepted under `providerData.text_generation_config` for compatibility — both paths are merged into the same state.

```javascript theme={"system"}
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    text_generation_config: {
      reasoning: {
        effort: 'HIGH',
        maxTokens: 1024,
        exclude: false
      }
    }
  }
}));
```

### Reasoning

Controls chain-of-thought reasoning on models that support it. The server forwards this as `extra_body.reasoning` to the LLM Router.

| Field       | Type    | Description                                                                                                                                                                                |
| ----------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `effort`    | string  | Reasoning depth. One of `NONE`, `MINIMAL`, `LOW`, `MEDIUM`, `HIGH`, `XHIGH` (case-sensitive on the wire). `NONE` disables reasoning entirely; higher values allocate more thinking tokens. |
| `maxTokens` | integer | Cap on reasoning/thinking tokens the model may emit.                                                                                                                                       |
| `exclude`   | boolean | When `true`, reasoning tokens are generated but excluded from the response text — useful for latency-sensitive paths where you still want reasoning to influence quality.                  |

<Note>
  Parameter support varies by model. Some models do not support reasoning at all, while others support only a subset of effort levels (e.g. `gemini-3.1-pro` does not support `MINIMAL`). If the upstream model rejects the requested effort, the LLM Router returns a 400 error.
</Note>

When `reasoning` is omitted entirely, the server uses the model's default reasoning behaviour. For reasoning-capable models this default may not be `NONE` — meaning reasoning tokens (and their latency) are added implicitly. If you need minimal latency on a reasoning-capable model, explicitly set `effort: "NONE"` to disable reasoning.
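
Because effort values are case-sensitive on the wire, a small client-side guard can catch typos before the LLM Router returns a 400. The sketch below validates the value against the list in the table and builds the `session.update`; `buildReasoningUpdate` is a hypothetical helper name, and `ws` is assumed to be an open WebSocket.

```javascript
// Effort values listed in the table above; they are case-sensitive on the wire.
const REASONING_EFFORTS = ['NONE', 'MINIMAL', 'LOW', 'MEDIUM', 'HIGH', 'XHIGH'];

// Hypothetical helper: validates the effort string and builds the session.update.
function buildReasoningUpdate(effort) {
  if (!REASONING_EFFORTS.includes(effort)) {
    throw new RangeError(`Unknown reasoning effort: ${effort}`);
  }
  return {
    type: 'session.update',
    session: {
      text_generation_config: { reasoning: { effort } }
    }
  };
}

// Disable reasoning explicitly for a latency-sensitive session.
const disableReasoning = buildReasoningUpdate('NONE');
// With an open socket: ws.send(JSON.stringify(disableReasoning));
```

This guard only checks spelling. A model can still reject an effort level it does not support, so the 400 path needs handling either way.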

Reasoning token usage is reported in `response.done` under `usage.output_token_details.reasoning_tokens`.
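
A minimal sketch of reading that counter from a parsed server event follows. It assumes the usage object sits at `event.response.usage`, as in the OpenAI Realtime spec, and falls back to `event.usage`; `getReasoningTokens` is a hypothetical helper and the sample event is hand-written.

```javascript
// Hypothetical helper: extracts reasoning token usage from a parsed server event.
// Assumption: usage sits at event.response.usage (as in the OpenAI Realtime
// spec); event.usage is checked as a fallback.
function getReasoningTokens(event) {
  if (event.type !== 'response.done') return null; // not a usage-bearing event
  const usage = event.response?.usage ?? event.usage;
  return usage?.output_token_details?.reasoning_tokens ?? 0;
}

// Hand-written sample event; the token count is made up for illustration.
const sampleEvent = {
  type: 'response.done',
  response: { usage: { output_token_details: { reasoning_tokens: 256 } } }
};

console.log(getReasoningTokens(sampleEvent));
// In a message handler: getReasoningTokens(JSON.parse(message.data))
```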

### Other fields

| Field               | Type      | Description                                                                        |
| ------------------- | --------- | ---------------------------------------------------------------------------------- |
| `maxNewTokens`      | integer   | Max completion tokens. Equivalent to `max_output_tokens` on the session.           |
| `temperature`       | number    | Sampling temperature override (takes precedence over session-level `temperature`). |
| `topP`              | number    | Nucleus sampling.                                                                  |
| `frequencyPenalty`  | number    | Frequency penalty.                                                                 |
| `presencePenalty`   | number    | Presence penalty.                                                                  |
| `repetitionPenalty` | number    | Repetition penalty (model-specific).                                               |
| `stopSequences`     | string\[] | Custom stop sequences.                                                             |
| `seed`              | integer   | Deterministic sampling seed (model-specific).                                      |
| `logitBias`         | object\[] | Per-token likelihood adjustments (`{ tokenId, biasValue }`).                       |

All fields are optional and hot-swappable.
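
These fields are camelCase, unlike the snake_case keys under `providerData`, so a mistyped key is easy to send. The sketch below is a client-side guard that rejects keys not listed in the tables above before wrapping the overrides in a `session.update`. `buildGenerationUpdate` is a hypothetical helper, and how the server treats unknown keys is not specified on this page.

```javascript
// Field names from the tables above (camelCase on the wire).
const GENERATION_FIELDS = new Set([
  'reasoning', 'maxNewTokens', 'temperature', 'topP', 'frequencyPenalty',
  'presencePenalty', 'repetitionPenalty', 'stopSequences', 'seed', 'logitBias'
]);

// Hypothetical helper: rejects unknown keys, then wraps the overrides in a session.update.
function buildGenerationUpdate(overrides) {
  const unknown = Object.keys(overrides).filter((key) => !GENERATION_FIELDS.has(key));
  if (unknown.length > 0) {
    throw new TypeError(`Unknown text_generation_config field(s): ${unknown.join(', ')}`);
  }
  return {
    type: 'session.update',
    session: { text_generation_config: { ...overrides } }
  };
}

// Illustrative values only, not recommendations.
const generationUpdate = buildGenerationUpdate({
  maxNewTokens: 256,
  topP: 0.9,
  stopSequences: ['\n\n']
});
// With an open socket: ws.send(JSON.stringify(generationUpdate));
```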

## Prompt caching (`providerData.caching`)

Opt a session into **explicit prompt caching** for the stable blocks that are resent on every turn: the system instructions and the tool definitions. When enabled, the server attaches an ephemeral `cache_control` breakpoint to those blocks so providers that support explicit caching (Anthropic, Google) can serve them from cache, cutting input-token cost on every turn after the first. Providers with implicit caching (OpenAI, Gemini 2.5, DeepSeek) cache automatically and ignore the breakpoint, so leaving this off does not disable their caching.

```javascript theme={"system"}
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    providerData: {
      caching: {
        enabled: true,
        ttl: '5m',
        cache_instructions: true,
        cache_tools: true
      }
    }
  }
}));
```

| Field                | Type    | Default             | Description                                                                                           |
| -------------------- | ------- | ------------------- | ----------------------------------------------------------------------------------------------------- |
| `enabled`            | boolean | `false`             | Master switch. When off (or absent), no `cache_control` breakpoints are sent.                         |
| `ttl`                | string  | provider default    | Cache lifetime for the breakpoint, e.g. `"5m"`, `"1h"`. Empty string lets the provider default apply. |
| `cache_instructions` | boolean | `true` when enabled | Attach a breakpoint to the system instructions block.                                                 |
| `cache_tools`        | boolean | `true` when enabled | Attach a breakpoint to the last tool definition (caches the whole tool block).                        |

<Note>
  Explicit caching only pays off for large blocks — providers apply a minimum cacheable size (around **1024 tokens**). Below that, Anthropic silently skips the cache, while Google may reject the request. Enable caching only when the instructions and/or tool definitions are substantial. For the `cache_control` protocol, TTL prolongation, and the full list of supported providers, see [Prompt caching](/router/capabilities/caching).
</Note>
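
One way to act on the size guidance above is to gate each flag on a rough token estimate. The sketch below assumes about 4 characters per token, which is only a ballpark for English text; real tokenizers differ by provider. `estimateTokens` and `buildCachingConfig` are hypothetical helpers, and whether providers apply the minimum per block or to the combined prefix is not stated here.

```javascript
// Approximate minimum cacheable size mentioned in the note above.
const MIN_CACHEABLE_TOKENS = 1024;

// Rough heuristic (assumption): about 4 characters per token for English text.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Hypothetical helper: turns on each breakpoint only when its block looks large enough.
function buildCachingConfig(instructions, toolsJson) {
  const cacheInstructions = estimateTokens(instructions) >= MIN_CACHEABLE_TOKENS;
  const cacheTools = estimateTokens(toolsJson) >= MIN_CACHEABLE_TOKENS;
  return {
    enabled: cacheInstructions || cacheTools,
    cache_instructions: cacheInstructions,
    cache_tools: cacheTools
  };
}

const smallPrompt = buildCachingConfig('You are a helpful agent.', '[]');
const largePrompt = buildCachingConfig('x'.repeat(8000), '[]'); // ~2000 estimated tokens
// Send the result as session.providerData.caching in a session.update.
```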

Cache usage is reported on `response.done` under `usage.input_token_details`: `cached_tokens` (tokens served from a cache hit) and `cache_write_tokens` (tokens written when establishing a new cache entry). All fields are hot-swappable.
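
The sketch below summarizes those two counters and derives a hit ratio. It assumes the usage object sits at `event.response.usage` (with `event.usage` as a fallback) and that it carries an `input_tokens` total as in the OpenAI Realtime spec; `summarizeCacheUsage` is a hypothetical helper and the sample numbers are made up.

```javascript
// Hypothetical helper: summarizes prompt-cache usage from a response.done event.
// Assumptions: usage sits at event.response.usage (event.usage as a fallback)
// and includes an input_tokens total.
function summarizeCacheUsage(event) {
  const usage = event.response?.usage ?? event.usage ?? {};
  const details = usage.input_token_details ?? {};
  const cached = details.cached_tokens ?? 0;
  const written = details.cache_write_tokens ?? 0;
  const input = usage.input_tokens ?? 0;
  return { cached, written, hitRatio: input > 0 ? cached / input : 0 };
}

// Hand-written sample event for illustration.
const cacheSample = {
  type: 'response.done',
  response: {
    usage: {
      input_tokens: 2000,
      input_token_details: { cached_tokens: 1500, cache_write_tokens: 0 }
    }
  }
};

console.log(summarizeCacheUsage(cacheSample));
```

A nonzero `written` with zero `cached` typically marks the turn that established the cache entry; later turns should show the reverse.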

## Session metadata

Two optional fields sit alongside the feature branches at the top level of `providerData`. They don't configure STT, TTS, or memory; they tag the session so it can be traced, correlated, and routed downstream.

```javascript theme={"system"}
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    providerData: {
      user_id: 'user_abc123',
      metadata: {
        tenant: 'acme-corp',
        experiment: 'voice-preset-A'
      }
    }
  }
}));
```

| Field      | Type                     | Description                                                                                                                                                             |
| ---------- | ------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `user_id`  | string                   | Stable per-user identifier surfaced in tracing, logs, and downstream service requests. Useful for cross-session memory keying and incident debugging.                   |
| `metadata` | object (string → string) | Arbitrary key-value pairs forwarded to the LLM router as `extra_body.metadata`. Use for downstream-routing hints, customer-side correlation IDs, or A/B-test bucketing. |

Both fields are optional and hot-swappable.
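
Since `metadata` is documented as a string-to-string map, numeric bucket IDs or boolean flags need converting before they are sent. The sketch below does that on the client; `buildSessionTags` is a hypothetical helper, and dropping `null`/`undefined` entries is a choice made here, not documented server behavior.

```javascript
// Hypothetical helper: coerces metadata values to strings, since the field is
// documented as a string → string map. null/undefined entries are dropped.
function buildSessionTags(userId, metadata = {}) {
  const stringMetadata = {};
  for (const [key, value] of Object.entries(metadata)) {
    if (value === null || value === undefined) continue;
    stringMetadata[key] = String(value);
  }
  return {
    type: 'session.update',
    session: { providerData: { user_id: userId, metadata: stringMetadata } }
  };
}

const tagUpdate = buildSessionTags('user_abc123', { bucket: 3, beta: true, note: null });
// With an open socket: ws.send(JSON.stringify(tagUpdate));
```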

## Hot-swap reference

Most `providerData` fields take effect on the next audio chunk or turn after the `session.update` is acknowledged. The exceptions, which are locked once at session open and ignored afterwards, are:

* `providerData.tts.conversational`
* `providerData.tts.user_turn_mode`

If you need to change either of these, open a new WebSocket session.
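
Because a locked field in a mid-session update is ignored rather than rejected, a client-side check can make the problem visible. The sketch below reports which locked paths an update would touch; `findLockedFields` is a hypothetical helper and the field values are placeholders.

```javascript
// Fields documented above as locked at session open.
const LOCKED_TTS_FIELDS = ['conversational', 'user_turn_mode'];

// Hypothetical helper: returns the locked paths a mid-session update tries to set.
function findLockedFields(providerData) {
  const tts = providerData.tts ?? {};
  return LOCKED_TTS_FIELDS
    .filter((field) => field in tts)
    .map((field) => `providerData.tts.${field}`);
}

// Placeholder values; only the presence of the keys matters to this check.
const pendingUpdate = {
  tts: { conversational: true },
  responsiveness: { enabled: true }
};

const lockedPaths = findLockedFields(pendingUpdate);
if (lockedPaths.length > 0) {
  console.warn(`Open a new session to change: ${lockedPaths.join(', ')}`);
}
```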

## See also

* [WebSocket API reference](/api-reference/realtimeAPI/realtime/realtime-websocket) — schema-rendered playground
* [Configuring Models](/realtime/usage/using-realtime-models) — model selection, voice, VAD, function calling, reasoning effort
* [TTS timestamps](/tts/capabilities/timestamps) — viseme symbol table and per-language support for the standalone TTS API
* [Back-channel](/realtime/usage/back-channel) — event handling and tuning for `providerData.backchannel`
* [Responsiveness](/realtime/usage/responsiveness) — filler racing logic and tuning for `providerData.responsiveness`