The Inworld Realtime API is wire-compatible with the OpenAI Realtime spec — clients written against OpenAI’s session, audio, and response events work against Inworld unchanged. On top of that baseline, Inworld layers production-grade extensions that improve quality, latency, and conversational naturalness:
  • STT tuning: voice profile signals, language hints, explicit end-of-turn and VAD overrides
  • TTS segmentation, steering, and alignment: choose how the LLM token stream is chunked into TTS calls, set the synthesis language and the TTS-2 delivery preset, enable a shared multi-turn context (TTS-2 only), and opt into word- or character-level timestamp alignment for lip-sync or captions
  • Automatic conversation memory: periodic summarization and fact extraction that keep long sessions inside the context window
  • Back-channel: short interjections ("uh-huh", "I see") emitted while the user is still speaking, so the agent feels like an active listener
  • Responsiveness fillers: short filler audio ("let me think") spoken in the gap after a user turn if the main LLM is slow to produce its first delta
Everything Inworld adds beyond the OpenAI spec is exposed through a single field on the session object: providerData. Send it inside any session.update and the server merges it with current state. Most fields hot-swap mid-session and take effect on the next audio chunk or turn; the locked-at-session-open exceptions are called out in the hot-swap reference at the bottom of this page.
This page is the field-by-field reference for the full providerData surface. For task-driven walkthroughs (language switching, conversation management, etc.) and for the event-handling client code that pairs with back-channel and responsiveness, see the linked guides under each branch.

Branch overview

providerData is a flat object with five branches. Each branch is independent — send only the ones you want to configure.
| Branch | Purpose | Hot-swap |
| --- | --- | --- |
| stt | STT tuning (voice profile, language hints, end-of-turn thresholds, VAD overrides) | Yes; the STT stream is restarted on the next audio chunk |
| tts | TTS segmentation, language, delivery preset, conversational context, timestamp alignment | Mostly; conversational and user_turn_mode are locked at session open |
| memory | Automatic conversation memory and summarization | Yes |
| backchannel | Short interjections while the user is speaking | Yes |
| responsiveness | Filler audio while the main LLM warms up after a user turn | Yes |
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    providerData: {
      stt: { /* ... */ },
      tts: { /* ... */ },
      memory: { /* ... */ },
      backchannel: { /* ... */ },
      responsiveness: { /* ... */ }
    }
  }
}));
Partial updates are supported on every branch: omit a field to keep its current value.
providerData also accepts two top-level metadata fields alongside the branches: user_id and metadata. They aren’t configuration branches; they tag the session for tracing and downstream routing. See Session metadata below.
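The partial-update behaviour described above can be sketched with a small client-side helper. This is a minimal sketch, not part of the Inworld API: buildProviderDataUpdate and KNOWN_KEYS are hypothetical names, and the key list is simply the set of providerData keys this page mentions.

```javascript
// Hypothetical helper (not part of any SDK): build a session.update that
// carries only the providerData keys the caller wants to change.
const KNOWN_KEYS = [
  'stt', 'tts', 'memory', 'backchannel', 'responsiveness',
  'user_id', 'metadata', 'text_generation_config',
];

function buildProviderDataUpdate(changes) {
  const providerData = {};
  for (const [key, value] of Object.entries(changes)) {
    if (value === undefined) continue; // omitted: the server keeps its current value
    if (!KNOWN_KEYS.includes(key)) {
      throw new Error(`Unknown providerData key: ${key}`);
    }
    providerData[key] = value;
  }
  return { type: 'session.update', session: { providerData } };
}

// Only the tts branch is sent; stt is left untouched on the server.
const update = buildProviderDataUpdate({ tts: { language: 'pt-BR' }, stt: undefined });
// ws.send(JSON.stringify(update));
```

Rejecting unknown keys locally is a design choice of this sketch; it catches typos before they reach the server.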

STT (providerData.stt)

Inworld extensions to the OpenAI-standard STT config. Every field here is hot-swappable; the STT stream is restarted automatically so the next chunk of audio uses the new value.
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    audio: {
      input: { transcription: { model: 'inworld/inworld-stt-1' } }
    },
    providerData: {
      stt: {
        prompt: 'Medical dictation. Vocabulary: angioplasty.',
        voice_profile: true,
        language_hints: ['en-US', 'es-MX'],
        end_of_turn_confidence_threshold: 0.7,
        min_end_of_turn_silence: 200,
        max_turn_silence: 5000,
        vad_threshold: 0.5
      }
    }
  }
}));
| Field | Type | Description |
| --- | --- | --- |
| prompt | string | Transcription guidance (vocabulary hints, domain context, formatting preferences). Equivalent to audio.input.transcription.prompt. |
| voice_profile | boolean | When true, attach voice-profile signals (age, gender, emotion, vocal style, accent) to transcription events under providerData.voiceProfile. See Voice profile payload below for the returned shape. |
| language_hints | string[] | BCP-47-ish hints to bias recognition without committing to a single language. Soniox-specific (soniox/stt-rt-v4); ignored by other models. |
| end_of_turn_confidence_threshold | number | STT end-of-turn confidence cutoff (0.0–1.0). Explicit override of the semantic_vad.eagerness mapping. |
| vad_threshold | number | Speech/silence VAD cutoff (0.0–1.0). Explicit override of the eagerness mapping. |
| min_end_of_turn_silence | integer | Minimum trailing silence (ms) before STT considers a turn finished. Explicit override of the eagerness mapping. |
| max_turn_silence | integer | Hard ceiling (ms) on within-turn silence before STT force-closes the turn. Explicit override of the eagerness mapping. |
For the eagerness preset that these fields override, see semantic_vad.
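Because the override fields have documented ranges and units, a client may want to check them before sending. The following is a minimal sketch: validateSttOverrides is a hypothetical helper, and treating the two silence fields as non-negative integers is an assumption that the page does not state explicitly.

```javascript
// Hypothetical pre-flight check for providerData.stt overrides.
function validateSttOverrides(stt) {
  const errors = [];
  // Documented as cutoffs in the range 0.0 to 1.0.
  for (const key of ['end_of_turn_confidence_threshold', 'vad_threshold']) {
    const v = stt[key];
    if (v !== undefined && !(typeof v === 'number' && v >= 0 && v <= 1)) {
      errors.push(`${key} must be a number in [0.0, 1.0]`);
    }
  }
  // Documented as integer milliseconds; non-negativity is assumed here.
  for (const key of ['min_end_of_turn_silence', 'max_turn_silence']) {
    const v = stt[key];
    if (v !== undefined && !(Number.isInteger(v) && v >= 0)) {
      errors.push(`${key} must be a non-negative integer (ms)`);
    }
  }
  return errors;
}
```

An empty array means the overrides look well-formed; the server remains the authority on what it accepts.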

Voice profile payload

When providerData.stt.voice_profile is true, every conversation.item.input_audio_transcription.delta and conversation.item.input_audio_transcription.completed event carries a providerData.voiceProfile object alongside the transcript text:
{
  "type": "conversation.item.input_audio_transcription.completed",
  "event_id": "evt_5f7d2",
  "item_id": "item_aud_01HF…",
  "content_index": 0,
  "transcript": "Hello, how are you?",
  "providerData": {
    "voiceProfile": {
      "age":         [{ "label": "adult",          "confidence": 0.78 }],
      "gender":      [{ "label": "female",         "confidence": 0.91 }],
      "emotion":     [{ "label": "neutral",        "confidence": 0.65 }],
      "vocal_style": [{ "label": "conversational", "confidence": 0.82 }],
      "accent":      [{ "label": "en-US",          "confidence": 0.88 }]
    }
  }
}
Each top-level key is an array of { label, confidence } objects sorted by descending confidence. Keys are omitted when the STT backend does not produce labels for that category, so check that a key is present before reading it. Confidence values are in [0.0, 1.0].
| Category | Notes |
| --- | --- |
| age | Estimated age band of the speaker. |
| gender | Estimated gender of the speaker. |
| emotion | Detected emotional tone in the current segment. Can shift across deltas within a single turn. |
| vocal_style | Speaking style (e.g. conversational, narration, whisper, monotone). |
| accent | Regional accent or dialect as a BCP-47-like locale code (e.g. en-US, en-GB). |
Voice profile is computed by the realtime service regardless of the STT backend, so voice_profile: true works across all supported STT models.
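Since categories can be missing and each one is sorted by descending confidence, reading the payload safely might look like the sketch below. topLabel and its minConfidence parameter are hypothetical; only the payload shape comes from this page.

```javascript
// Hypothetical reader: return the most confident label for a category,
// or null when the category is absent or below the caller's threshold.
function topLabel(event, category, minConfidence = 0) {
  const entries = event?.providerData?.voiceProfile?.[category];
  if (!Array.isArray(entries) || entries.length === 0) return null;
  const best = entries[0]; // arrays are sorted by descending confidence
  return best.confidence >= minConfidence ? best.label : null;
}

// Example with the payload shape shown above.
const sampleEvent = {
  type: 'conversation.item.input_audio_transcription.completed',
  transcript: 'Hello, how are you?',
  providerData: {
    voiceProfile: {
      gender: [{ label: 'female', confidence: 0.91 }],
      emotion: [{ label: 'neutral', confidence: 0.65 }],
    },
  },
};
const emotion = topLabel(sampleEvent, 'emotion', 0.7); // null: 0.65 is below 0.7
```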

TTS (providerData.tts)

Controls how the LLM text stream is segmented and forwarded to the TTS backend, the language and delivery preset used for synthesis, (for TTS-2) whether a shared upstream context is preserved across turns, and opt-in timestamp alignment for lip-sync or captions. Available on inworld-tts-1.5-mini and inworld-tts-2.
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    audio: { output: { model: 'inworld-tts-2', voice: 'Olivia' } },
    providerData: {
      tts: {
        segmenter_strategy: 'sentence',
        steering_handling: 'emit_once',
        language: 'en-US',
        delivery_mode: 'CREATIVE',
        conversational: false,
        user_turn_mode: 'both',
        timestamp_type: 'WORD',
        timestamp_transport_strategy: 'SYNC'
      }
    }
  }
}));
| Field | Type | Description |
| --- | --- | --- |
| segmenter_strategy | string | How the LLM token stream is chunked before being forwarded to TTS. One of auto, balanced, sentence, full_turn, fast_start, per_segment_context. Empty string inherits the server default. Hot-swappable. See Segmenter strategies. |
| steering_handling | string | How to handle a leading [steering] tag captured from the LLM turn. repeat_each_chunk re-prepends it to every TTS request (default). emit_once prepends it only to the first request and is recommended for inworld-tts-2. Hot-swappable. |
| language | string | BCP-47 tag (e.g. "en-US", "pt-BR") forwarded to TTS as the synthesis language. Independent from audio.input.transcription.language, so STT and TTS can use different languages. Empty string lets the TTS backend infer. Hot-swappable. |
| delivery_mode | string | TTS-2 generation preset trading off stability vs. expressiveness. One of STABLE, BALANCED, CREATIVE (case-insensitive). Empty or unrecognised values are treated as unspecified. No-op for non-TTS-2 models. Hot-swappable. |
| conversational | boolean | TTS-2 only. When true, opens a single shared upstream TTS context for the entire WebSocket session. Locked at session open; mid-session toggles are ignored. See Conversational TTS. |
| user_turn_mode | string | Conversational-mode only. Which channels of the user turn are forwarded to TTS before each assistant generation. One of both (default), audio_only, text_only, or none. No-op outside conversational mode. Locked at session open. |
| timestamp_type | string | Opt into TTS alignment. WORD returns word-level timing with phoneme/viseme detail; CHARACTER returns per-character timing. Case-insensitive. Empty or unset = no alignment (default). Adds latency, so opt in only when needed. Hot-swappable; send empty string to opt out mid-session. See TTS timestamps and alignment. |
| timestamp_transport_strategy | string | Controls how alignment data is delivered when timestamp_type is set. SYNC returns timestamp_info on the same response.output_audio.delta chunk as the audio; use it for real-time lip-sync or word highlighting. ASYNC sends audio-only deltas first, then trailing deltas with empty delta and populated timestamp_info; this gives lower time-to-first-audio but no strict matching. Empty = backend decides. No-op when timestamp_type is unset. Hot-swappable. |

Conversational TTS

Setting providerData.tts.conversational = true opts TTS-2 into a multi-turn shared context: the upstream TurnContext sees every user and assistant turn for the lifetime of the WebSocket. This lets the model condition its delivery on the audio history of the conversation. The trade-off is a longer-lived state on the TTS backend and slightly higher per-turn cost. In conversational mode, segmenter_strategy is internally locked to full_turn semantics. Per-sentence and per-segment-context strategies are coerced (with a server-side WARN) because they would either fragment the upstream history or open a fresh context per segment, both of which defeat the multi-turn TurnContext.
With conversational: true, TTS conditions each response on the audio of previous turns — higher per-turn cost in exchange for potentially more natural output. Off by default.
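Several TTS fields are silently ignored or coerced depending on the model and on conversational mode. A client-side lint can surface those cases before the update is sent. This is a sketch under assumptions: lintTtsConfig is a hypothetical helper, and it treats only the exact model name inworld-tts-2 as TTS-2.

```javascript
// Hypothetical lint: list TTS settings that this page says will be
// ignored or coerced for the given model and config.
function lintTtsConfig(model, tts) {
  const notes = [];
  const isTts2 = model === 'inworld-tts-2'; // assumption: exact-match model name
  if (tts.conversational && !isTts2) {
    notes.push('conversational is TTS-2 only');
  }
  if (tts.conversational && tts.segmenter_strategy &&
      tts.segmenter_strategy !== 'full_turn') {
    notes.push('segmenter_strategy is locked to full_turn semantics in conversational mode');
  }
  if (tts.user_turn_mode && !tts.conversational) {
    notes.push('user_turn_mode is a no-op outside conversational mode');
  }
  if (tts.delivery_mode && !isTts2) {
    notes.push('delivery_mode is a no-op for non-TTS-2 models');
  }
  if (tts.timestamp_transport_strategy && !tts.timestamp_type) {
    notes.push('timestamp_transport_strategy is a no-op when timestamp_type is unset');
  }
  return notes;
}
```

The notes are advisory; none of these combinations is described as an error on the wire.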

Segmenter strategies

| Strategy | Behaviour |
| --- | --- |
| auto | Default. inworld-tts-2 uses sentence splits; older models use balanced splits. |
| balanced | Punctuation + conjunction splits. Tuned for inworld-tts-1.5. |
| sentence | Hard terminal-punctuation splits only. Tuned for inworld-tts-2. |
| full_turn | Buffer the entire LLM turn and emit it at turn end. Highest quality, highest latency. |
| fast_start | Strict sentence rules for the first emission, then a relaxed config (larger chunks, no idle-flush) for the rest of the turn. Optimizes time-to-first-audio. |
| per_segment_context | Each segment opens a fresh TTS context on the duplex stream. Per-segment handles are serialized so audio order is preserved. |

TTS timestamps and alignment

Setting timestamp_type opts into timing data on response.output_audio.delta events. This is useful for lip-sync animation (viseme blending), word-level highlighting, or karaoke-style captions.

Choosing sync vs async

| Strategy | Behaviour | Use when |
| --- | --- | --- |
| SYNC | timestamp_info arrives on the same response.output_audio.delta as the audio bytes. | Real-time lip-sync or word highlighting: you need the timing before playback of that chunk. |
| ASYNC | Audio-only deltas stream first; alignment arrives in trailing deltas with an empty delta field and populated timestamp_info. | Low-latency playback: you don’t need timing until after the audio has played, or you post-process alignment offline. |

Output shape — response.output_audio.delta

When timestamps are enabled, the response.output_audio.delta event carries an optional timestamp_info field. Exactly one of word_alignment or character_alignment is populated, matching the requested timestamp_type.
{
  "type": "response.output_audio.delta",
  "event_id": "evt_abc123",
  "response_id": "resp_001",
  "item_id": "item_audio_01",
  "output_index": 0,
  "content_index": 0,
  "delta": "<base64 audio or empty for ASYNC trailing>",
  "timestamp_info": {
    "word_alignment": {
      "words": ["Hello", ",", " ", "world"],
      "word_start_time_seconds": [0.0, 0.32, 0.35, 0.38],
      "word_end_time_seconds": [0.32, 0.35, 0.38, 0.72],
      "phonetic_details": [
        {
          "word_index": 0,
          "phones": [
            { "phone_symbol": "HH", "start_time_seconds": 0.0, "duration_seconds": 0.08, "viseme_symbol": "chjsh" },
            { "phone_symbol": "AH", "start_time_seconds": 0.08, "duration_seconds": 0.10, "viseme_symbol": "aei" },
            { "phone_symbol": "L",  "start_time_seconds": 0.18, "duration_seconds": 0.06, "viseme_symbol": "l" },
            { "phone_symbol": "OW", "start_time_seconds": 0.24, "duration_seconds": 0.08, "viseme_symbol": "o" }
          ],
          "is_partial": false
        }
      ]
    }
  }
}
word_alignment (when timestamp_type = "WORD"):
| Field | Type | Description |
| --- | --- | --- |
| words | string[] | Tokens in the original text (words, punctuation, and whitespace), in order. |
| word_start_time_seconds | number[] | Start time of each token, relative to the beginning of the synthesized stream. |
| word_end_time_seconds | number[] | End time of each token. |
| phonetic_details | object[] | Per-word phoneme timing and viseme symbols (TTS 1.5 and TTS-2 only). |
Each entry in phonetic_details:
| Field | Type | Description |
| --- | --- | --- |
| word_index | integer | Index into words[] this detail covers. |
| phones | object[] | Phoneme spans with phone_symbol, start_time_seconds, duration_seconds, and viseme_symbol. |
| is_partial | boolean | true when this is a partial update (SYNC mid-word boundary); false once the word is fully synthesized. |
character_alignment (when timestamp_type = "CHARACTER"):
| Field | Type | Description |
| --- | --- | --- |
| characters | string[] | Individual characters/punctuation, in order. |
| character_start_time_seconds | number[] | Start time of each character. |
| character_end_time_seconds | number[] | End time of each character. |
For the full viseme symbol table and per-language timestamp support, see TTS timestamps.
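The word_alignment arrays are parallel, so a caption or highlighting client typically zips them into one record per token. The sketch below assumes the event shape shown above; wordTimings and isAlignmentOnly are hypothetical helper names, and dropping whitespace-only tokens is a choice of this example.

```javascript
// Hypothetical helper: zip the parallel word_alignment arrays into
// { text, start, end } records, skipping whitespace-only tokens.
function wordTimings(event) {
  const wa = event?.timestamp_info?.word_alignment;
  if (!wa || !Array.isArray(wa.words)) return [];
  return wa.words
    .map((text, i) => ({
      text,
      start: wa.word_start_time_seconds?.[i],
      end: wa.word_end_time_seconds?.[i],
    }))
    .filter((w) => w.text.trim() !== '');
}

// In ASYNC mode, trailing deltas carry alignment but no audio bytes.
function isAlignmentOnly(event) {
  return event.delta === '' && event.timestamp_info != null;
}

const sampleDelta = {
  type: 'response.output_audio.delta',
  delta: '',
  timestamp_info: {
    word_alignment: {
      words: ['Hello', ',', ' ', 'world'],
      word_start_time_seconds: [0.0, 0.32, 0.35, 0.38],
      word_end_time_seconds: [0.32, 0.35, 0.38, 0.72],
    },
  },
};
const timings = wordTimings(sampleDelta); // three records: "Hello", ",", "world"
```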

WebRTC

Over WebRTC, audio travels on the RTP media track (not as base64). Alignment data is delivered on the data channel in the same response.output_audio.delta event shape, but the delta field is always an empty string (the audio is already on the media track).

Memory (providerData.memory)

Automatic conversation memory and summarization. When enabled, the server periodically asks the LLM to extract durable facts and a rolling summary, prepends them to the system prompt, and trims the transcript so context stays bounded.
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    providerData: {
      memory: {
        enabled: true,
        turn_interval: 5,
        max_memory_length: 2000,
        max_transcript_items: 40,
        max_facts: 50,
        trim_after_summarize: true
      }
    }
  }
}));
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | false | Enable automatic memory generation. |
| turn_interval | integer | 5 | Generate memory every N completed turns. |
| max_memory_length | integer | 2000 | Maximum character length for the rolling summary. |
| max_transcript_items | integer | 40 | Maximum conversation items to keep after trimming. |
| max_facts | integer | 50 | Maximum facts retained in state.facts. |
| trim_after_summarize | boolean | true | Remove old transcript items after summarization. |
After each generation cycle the server populates providerData.memory.state (read-only) and emits a session.updated event so clients can observe the rolling summary, fact list, and bookkeeping counters.
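Observing the memory state from the client could look like the sketch below. It rests on assumptions this page does not confirm: that session.updated carries the session object under event.session, and that state.facts is an array. readMemoryState is a hypothetical helper, and no other state field names are assumed.

```javascript
// Hypothetical reader for the read-only memory state.
// Assumes the state is at event.session.providerData.memory.state.
function readMemoryState(event) {
  if (event?.type !== 'session.updated') return null;
  const state = event.session?.providerData?.memory?.state;
  if (!state) return null;
  return {
    factCount: Array.isArray(state.facts) ? state.facts.length : 0,
    state, // pass the raw object through; other field names are not assumed
  };
}
```

Returning null for events without memory state lets the same handler run on every session.updated without extra checks.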

Back-channel (providerData.backchannel)

Short audio interjections — "uh-huh", "right", "I see" — emitted while the user is still speaking. Opt-in per session and gated by server prerequisites; contact your account team to confirm prerequisites for your deployment. For event handling (the response.backchannel.audio.delta / .done / .skipped events), client integration tips, and tuning guidance, see the dedicated Back-channel guide.
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    providerData: {
      backchannel: {
        enabled: true,
        eval_interval_ms: 800,
        min_speech_ms: 800,
        min_gap_ms: 4000,
        max_per_turn: 3,
        hard_deadline_ms: 1500,
        history_tail_items: 4,
        temperature: 0.7,
        max_tokens: 6,
        volume_gain: 0.6,
        require_pause: false,
        decider_kind: 'llm'
      }
    }
  }
}));
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | false | Per-session opt-in. Sessions that don’t send this field never receive back-channels. |
| small_model | string | server default | Override the decider LLM model identifier. Empty string inherits the default. |
| eval_interval_ms | integer | 800 | How often the manager evaluates eligibility while the user is producing partial transcripts. |
| min_speech_ms | integer | 800 | Minimum time after speech onset before any back-channel can fire. |
| min_gap_ms | integer | 4000 | Minimum spacing between two back-channels in the same user turn. |
| max_per_turn | integer | 3 | Cap on back-channels emitted within a single user turn. |
| hard_deadline_ms | integer | 1500 | Combined small-LLM + TTS deadline per attempt. Misses are dropped. |
| history_tail_items | integer | 4 | Recent conversation items the small LLM sees as context. |
| temperature | number | 0.7 | Sampling temperature for the small LLM. |
| max_tokens | integer | 6 | Max tokens for the small LLM’s reply. |
| volume_gain | number | 0.6 | Linear gain multiplier applied to synthesized back-channel audio. 0.0 mutes; 1.0 keeps the synthesized volume; >1.0 amplifies. |
| require_pause | boolean | false | When true, only fire after a smart-turn pause signal (input_audio_buffer.turn_suggestion). |
| allowed_phrases | string[] | server default | Restrict the phrase bank. null / omitted inherits the default; an explicit empty array disables back-channel for the session; a populated array replaces the bank. |
| prompt_template | string | server default | Override the decider prompt. Supports Go text/template tokens {{.PhrasesList}}, {{.History}}, {{.Partial}}. |
| decider_kind | "llm" or "rule" | "llm" | llm uses a small LLM. rule picks phrases from the bank with per-tick probability rule_fire_probability. |
| rule_fire_probability | number | 1.0 | Per-tick fire probability for the rule decider (0.0–1.0). Ignored when decider_kind != "rule". |
Sending providerData.backchannel: {} (empty object) clears all overrides; the server falls back to its compiled-in defaults.
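The three meanings of allowed_phrases are easy to mix up, because an empty array and an omitted field behave very differently. The sketch below spells out the distinction; phraseBankMode and its return labels are hypothetical names used only for illustration.

```javascript
// Hypothetical classifier for the allowed_phrases tri-state.
function phraseBankMode(backchannel) {
  const phrases = backchannel?.allowed_phrases;
  if (phrases === undefined || phrases === null) return 'inherit-default';
  if (!Array.isArray(phrases)) {
    throw new TypeError('allowed_phrases must be an array, null, or omitted');
  }
  // An explicit empty array disables back-channel for the session.
  return phrases.length === 0 ? 'disabled' : 'custom';
}
```

A client that builds the phrase list dynamically should avoid sending an empty array unless it really intends to turn back-channel off.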

Responsiveness (providerData.responsiveness)

Short filler audio ("let me think", "one moment") spoken after the user’s turn ends if the main LLM is slow to produce its first delta. Opt-in per session and gated by two server prerequisites (a small filler model and an Unleash flag); contact your account team to confirm both are in place. For how the filler races the main LLM, TTS pipeline details, and tuning guidance, see the dedicated Responsiveness guide.
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    providerData: {
      responsiveness: {
        enabled: true,
        initial_wait_timeout_ms: 1200,
        hard_deadline_ms: 2000,
        history_tail_items: 4,
        temperature: 0.7,
        max_tokens: 12,
        min_filler_gap_ms: 8000,
        max_initial_per_turn: 1,
        enable_filler_on_first_assistant_reply: false,
        pause_text: ''
      }
    }
  }
}));
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | false | Per-session opt-in. A session that does not send this object never gets a filler. |
| small_model | string | server default | Override the filler LLM model identifier. |
| initial_wait_timeout_ms | integer | server default | T: how long to wait for the main LLM’s first delta before committing to the filler. Lower values fire fillers more aggressively. |
| hard_deadline_ms | integer | server default | Caps the small / filler LLM’s total streaming time so a slow filler model can’t become a latency tax. |
| history_tail_items | integer | server default | Recent conversation items the small LLM sees as context. |
| temperature | number | server default | Sampling temperature for the small LLM. |
| max_tokens | integer | server default | Caps the small LLM’s response length. Keep small; fillers should be brief. |
| min_filler_gap_ms | integer | server default | Minimum gap between any two fillers within a single user-turn chain. |
| max_initial_per_turn | integer | 1 | Caps initial fillers per user turn. |
| max_buffer_deltas | integer | server default | Bounds the in-memory buffer of main-LLM deltas held while the filler is being spoken. |
| enable_filler_on_first_assistant_reply | boolean | false | Allows responsiveness fillers on the very first assistant response in a session. |
| prompt_template | string | server default | Overrides the system prompt fed to the small filler LLM. Append a language directive here for multilingual sessions. |
| pause_text | string | server default | TTS-only hint injected between the filler and the main answer (e.g. a brief connector word). Empty string disables injection. |
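To reason about how these fields interact, the gating described in the table can be written as a small mental model. This is not the server's implementation: wouldFillerFire and the turn object are hypothetical, min_filler_gap_ms and hard_deadline_ms are deliberately left out, and the caller must supply initial_wait_timeout_ms because its server default is not documented here.

```javascript
// Mental model only: would an initial filler be committed for this turn?
// `turn` is a hypothetical description of what happened after the user spoke.
function wouldFillerFire(cfg, turn) {
  if (!cfg.enabled) return false; // per-session opt-in
  if (turn.isFirstAssistantReply && !cfg.enable_filler_on_first_assistant_reply) {
    return false;
  }
  if (turn.fillersThisTurn >= (cfg.max_initial_per_turn ?? 1)) return false;
  // The filler is committed only if the main LLM's first delta
  // has not arrived within T = initial_wait_timeout_ms.
  return turn.firstDeltaLatencyMs > cfg.initial_wait_timeout_ms;
}

const fillerCfg = { enabled: true, initial_wait_timeout_ms: 1200 };
const slowTurn = { firstDeltaLatencyMs: 1500, fillersThisTurn: 0, isFirstAssistantReply: false };
const fires = wouldFillerFire(fillerCfg, slowTurn); // true: 1500 ms exceeds T
```

Lowering initial_wait_timeout_ms in this model makes more turns cross the threshold, which matches the table's note that lower values fire fillers more aggressively.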

Text generation config (text_generation_config)

Fine-grained LLM generation parameters sent as a top-level field on the session object (alongside model, temperature, providerData, etc.). The same object is also accepted under providerData.text_generation_config for compatibility — both paths are merged into the same state.
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    text_generation_config: {
      reasoning: {
        effort: 'HIGH',
        maxTokens: 1024,
        exclude: false
      }
    }
  }
}));

Reasoning

Controls chain-of-thought reasoning on models that support it. The server forwards this as extra_body.reasoning to the LLM Router.
| Field | Type | Description |
| --- | --- | --- |
| effort | string | Reasoning depth. One of NONE, MINIMAL, LOW, MEDIUM, HIGH, XHIGH (case-sensitive on the wire). NONE disables reasoning entirely; higher values allocate more thinking tokens. |
| maxTokens | integer | Cap on reasoning/thinking tokens the model may emit. |
| exclude | boolean | When true, reasoning tokens are generated but excluded from the response text, which is useful for latency-sensitive paths where you still want reasoning to influence quality. |
Parameter support varies by model. Some models do not support reasoning at all, while others support only a subset of effort levels (e.g. gemini-3.1-pro does not support MINIMAL). If the upstream model rejects the requested effort, the LLM Router returns a 400 error.
When reasoning is omitted entirely, the server uses the model’s default reasoning behaviour. For reasoning-capable models this default may not be NONE — meaning reasoning tokens (and their latency) are added implicitly. If you need minimal latency on a reasoning-capable model, explicitly set effort: "NONE" to disable reasoning. Reasoning token usage is reported in response.done under usage.output_token_details.reasoning_tokens.
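Because effort is case-sensitive and an unsupported value leads to a 400 from the LLM Router, it can help to validate it locally and to track reasoning token usage. The sketch below uses hypothetical helpers (reasoningUpdate, reasoningTokens). Whether usage sits at event.response.usage, as in the OpenAI Realtime spec, or directly on the event is an assumption, so the reader checks both.

```javascript
const EFFORTS = ['NONE', 'MINIMAL', 'LOW', 'MEDIUM', 'HIGH', 'XHIGH'];

// Hypothetical builder: reject efforts that are not in the documented list.
function reasoningUpdate(effort, maxTokens) {
  if (!EFFORTS.includes(effort)) { // case-sensitive: 'high' is rejected
    throw new RangeError(`Unsupported reasoning effort: ${effort}`);
  }
  const reasoning = { effort };
  if (maxTokens !== undefined) reasoning.maxTokens = maxTokens;
  return { type: 'session.update', session: { text_generation_config: { reasoning } } };
}

// Hypothetical reader for reasoning token usage on response.done.
function reasoningTokens(doneEvent) {
  const usage = doneEvent?.response?.usage ?? doneEvent?.usage;
  return usage?.output_token_details?.reasoning_tokens ?? 0;
}

// Explicitly disable reasoning for a latency-sensitive session.
const disableReasoning = reasoningUpdate('NONE');
// ws.send(JSON.stringify(disableReasoning));
```

The local check only covers the documented enum; per-model support still has to be confirmed against the model in use.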

Other fields

| Field | Type | Description |
| --- | --- | --- |
| maxNewTokens | integer | Max completion tokens. Equivalent to max_output_tokens on the session. |
| temperature | number | Sampling temperature override (takes precedence over session-level temperature). |
| topP | number | Nucleus sampling. |
| frequencyPenalty | number | Frequency penalty. |
| presencePenalty | number | Presence penalty. |
| repetitionPenalty | number | Repetition penalty (model-specific). |
| stopSequences | string[] | Custom stop sequences. |
| seed | integer | Deterministic sampling seed (model-specific). |
| logitBias | object[] | Per-token likelihood adjustments ({ tokenId, biasValue }). |
All fields are optional and hot-swappable.

Session metadata

Two optional fields sit alongside the five branches at the top of providerData. They don’t configure STT, TTS, or memory — they tag the session so it can be traced, correlated, and routed downstream.
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    providerData: {
      user_id: 'user_abc123',
      metadata: {
        tenant: 'acme-corp',
        experiment: 'voice-preset-A'
      }
    }
  }
}));
| Field | Type | Description |
| --- | --- | --- |
| user_id | string | Stable per-user identifier surfaced in tracing, logs, and downstream service requests. Useful for cross-session memory keying and incident debugging. |
| metadata | object (string → string) | Arbitrary key-value pairs forwarded to the LLM router as extra_body.metadata. Use for downstream-routing hints, customer-side correlation IDs, or A/B-test bucketing. |
Both fields are optional and hot-swappable.
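Since metadata is typed as string → string, application values such as numbers or booleans need to be converted before sending. A minimal sketch follows; toStringMetadata is a hypothetical helper, and skipping null or undefined values is a choice of this example rather than documented behaviour.

```javascript
// Hypothetical helper: coerce arbitrary values into the string → string
// shape that providerData.metadata expects.
function toStringMetadata(input) {
  const out = {};
  for (const [key, value] of Object.entries(input)) {
    if (value === undefined || value === null) continue; // skipped by choice
    out[key] = typeof value === 'string' ? value : JSON.stringify(value);
  }
  return out;
}

const metadata = toStringMetadata({ tenant: 'acme-corp', bucket: 3, beta: true, note: null });
// metadata is { tenant: 'acme-corp', bucket: '3', beta: 'true' }
```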

Hot-swap reference

Most providerData fields take effect on the next audio chunk or turn after the session.update is acknowledged. The exceptions — locked once at session open and ignored afterwards — are:
  • providerData.tts.conversational
  • providerData.tts.user_turn_mode
If you need to change either of these, open a new WebSocket session.
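A client that sends mid-session updates can strip the two locked fields up front, so it notices when a change would be ignored. This is a sketch; stripLockedFields and LOCKED_TTS_FIELDS are hypothetical names, and the locked list is taken from the two bullets above.

```javascript
const LOCKED_TTS_FIELDS = ['conversational', 'user_turn_mode'];

// Hypothetical helper: remove fields the server ignores after session open
// and report which ones were dropped. The input object is not mutated.
function stripLockedFields(providerData) {
  const dropped = [];
  const cleaned = { ...providerData };
  if (providerData.tts) {
    const tts = { ...providerData.tts };
    for (const field of LOCKED_TTS_FIELDS) {
      if (field in tts) {
        delete tts[field];
        dropped.push(`tts.${field}`);
      }
    }
    cleaned.tts = tts;
  }
  return { cleaned, dropped };
}

const result = stripLockedFields({
  tts: { language: 'en-US', conversational: true },
  stt: { vad_threshold: 0.4 },
});
// result.dropped is ['tts.conversational']; a non-empty list signals that
// a new WebSocket session is needed for the change to take effect.
```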
