The realtime demo’s natural cadence comes from three things:
- `inworld-tts-2`: the model that renders bracketed cues (`[laugh]`, `[breathe]`, `[sigh]`) as audible sounds and interprets `[speak ...]` steering tags.
- `providerData.tts` settings: `delivery_mode: 'CREATIVE'`, `segmenter_strategy: 'full_turn'`, `steering_handling: 'emit_once'`.
- A system prompt that tells the LLM to emit those tags.
Sarah.
session.update
Send this event after the WebSocket opens. Replace `SYSTEM_PROMPT` with the prompt in the next section.
- `delivery_mode: 'CREATIVE'` is the most expressive of `STABLE | BALANCED | CREATIVE`.
- `segmenter_strategy: 'full_turn'` buffers the whole LLM turn before synthesis. Highest quality, highest time-to-first-audio. Switch to `fast_start` or `sentence` if latency matters more.
- `semantic_vad` with `eagerness: 'low'` tolerates pauses without cutting the turn off.
- `backchannel` emits “uh-huh” while the user is still talking.
- `responsiveness` emits a short filler if the LLM is slow to reply.
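The settings listed above could be assembled into a `session.update` event roughly as follows. This is a minimal sketch, not the demo's exact payload: the `providerData.tts` keys and the `semantic_vad` / `eagerness` values come from this page, while the nesting of `instructions` and `turn_detection` inside `session`, and the helper names `buildSessionUpdate`, `sendSessionUpdate`, and `SocketLike`, are assumptions. Model selection, voice, `backchannel`, and `responsiveness` are left out because their exact shape is not given here; check the API Extensions reference before relying on any of this.

```typescript
type DeliveryMode = 'STABLE' | 'BALANCED' | 'CREATIVE';
type SegmenterStrategy = 'full_turn' | 'fast_start' | 'sentence';

// Keys and values taken from this page.
interface TtsProviderData {
  delivery_mode: DeliveryMode;
  segmenter_strategy: SegmenterStrategy;
  steering_handling: 'emit_once';
}

interface SessionUpdate {
  type: 'session.update';
  session: {
    instructions: string;
    // Assumed placement: this page names the setting, not where it sits.
    turn_detection: { type: 'semantic_vad'; eagerness: 'low' };
    providerData: { tts: TtsProviderData };
  };
}

// Builds the event with the most expressive settings described above.
function buildSessionUpdate(systemPrompt: string): SessionUpdate {
  return {
    type: 'session.update',
    session: {
      instructions: systemPrompt,
      turn_detection: { type: 'semantic_vad', eagerness: 'low' },
      providerData: {
        tts: {
          delivery_mode: 'CREATIVE',
          segmenter_strategy: 'full_turn',
          steering_handling: 'emit_once',
        },
      },
    },
  };
}

// Minimal socket shape so the sketch does not depend on a WebSocket library.
interface SocketLike {
  send(data: string): void;
}

// Call this once the socket's open event has fired.
function sendSessionUpdate(socket: SocketLike, systemPrompt: string): void {
  socket.send(JSON.stringify(buildSessionUpdate(systemPrompt)));
}
```

With a browser or Node `WebSocket`, this would typically be called from the `open` handler, for example `ws.addEventListener('open', () => sendSessionUpdate(ws, SYSTEM_PROMPT))`.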
System prompt
inworld-tts-2 only renders tags that appear in the text. The LLM has to be told to emit them. Use this as a starter:
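As an illustration of the kind of instruction involved, here is a hypothetical starter prompt together with a small debugging helper. The prompt wording is an assumption written for this sketch and is not the demo's own prompt. `extractTags` is likewise a hypothetical helper: it lists the bracketed tags in an LLM reply so you can confirm the model is emitting them at all.

```typescript
// Hypothetical starter prompt: illustrative wording, not the demo's own.
const SYSTEM_PROMPT: string = [
  'You are a warm, conversational voice assistant.',
  'Write replies to be spoken aloud, in short natural sentences.',
  'Where it fits, add non-verbal cues in square brackets: [laugh], [breathe], [sigh].',
  'To steer delivery, add a [speak ...] tag such as [speak softly].',
  'Use tags sparingly and never explain them.',
].join('\n');

// Returns the contents of every [bracketed] tag in a reply, in order.
function extractTags(reply: string): string[] {
  const pattern = /\[([^\[\]]+)\]/g;
  const tags: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(reply)) !== null) {
    tags.push(match[1]);
  }
  return tags;
}
```

If `extractTags` keeps returning an empty array for the model's replies, the prompt is probably not being applied or needs stronger wording.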
What each piece does
- `delivery_mode: 'CREATIVE'`. `STABLE` and `BALANCED` flatten prosody. `CREATIVE` allows more expressive variation.
- `segmenter_strategy: 'full_turn'`. Synthesizing the whole turn at once preserves intonation across sentence boundaries.
- Prompting to output steering and non-verbals. Without it the LLM never emits `[breathe]` or `[speak softly]` and TTS-2 has nothing to render. This is the biggest single contributor.
- `backchannel` + `responsiveness`. Optional. Fill dead air during user turns and slow LLM warmups respectively.
Tweaks
- Latency-sensitive (mobile, WebRTC). Use `segmenter_strategy: 'fast_start'` or `'sentence'`.
- Voice cloning. Cloning gives you timbre. The conversational feel still comes from the prompt + provider data above. See voice cloning.
- Non-English. Append a language directive to the system prompt. Keep `[speak ...]` and `[laugh]` / `[breathe]` tags in English. Set `providerData.tts.language` for the TTS accent.
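The latency and non-English tweaks above can be expressed as overrides on a base configuration. The following is a sketch under stated assumptions: `VoiceConfig`, `Tweaks`, and `applyTweaks` are hypothetical names, the wording of the language directive is illustrative, and the value format expected by `providerData.tts.language` (shown here as a short code such as `'es'`) is assumed rather than documented on this page.

```typescript
type SegmenterStrategy = 'full_turn' | 'fast_start' | 'sentence';

interface TtsSettings {
  delivery_mode: 'STABLE' | 'BALANCED' | 'CREATIVE';
  segmenter_strategy: SegmenterStrategy;
  steering_handling: 'emit_once';
  language?: string; // maps to providerData.tts.language; value format assumed
}

interface VoiceConfig {
  systemPrompt: string;
  tts: TtsSettings;
}

interface Tweaks {
  lowLatency?: boolean;
  language?: { code: string; name: string };
}

// Returns a new config with the tweaks applied; the base config is not mutated.
function applyTweaks(base: VoiceConfig, tweaks: Tweaks): VoiceConfig {
  const tts: TtsSettings = { ...base.tts };
  let systemPrompt = base.systemPrompt;

  if (tweaks.lowLatency) {
    // Trade some cross-sentence intonation for faster time-to-first-audio.
    tts.segmenter_strategy = 'fast_start';
  }

  if (tweaks.language) {
    tts.language = tweaks.language.code;
    // Reply language changes, but the bracketed tags stay in English.
    systemPrompt +=
      `\nReply in ${tweaks.language.name}. ` +
      'Keep bracketed tags such as [laugh] and [speak ...] in English.';
  }

  return { systemPrompt, tts };
}
```

For example, `applyTweaks(base, { lowLatency: true, language: { code: 'es', name: 'Spanish' } })` would switch the segmenter to `'fast_start'`, set the TTS language, and append the directive to the prompt.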
See also
- API Extensions reference: full `providerData` surface
- Configuring Realtime Models: model selection, voice, semantic VAD
- TTS-2 steering and voice tags
- Back-channel and Responsiveness