Adding Naturalness - Inworld AI Documentation

The realtime demo’s natural cadence comes from three things:

inworld-tts-2 — the model that renders bracketed cues ([laugh], [breathe], [sigh]) as audible sounds and interprets [speak ...] steering tags.
providerData.tts settings: delivery_mode: 'CREATIVE', segmenter_strategy: 'full_turn', steering_handling: 'emit_once'.
A system prompt that tells the LLM to emit those tags.

Voice choice matters less than the three above. The demo uses Sarah.

session.update

Send this after the WebSocket opens. Replace SYSTEM_PROMPT with the prompt in the next section.

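// Configure the session: prompt, TTS model and voice, turn detection, provider options.
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    instructions: SYSTEM_PROMPT, // prompt that tells the LLM to emit tags
    audio: {
      output: {
        model: 'inworld-tts-2', // renders bracketed cues and [speak ...] tags
        voice: 'Sarah'
      },
      input: {
        turn_detection: {
          type: 'semantic_vad',
          eagerness: 'low' // tolerate pauses before ending the user's turn
        }
      }
    },
    providerData: {
      tts: {
        delivery_mode: 'CREATIVE', // most expressive mode
        segmenter_strategy: 'full_turn', // buffer the whole turn before synthesis
        steering_handling: 'emit_once'
      },
      backchannel: { enabled: true }, // "uh-huh" while the user is talking
      responsiveness: { enabled: true } // short filler if the LLM is slow to reply
    }
  }
}));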

Notes:

delivery_mode: 'CREATIVE' is the most expressive of STABLE | BALANCED | CREATIVE.
segmenter_strategy: 'full_turn' buffers the whole LLM turn before synthesis. This gives the highest quality but also the highest time-to-first-audio. Switch to fast_start or sentence if latency matters more.
semantic_vad with eagerness: 'low' tolerates pauses without cutting the turn off.
backchannel emits “uh-huh” while the user is still talking. responsiveness emits a short filler if the LLM is slow to reply.
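
The configuration above can be wrapped in a small builder so the segmenter strategy is easy to swap. This is a minimal sketch, not part of any Inworld SDK: buildSessionUpdate and sendWhenOpen are hypothetical names, the field names are the ones shown on this page, and the list of strategies contains only the three values named here (the API may accept others). It assumes ws exposes the standard WebSocket readyState, send, and addEventListener members.

```javascript
// Hypothetical helpers; only the field names come from this page.
const SEGMENTER_STRATEGIES = ['full_turn', 'fast_start', 'sentence'];

function buildSessionUpdate(systemPrompt, { segmenter = 'full_turn', voice = 'Sarah' } = {}) {
  // Assumption: these three values are the only ones this sketch knows about.
  if (!SEGMENTER_STRATEGIES.includes(segmenter)) {
    throw new Error(`Unknown segmenter_strategy: ${segmenter}`);
  }
  return {
    type: 'session.update',
    session: {
      instructions: systemPrompt,
      audio: {
        output: { model: 'inworld-tts-2', voice },
        input: { turn_detection: { type: 'semantic_vad', eagerness: 'low' } }
      },
      providerData: {
        tts: {
          delivery_mode: 'CREATIVE',
          segmenter_strategy: segmenter,
          steering_handling: 'emit_once'
        },
        backchannel: { enabled: true },
        responsiveness: { enabled: true }
      }
    }
  };
}

// Send immediately if the socket is already open, otherwise wait for 'open'.
function sendWhenOpen(ws, message) {
  const payload = JSON.stringify(message);
  if (ws.readyState === 1) { // 1 is the standard WebSocket OPEN state
    ws.send(payload);
  } else {
    ws.addEventListener('open', () => ws.send(payload), { once: true });
  }
}
```

For example, sendWhenOpen(ws, buildSessionUpdate(SYSTEM_PROMPT, { segmenter: 'fast_start' })) trades some quality for a faster first audio chunk.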

System prompt

inworld-tts-2 only renders tags that appear in the text. The LLM has to be told to emit them. Use this as a starter:

You are a warm, conversational AI on a voice call. Speak the way a person speaks, not the way a chatbot writes.

TURN LENGTH
5 to 10 words by default. A backchannel ("yeah", "mm-hm", "right", "huh") is often the whole turn. Go longer only when the user asks you to explain or walk through something.

NON-VERBALS — six bracketed sounds the voice can produce: [laugh], [breathe], [sigh], [cough], [clear throat], [yawn]. Use where a person would actually make that sound. At most one per turn, often none.

STEERING TAGS — at most ONE [speak ...] tag per turn. If used, it MUST be the first thing in the turn. Use it only when the user's emotional register has shifted, or when they ask for a specific style:
- User excited / shared good news → [speak with bright energy, faster, warmer]
- User frustrated → [speak evenly, slower, lower volume, no defensiveness]
- User vulnerable, paused on something hard → [speak softly, slower, with warmth]
- User asks for a specific voice ("speak like a pirate") → honor it literally and stay in that voice until they drop it
Default is no tag — tone carries through word choice and rhythm. Once you've shifted manner, keep it across turns without re-tagging.

SMALL DISFLUENCIES
- Fillers: "um", "uh", "hmm"
- Soft openers: "oh", "well", "so", "right", "okay"
- Hedges: "kind of", "I guess", "maybe"
- Self-repairs: "I, I think"
- Backchannels: "yeah", "mm-hm", "right"
Zero to two per turn, often none.

Steering tags ([speak ...]), non-verbals ([laugh], [breathe], etc.), and stage modifiers always stay in English even if the conversation is in another language. Only the spoken words switch.

Layer your own persona on top.
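
While iterating on the prompt, it can help to check whether the LLM's turns actually follow the tag rules. The sketch below is illustrative only: checkTurn is a hypothetical helper, not an Inworld API, and it encodes just the rules stated in the starter prompt (at most one [speak ...] tag, placed first, and at most one non-verbal from the six listed). Any other bracketed tag is reported as unrecognized, which may be too strict if your prompt allows additional tags.

```javascript
// Hypothetical checker for the tag rules in the starter prompt.
const NON_VERBALS = ['laugh', 'breathe', 'sigh', 'cough', 'clear throat', 'yawn'];

function checkTurn(text) {
  const problems = [];
  // Collect every bracketed tag together with its position in the turn.
  const tags = [...text.matchAll(/\[([^\]]+)\]/g)].map((m) => ({
    body: m[1].trim(),
    index: m.index
  }));

  const steering = tags.filter((t) => /^speak\b/.test(t.body));
  const nonVerbals = tags.filter((t) => NON_VERBALS.includes(t.body));
  const unknown = tags.filter((t) => !steering.includes(t) && !nonVerbals.includes(t));

  if (steering.length > 1) {
    problems.push('more than one [speak ...] tag');
  }
  // The first steering tag must have nothing but whitespace before it.
  if (steering.length >= 1 && text.slice(0, steering[0].index).trim() !== '') {
    problems.push('[speak ...] tag is not first in the turn');
  }
  if (nonVerbals.length > 1) {
    problems.push('more than one non-verbal');
  }
  for (const t of unknown) {
    problems.push(`unrecognized tag [${t.body}]`);
  }
  return problems;
}
```

An empty array means the turn passed every check. Running this over logged LLM output is one way to see whether the prompt needs tightening.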

What each piece does

delivery_mode: 'CREATIVE'. STABLE and BALANCED flatten prosody. CREATIVE allows more expressive variation.
segmenter_strategy: 'full_turn'. Synthesizing the whole turn at once preserves intonation across sentence boundaries.
Prompting the LLM to emit steering tags and non-verbals. Without it, the LLM never emits [breathe] or [speak softly], and inworld-tts-2 has nothing to render. This is the single biggest contributor.
backchannel + responsiveness. Optional. backchannel fills dead air during user turns; responsiveness covers slow LLM replies.

Tweaks

Latency-sensitive (mobile, WebRTC). Use segmenter_strategy: 'fast_start' or 'sentence'.
Voice cloning. Cloning gives you timbre. The conversational feel still comes from the prompt + provider data above. See voice cloning.
Non-English. Append a language directive to the system prompt. Keep [speak ...] and [laugh] / [breathe] tags in English. Set providerData.tts.language for the TTS accent.
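
The latency and language tweaks can be applied to an existing session.update message as a small transformation. This is a sketch under assumptions: applyTweaks is a hypothetical helper, the format of the providerData.tts.language value is not specified on this page (the 'es' below is a placeholder), and the wording of the language directive is up to you.

```javascript
// Hypothetical helper: returns a tweaked copy of a session.update message.
function applyTweaks(message, { lowLatency = false, language, directive } = {}) {
  // Deep copy so the caller's message is left untouched (the message is plain JSON).
  const next = JSON.parse(JSON.stringify(message));
  const session = next.session;
  session.providerData = session.providerData || {};
  session.providerData.tts = session.providerData.tts || {};

  if (lowLatency) {
    // 'sentence' is the other lower-latency option named on this page.
    session.providerData.tts.segmenter_strategy = 'fast_start';
  }
  if (language) {
    // Assumption: the accepted value format is defined by the API reference.
    session.providerData.tts.language = language;
  }
  if (directive) {
    // Append the language directive to the system prompt.
    session.instructions = `${session.instructions}\n\n${directive}`;
  }
  return next;
}

// Example with a minimal base message and placeholder values.
const base = {
  type: 'session.update',
  session: {
    instructions: 'You are a warm, conversational AI on a voice call.',
    providerData: { tts: { delivery_mode: 'CREATIVE', segmenter_strategy: 'full_turn' } }
  }
};
const tweaked = applyTweaks(base, {
  lowLatency: true,
  language: 'es', // placeholder value
  directive: 'Respond in Spanish. Keep all bracketed tags in English.'
});
```

Copying the message rather than mutating it keeps the original available as the high-quality default.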
