Prompting for TTS-2

When an LLM generates text that is fed into TTS, the default output often sounds flat and unnatural. With inworld-tts-2, you can go beyond formatting the text for speech: instruct the LLM to embed steering tags directly in its output. The result is speech that is not just well formatted but actively directed, with emotion, pacing, volume, and vocal style shaped by the LLM itself. This page covers what is new for inworld-tts-2. The guidance in Prompting for TTS still applies as a best practice, especially when no steering instructions are used.

Steering is fully supported only on inworld-tts-2. On prior models, descriptive steering instructions (e.g. [say with a hint of amusement]) may be spoken aloud verbatim instead of interpreted. For consistent results, use inworld-tts-2 for steering.
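If the same LLM output may be routed to more than one TTS model, one option is to remove instruction tags before sending text to a model other than inworld-tts-2. The sketch below is a minimal illustration, not part of the Inworld API: the function name, the `model_id` parameter, and the older model name in the usage lines are hypothetical. It strips every square-bracketed tag, including non-verbals, so adjust the pattern if your older model handles some tags.

```python
import re

# Matches one square-bracketed instruction tag plus any whitespace after it.
TAG_PATTERN = re.compile(r"\[[^\[\]]+\]\s*")

# Model name taken from this page; any other value is treated as non-steering.
STEERING_MODEL = "inworld-tts-2"


def prepare_text(text: str, model_id: str) -> str:
    """Keep tags for the steering model, otherwise remove them (assumed policy)."""
    if model_id == STEERING_MODEL:
        return text
    return TAG_PATTERN.sub("", text).strip()


sample = "[say with a hint of amusement] Well, that was unexpected."
print(prepare_text(sample, "inworld-tts-2"))     # tags kept
print(prepare_text(sample, "some-older-model"))  # hypothetical model id, tags removed
```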

Instructing the LLM to use steering

The Steering page documents all supported instruction tags across emotion, speed, volume, vocal style, tone, non-verbals, and free-form directions. To make your LLM use them, include a section in your system prompt that explains the tag format and lists the tags relevant to your use case.

Prompt snippet:

Your responses will be spoken aloud using inworld-tts-2, which supports
instruction tags — natural language directions in square brackets placed before
the text they apply to.

Use instruction tags to match your delivery to the content. The following are
suggestions; natural language instructions can be used to describe the
appropriate delivery:
- Emotion: [say excitedly], [sound sad], [sound concerned], [sound terrified]
- Articulation: [say with force], [articulate clearly], [say with deliberate pauses]
- Intonation: [say with a falling pitch], [say with a rising pitch]
- Volume: [very quiet], [very loud]
- Pitch: [say in a low tone], [say in a high pitch]
- Range: [say playfully], [say with no pitch variation]
- Speed: [very fast], [very slow]
- Vocal style: [whisper in a hushed style], [give a nasal quality]
- Non-verbals: [laugh], [sigh], [clear throat], [breathe], [cough], [yawn]

For maximum control, combine qualities from multiple categories in a single
natural language instruction. A bare tag like [sound sad] gives the model one
dimension to work with. A fuller instruction like [say sadly with deliberate
pauses in a low voice and hushed style] layers mood, rhythm, pitch, and mode —
producing a more nuanced and convincing performance.

Place the tag at the start of the text it applies to. A single tag can apply
across multiple sentences; repeat or change tags only when the delivery should
change. Non-verbal tags can also be used inline where they occur. Do not
apply a tag that contradicts the content of the text. Avoid combining opposing
directions in the same tag — for example, [whisper in a hushed style] and
[very loud] together produce unpredictable results.

Before (no instruction tags):

I have great news. Your package has arrived.

After (with instruction tags):

[say excitedly with a high pitch and fast pace] I have great news. Your package has arrived!

For the full list of supported tags and examples, see the Steering page.
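If you maintain several prompts, it may help to generate the steering section from a small table of tags, so each application lists only the categories relevant to it. The following is a minimal sketch under that assumption: `TAG_EXAMPLES`, `INTRO`, and `build_steering_section` are hypothetical names, and the tag strings are copied from the prompt snippet above.

```python
# Tag examples copied from the prompt snippet on this page; trim to your use case.
TAG_EXAMPLES = {
    "Emotion": ["say excitedly", "sound sad", "sound concerned", "sound terrified"],
    "Speed": ["very fast", "very slow"],
    "Volume": ["very quiet", "very loud"],
    "Non-verbals": ["laugh", "sigh", "clear throat", "breathe", "cough", "yawn"],
}

INTRO = (
    "Your responses will be spoken aloud using inworld-tts-2, which supports "
    "instruction tags: natural language directions in square brackets placed "
    "before the text they apply to."
)


def build_steering_section(categories):
    """Assemble a system prompt section listing tags for the chosen categories."""
    lines = [INTRO, "", "Use instruction tags to match your delivery to the content:"]
    for name in categories:
        tags = ", ".join(f"[{tag}]" for tag in TAG_EXAMPLES[name])
        lines.append(f"- {name}: {tags}")
    lines.append("")
    lines.append("Place the tag at the start of the text it applies to.")
    return "\n".join(lines)


print(build_steering_section(["Emotion", "Speed"]))
```

A support agent might pass only `["Emotion", "Speed"]`, while a companion app could also include `"Non-verbals"`.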

Example Prompt Templates

Below are complete, copyable system prompt blocks for common use cases. Each template combines steering with the text formatting guidance from Prompting for TTS.

Companion / Conversational

Use this template for chatbots, AI companions, virtual friends, and other informal conversational applications.

## Speech Output Rules

Your responses will be converted to speech using inworld-tts-2. Follow these
rules to produce natural, expressive, directed spoken output:

### Instruction Tags
- Open with an instruction tag that captures the emotional quality of your
  response; combine mood, pitch, pacing, and manner for best results:
  [say excitedly with a high pitch and fast pace],
  [say sadly with deliberate pauses in a low voice and hushed style],
  [sound concerned with a measured pace and low tone]
- For intimate or private moments, combine volume and manner:
  [quietly with a warm and gentle tone]
- Insert non-verbal tags where organic: [laugh], [sigh], [breathe]
- Place tags at the start of the sentence they apply to

### Emphasis
- Capitalize full words for stress: "I told you NOT to do that"
- Capitalize syllables for nuance: "AbsoLUTEly"
- Use sparingly for maximum effect

### Naturalness
- Include filler words (uh, um, well, like, you know) where a human would naturally pause
- Vary sentence length for natural rhythm
- Use contractions (don't, can't, I'm, we're) instead of formal forms

### Text Formatting
- Write numbers in spoken form: "twenty-three" not "23"
- Write dates in spoken form: "March fifteenth" not "3/15"
- Never use markdown formatting, bullet points, or structured text
- Never use emojis or special characters
- Write everything as natural spoken sentences

Support / Sales

Use this template for customer support agents, sales assistants, and other professional conversational applications.

## Speech Output Rules

Your responses will be converted to speech using inworld-tts-2. Follow these
rules to produce clear, professional, directed spoken output:

### Instruction Tags
- When acknowledging a customer's problem, combine concern with pacing:
  [sound concerned with a measured pace and low tone]
- When delivering sensitive information, combine volume and manner:
  [quietly with a calm and steady tone]
- For time-sensitive alerts, combine speed and manner:
  [speak quickly with a clear and direct manner]
- When combining qualities, keep the tone professional and measured
- Do NOT use non-verbal tags (laugh, sigh, etc.) — maintain professionalism
- Place tags at the start of the sentence they apply to

### Emphasis
- Capitalize key words to draw attention to critical information:
  "Your order will arrive by FRIDAY" or "This offer expires TONIGHT"
- Use sparingly

### Professionalism
- Do NOT use filler words (uh, um, like, you know)
- Maintain a warm but professional tone
- Use contractions naturally (don't, we'll, you're)

### Numbers and Data
- Speak account numbers digit by digit: "one two three four five six"
- Speak prices naturally: "forty-nine ninety-nine"
- Speak dates fully: "January fifteenth, twenty twenty-five"

### Text Formatting
- Never use markdown formatting, bullet points, or structured text
- Never use emojis or special characters
- Write everything as natural spoken sentences
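
The digit-by-digit rule for account numbers can also be enforced in code before the text reaches TTS, as a fallback when the LLM leaves digits in its output. This is a minimal sketch, not an Inworld feature; `spell_digits` is a hypothetical helper, and it assumes that separators such as spaces and dashes should simply be dropped.

```python
# Spoken form of each ASCII digit.
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def spell_digits(number: str) -> str:
    """Spell each digit separately, ignoring any non-digit characters."""
    return " ".join(DIGIT_WORDS[ch] for ch in number if ch in DIGIT_WORDS)


print(spell_digits("123456"))  # one two three four five six
```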

Dev Tools / Technical

Use this template for coding assistants, documentation readers, technical narrators, and developer-facing tools.

## Speech Output Rules

Your responses will be converted to speech using inworld-tts-2. Follow these
rules to produce accurate, well-paced technical speech:

### Instruction Tags
- For urgent alerts, combine speed and manner:
  [very fast with a sharp and urgent tone]
- For critical steps, combine pace and articulation:
  [very slow with deliberate pauses and clear articulation]
- When flagging errors or risks, combine concern with pacing:
  [sound concerned with a measured pace and low tone]
- Do NOT use non-verbal tags — maintain a focused, technical delivery
- Place tags at the start of the sentence they apply to

### Emphasis
- Capitalize key technical terms or required actions: "you MUST run this as root"

### Technical Accuracy
- Speak URLs by component: "github dot com slash inworld dash AI"
- Speak code identifiers in plain English: "the getUserName function"
- Speak version numbers naturally: "version three point two"

### Pacing
- Use measured, even pacing. Avoid rushing through technical content.
- Use periods to separate distinct steps or key terms
- Do NOT use filler words (uh, um, like, you know)

### Text Formatting
- Write all numbers in spoken form: "forty-two" not "42"
- Never use markdown formatting, bullet points, or code blocks
- Write everything as natural spoken sentences
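
The Text Formatting rules shared by these templates lend themselves to an automated check on LLM output during testing. The sketch below is an assumption about how such a check might look, not an official validator: `CHECKS` and `find_formatting_issues` are hypothetical names, and the patterns cover only digits and a few common markdown markers.

```python
import re

# Each pattern flags one kind of formatting the templates ask the LLM to avoid.
CHECKS = {
    "digits": re.compile(r"\d"),
    # Asterisks, hashes, backticks, or a line starting with a bullet marker.
    "markdown": re.compile(r"[*#`]|^\s*[-+]\s", re.MULTILINE),
}


def find_formatting_issues(text: str) -> list:
    """Return the names of every check that the text fails."""
    return [name for name, pattern in CHECKS.items() if pattern.search(text)]


print(find_formatting_issues("[very slow] Run version 3 of the tool."))
print(find_formatting_issues("[very slow] Run version three of the tool."))
```

The first call reports the digit, and the second returns an empty list. Emoji detection is left out because a reliable pattern depends on which characters your application considers acceptable.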

Tips for Iterating

Test with the TTS Playground: Use the TTS Playground to hear how your LLM output sounds when synthesized. Paste in sample outputs with instruction tags and iterate until the speech quality meets your needs.
Check for tag/content mismatches: The LLM should not apply an instruction tag that contradicts the content. A [sound sad] tag on celebratory text will produce degraded output. Review LLM outputs for mismatches during testing.
Avoid conflicting instructions: Instruct the LLM not to combine opposing directions in the same tag. Pairing [whisper in a hushed style] with [very loud] produces unpredictable results. One clear instruction per tag is the rule.
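
Part of this review can be scripted. The sketch below extracts the instruction tags from an LLM output and flags any single tag that mixes opposing directions. It is a rough heuristic under stated assumptions: the `OPPOSING` word lists are hypothetical and far from exhaustive, and the check cannot judge whether a tag suits the content, which still needs a human listener.

```python
import re

# Captures the text inside each square-bracketed instruction tag.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")

# Hypothetical, non-exhaustive pairs of opposing directions.
OPPOSING = [
    ({"whisper", "hushed", "quiet", "quietly"}, {"loud", "loudly", "shout"}),
    ({"fast", "quickly"}, {"slow", "slowly"}),
    ({"high"}, {"low"}),
]


def extract_tags(text: str) -> list:
    """Return the contents of every instruction tag, in order of appearance."""
    return [tag.strip() for tag in TAG_PATTERN.findall(text)]


def find_conflicts(text: str) -> list:
    """Return tags that contain words from both sides of an opposing pair."""
    conflicts = []
    for tag in extract_tags(text):
        words = set(re.findall(r"[a-z]+", tag.lower()))
        if any(words & left and words & right for left, right in OPPOSING):
            conflicts.append(tag)
    return conflicts


print(find_conflicts("[whisper in a hushed style and very loud] Hello there."))
print(find_conflicts("[say excitedly with a high pitch and fast pace] I have great news."))
```

Running a check like this over a batch of sample outputs gives a quick list of tags to listen to first in the TTS Playground.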

Next Steps

Steering

Full reference for all instruction tags, free-form instructions, non-verbals, and best practices.

Pause Controls

Add precise pauses to your speech with SSML break tags.