When LLM-generated text is fed into TTS, the default output often sounds flat and unnatural. Formatting the text for speech helps, and with inworld-tts-2 you can go further: instruct the LLM to embed steering tags directly in its output. The result is speech that isn’t just well-formatted but actively directed, with emotion, pacing, volume, and vocal style shaped by the LLM itself. This page covers what is new for inworld-tts-2. The guidance in Prompting for TTS still generally applies as a best practice, especially when no steering instructions are used.
Steering is available exclusively on inworld-tts-2 and does not apply to prior models.

Instructing the LLM to use steering

The Steering page documents all supported instruction tags across emotion, speed, volume, vocal style, tone, non-verbals, and free-form directions. To make your LLM use them, include a section in your system prompt that explains the tag format and lists the tags relevant to your use case.

Prompt snippet:
Your responses will be spoken aloud using inworld-tts-2, which supports
instruction tags — natural language directions in square brackets placed before
the text they apply to.

Use instruction tags to match your delivery to the content. The following are
suggestions; natural language instructions can be used to describe the
appropriate delivery:
- Emotion: [say excitedly], [sound sad], [sound concerned], [sound terrified]
- Articulation: [say with force], [articulate clearly], [say with deliberate pauses]
- Intonation: [say with a falling pitch], [say with a rising pitch]
- Volume: [very quiet], [very loud]
- Pitch: [say in a low tone], [say in a high pitch]
- Range: [say playfully], [say with no pitch variation]
- Speed: [very fast], [very slow]
- Vocal style: [whisper in a hushed style], [give a nasal quality]
- Non-verbals: [laugh], [sigh], [clear throat], [breathe], [cough], [yawn]

For maximum control, combine qualities from multiple categories in a single
natural language instruction. A bare tag like [sound sad] gives the model one
dimension to work with. A fuller instruction like [say sadly with deliberate
pauses in a low voice and hushed style] layers mood, rhythm, pitch, and mode —
producing a more nuanced and convincing performance.

Place the tag at the start of the text it applies to. A single tag can apply
across multiple sentences; repeat or change tags only when the delivery should
change. Non-verbal tags can also be used inline where they occur. Do not
apply a tag that contradicts the content of the text. Avoid combining opposing
directions in the same tag — for example, [whisper in a hushed style] and
[very loud] together produce unpredictable results.
Before (no instruction tags):
I have great news. Your package has arrived.
After (with instruction tags):
[say excitedly with a high pitch and fast pace] I have great news. Your package has arrived!
For the full list of supported tags and examples, see the Steering page.
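
One way to keep the prompt section limited to the tags relevant to your use case is to assemble it from a small table of categories. The sketch below is illustrative only: `TAG_CATEGORIES` and `build_steering_section` are hypothetical names, the tag examples are copied from the prompt snippet above, and nothing here calls an Inworld API.

```python
# Hypothetical helper: builds the steering section of a system prompt.
# Tag examples are copied from the prompt snippet; adjust them to your use case.
TAG_CATEGORIES = {
    "Emotion": ["say excitedly", "sound sad", "sound concerned", "sound terrified"],
    "Volume": ["very quiet", "very loud"],
    "Speed": ["very fast", "very slow"],
    "Non-verbals": ["laugh", "sigh", "clear throat", "breathe", "cough", "yawn"],
}


def build_steering_section(categories):
    """Return prompt text that lists only the requested tag categories."""
    lines = [
        "Your responses will be spoken aloud using inworld-tts-2, which supports",
        "instruction tags: natural language directions in square brackets placed",
        "before the text they apply to.",
        "",
        "Suggested tags:",
    ]
    for name in categories:
        tags = ", ".join(f"[{tag}]" for tag in TAG_CATEGORIES[name])
        lines.append(f"- {name}: {tags}")
    return "\n".join(lines)


# Example: a companion app that only needs emotion and non-verbal tags.
section = build_steering_section(["Emotion", "Non-verbals"])
print(section)
```

The returned string can be appended to an existing system prompt. Listing fewer categories keeps the prompt short and nudges the LLM toward the deliveries you actually want.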

Example Prompt Templates

Below are complete, copyable system prompt blocks for common use cases. Each template combines steering with the text formatting guidance from Prompting for TTS.
Use this template for chatbots, AI companions, virtual friends, and other informal conversational applications.
## Speech Output Rules

Your responses will be converted to speech using inworld-tts-2. Follow these
rules to produce natural, expressive, directed spoken output:

### Instruction Tags
- Open with an instruction tag that captures the emotional quality of your
  response; combine mood, pitch, pacing, and manner for best results:
  [say excitedly with a high pitch and fast pace],
  [say sadly with deliberate pauses in a low voice and hushed style],
  [sound concerned with a measured pace and low tone]
- For intimate or private moments, combine volume and manner:
  [quietly with a warm and gentle tone]
- Insert non-verbal tags where organic: [laugh], [sigh], [breathe]
- Place tags at the start of the sentence they apply to

### Emphasis
- Capitalize full words for stress: "I told you NOT to do that"
- Capitalize syllables for nuance: "AbsoLUTEly"
- Use sparingly for maximum effect

### Naturalness
- Include filler words (uh, um, well, like, you know) where a human would naturally pause
- Vary sentence length for natural rhythm
- Use contractions (don't, can't, I'm, we're) instead of formal forms

### Text Formatting
- Write numbers in spoken form: "twenty-three" not "23"
- Write dates in spoken form: "March fifteenth" not "3/15"
- Never use markdown formatting, bullet points, or structured text
- Never use emojis or special characters
- Write everything as natural spoken sentences
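
Outside the template itself, it can help to check LLM outputs against the text formatting rules before sending them to synthesis, since models do not always follow instructions. The following is a rough sketch and not part of any Inworld SDK: `lint_for_speech` is a hypothetical name, the regular expressions are simple heuristics, and the code assumes instruction tags are the only legitimate use of square brackets.

```python
import re

# Matches one bracketed instruction tag such as [sigh] or [say excitedly].
TAG_RE = re.compile(r"\[[^\[\]]*\]")


def lint_for_speech(text):
    """Return a list of warnings for text that breaks the formatting rules."""
    # Remove instruction tags so their contents are not linted.
    spoken = TAG_RE.sub("", text)
    warnings = []
    if re.search(r"\d", spoken):
        warnings.append("digits found: write numbers and dates in spoken form")
    # Bold/underline markers, backticks, headings, or list bullets at line start.
    if re.search(r"(\*\*|__|`|^\s*#+\s|^\s*[-*•]\s)", spoken, flags=re.MULTILINE):
        warnings.append("markdown or list formatting found")
    # Heuristic emoji check covering common emoji code point ranges only.
    if re.search("[\U0001F300-\U0001FAFF\u2600-\u27BF]", spoken):
        warnings.append("emoji found")
    return warnings


print(lint_for_speech("[say excitedly] You have 3 new messages!"))
print(lint_for_speech("[sigh] Well, I'm not sure about that."))
```

A non-empty result could trigger a retry or a cleanup pass; the checks are deliberately loose and will miss some cases, such as numbers written as ordinals with suffixes in unusual scripts.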

Tips for Iterating

  • Test with the TTS Playground: Use the TTS Playground to hear how your LLM output sounds when synthesized. Paste in sample outputs with instruction tags and iterate until the speech quality meets your needs.
  • Check for tag/content mismatches: The LLM should not apply an instruction tag that contradicts the content. A [sound sad] tag on celebratory text will produce degraded output. Review LLM outputs for mismatches during testing.
  • Avoid conflicting instructions: Instruct the LLM not to combine opposing directions in the same tag. Pairing [whisper in a hushed style] with [very loud] produces unpredictable results. Each tag should carry one clear, internally consistent instruction.
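
The conflicting-instructions check can be partly automated during testing. This is a minimal sketch, assuming a hand-maintained list of opposing terms: `OPPOSING_TERMS` and `find_conflicting_tags` are hypothetical names, and the pairs are illustrative, not an official list. Tag/content mismatches still need human review or listening tests.

```python
import re

# Captures the text inside each bracketed instruction tag.
TAG_RE = re.compile(r"\[([^\[\]]+)\]")

# Hypothetical, hand-maintained pairs of opposing directions.
OPPOSING_TERMS = [
    ("whisper", "loud"),
    ("quiet", "loud"),
    ("fast", "slow"),
    ("high", "low"),
]


def find_conflicting_tags(text):
    """Return (tag, term_a, term_b) for each tag that contains opposing terms."""
    conflicts = []
    for tag in TAG_RE.findall(text):
        # Compare whole lowercase words so "slow" does not match "slowly".
        words = set(re.findall(r"[a-z]+", tag.lower()))
        for term_a, term_b in OPPOSING_TERMS:
            if term_a in words and term_b in words:
                conflicts.append((tag, term_a, term_b))
    return conflicts


sample = "[whisper in a hushed style and very loud] Your package has arrived."
print(find_conflicting_tags(sample))
```

Because matching is on whole words, variants such as "quietly" or "loudly" are not caught; extend the term list or add stemming if your LLM tends to produce them.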

Next Steps

Steering

Full reference for all instruction tags, free-form instructions, non-verbals, and best practices.

Pause Controls

Add precise pauses to your speech with SSML break tags.

Prompting for TTS

Prompt engineering techniques that apply to all Inworld TTS models.