Exclusive to inworld-tts-2, Steering is a powerful new capability that brings realistic speech to life. Truly convincing audio depends not just on the words spoken, but on how they are delivered. Flat, mechanical voices break immersion and signal to listeners that they are interacting with a machine. Steering addresses this by letting you provide natural-language instructions that control how a voice performs, covering articulation, pitch, speed, volume, and more.
Think of it as giving direction to a voice actor. Wrap your instructions in square brackets and place them before the input text. No markup languages or numeric parameters required.
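As a minimal sketch of the format described above, the helper below prepends a bracketed instruction to the input text. The name `steer` is hypothetical and is not part of any Inworld SDK; it only builds the string you would pass as the text input.

```python
def steer(instruction: str, text: str) -> str:
    """Prefix `text` with a bracketed steering instruction.

    Hypothetical helper for illustration only, not an Inworld API.
    """
    # Accept the instruction with or without its surrounding brackets.
    instruction = instruction.strip().strip("[]").strip()
    if not instruction:
        # No instruction given: return the text unchanged.
        return text.strip()
    return f"[{instruction}] {text.strip()}"


print(steer("very quiet", "Don't make a sound."))
# prints: [very quiet] Don't make a sound.
```

Keeping the bracket handling in one place makes it harder to send a half-closed tag by accident.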
Free-form turn instructions
Describe the full character of a delivery in natural language, like a director coaching an actor before a take. A single instruction can capture emotion, energy, pacing, and intent all at once. The more fully you describe the delivery, layering mood, physical manner, and intent, the more precisely the voice will perform.
[speak as if barely holding back rage — forcing every word through gritted teeth] I have told you. Repeatedly. And you STILL didn’t listen.
[overwhelmed with excitement and barely able to contain yourself] We just hit a million users. I still can’t believe it — we actually did it!
[slow and hushed with every word weighted by grief] I got the call this morning. He’s gone.
Metadata-based instructions
Single-property instructions that target one aspect of delivery at a time. The examples below are starting points; feel free to experiment with your own natural language phrasing.
- Articulation: How words are shaped and delivered. Add force, crispness, or deliberate rhythm.
  Examples: [say with force] [articulate clearly] [say with deliberate pauses]
  [articulate clearly] Each step must be followed in order. Do not skip ahead.
- Intonation: How pitch moves through a phrase, whether it lands decisively or stays open-ended.
  Examples: [say with a falling pitch] [say with a rising pitch]
  [say with a falling pitch] That’s my final answer. I’m not changing my mind.
- Volume: The overall amplitude of the voice, from barely audible to a full, room-filling projection.
  Examples: [very loud] [very quiet]
  [very quiet] Don’t make a sound. There’s someone right outside the door.
- Pitch: The baseline register of the voice, lower for gravity and weight or higher for energy and presence.
  Examples: [say in a low tone] [say in a high pitch]
  [say in a high pitch] We just got the green light, the product launches tomorrow!
- Range: How much pitch varies across the utterance, from flat for monotony to expressive for warmth or playfulness.
  Examples: [say playfully] [say with no pitch variation]
  [say playfully] So anyway, I was telling her about the trip, and she just laughed the whole time.
- Speed: The pace of delivery, faster for urgency and tension or slower for weight and clarity.
  Examples: [very fast] [very slow]
  [very fast] Run, they’re right behind us, don’t stop, keep moving!
- Vocal style: Changes how the voice itself sounds, shifting from normal speech into a different mode such as whispering or singing.
  Examples: [sing joyfully] [whisper in a hushed style] [give a nasal quality]
  [sing joyfully] The sun is rising and the world feels new, everything I dreamed is finally coming true.
Non-verbals
Insert organic, human sounds at any point in the text to add realism. Supported tags: [laugh] [breathe] [clear throat] [sigh] [cough] [yawn]
[clear throat] If I could have everyone’s attention, please.
I told him what happened, and he just [laugh] couldn’t believe it!
Emphasis
Capitalize letters within your input text to draw attention to specific words or syllables. Fully capitalizing a word stresses the entire word, while capitalizing individual letters within a word emphasizes a specific syllable.
I told you NOT to open that door.
Are you seriously asking if I want pizza? AbsoLUTEly I do.
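The capitalization rule can be applied programmatically when input text is generated. The sketch below is one possible approach; `stress_word` and `stress_syllable` are hypothetical helper names, and choosing which characters make up a syllable is left to the caller.

```python
import re


def stress_word(text: str, word: str) -> str:
    """Uppercase every whole-word occurrence of `word` (case-insensitive)."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", flags=re.IGNORECASE)
    return pattern.sub(lambda m: m.group(0).upper(), text)


def stress_syllable(word: str, start: int, end: int) -> str:
    """Uppercase the characters word[start:end]; the rest stays unchanged."""
    return word[:start] + word[start:end].upper() + word[end:]


print(stress_word("I told you not to open that door.", "not"))
# prints: I told you NOT to open that door.
print(stress_syllable("Absolutely", 4, 8))
# prints: AbsoLUTEly
```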
Best practices
Use free-form descriptive instructions for maximum control. The more you describe how you want the voice to perform, the better the output. A bare tag like [sad] gives the model one dimension to work with. A fuller instruction like [say sadly with deliberate pauses in a low voice and hushed style] combines mood, rhythm, pitch, and mode, producing a more nuanced and convincing performance.
Avoid conflicting instructions. Combining opposing directions, for example [whisper in a hushed style] and [very loud] in the same tag, produces unpredictable results. Use one clear instruction per tag.
Match the instruction to the text. The content being spoken should be consistent with the delivery style. A mismatch like [say sadly] applied to "This is the happiest day of my life, I just landed my dream job and fell in love!" sends contradictory signals and may degrade output quality.
Use one set of instructions per input. Tags that direct delivery, such as articulation, intonation, volume, pitch, range, speed, vocal style, or free-form performance instructions, should appear once at the start. Placing them midway through the text or using multiple instructions throughout will likely produce inconsistent results. Non-verbal tags like [laugh] or [sigh] are the exception and can be inserted inline where the sound should occur.
Avoid: [say in a low tone] I can't believe this happened. [say in a high pitch] Things are looking up though!
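The placement rule can be checked before an input is sent. The sketch below is a hypothetical lint helper, not an Inworld tool. It assumes that any bracketed tag outside the supported non-verbal list is a delivery instruction, and it reports every such tag that does not sit at the very start of the input.

```python
import re

# Non-verbal tags from the Non-verbals section; these may appear inline.
NON_VERBALS = {"laugh", "breathe", "clear throat", "sigh", "cough", "yawn"}
TAG = re.compile(r"\[([^\[\]]+)\]")


def misplaced_instructions(text: str) -> list[str]:
    """Return delivery instructions that appear after the start of the input."""
    text = text.lstrip()
    misplaced = []
    for match in TAG.finditer(text):
        tag = match.group(1).strip().lower()
        if tag in NON_VERBALS:
            continue  # inline non-verbals are allowed anywhere
        if match.start() != 0:
            # A delivery instruction that is not the leading tag.
            misplaced.append(match.group(0))
    return misplaced


bad = ("[say in a low tone] I can't believe this happened. "
       "[say in a high pitch] Things are looking up though!")
print(misplaced_instructions(bad))
# prints: ['[say in a high pitch]']
```

An empty list means the input has at most one leading delivery instruction, with any remaining tags being non-verbals.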
Model compatibility. Steering is supported exclusively on inworld-tts-2. It has no effect when used with other models.
Use pause controls for longer pauses. If a line needs an extended pause for emphasis, add it with pause controls.