This guide covers techniques and best practices for generating high-quality, natural-sounding speech for your applications.
If you’re using an LLM to generate text for TTS, see our dedicated guide on Prompting for TTS for prompt templates and techniques.
General Best Practices
- Pick a suitable voice - Different voices will be better suited for different applications. Choose a voice that matches the emotional range and expression you’re looking for. For example, for a meditation app, select a more steady and calm voice. For an encouraging fitness coach, select a more expressive and excited voice.
- Pay attention to punctuation - Punctuation matters! Use exclamation points (!) to make the voice more emphatic and excited. Use periods to insert natural pauses. Where possible, make sure to include punctuation at the end of the sentence.
- Capitalize for emphasis - You can emphasize specific words by capitalizing them. For example, writing “We NEED a real vacation” will cause the voice to stress the word “need” when speaking, whereas “We need a REAL vacation” will emphasize the word “real”. This can help clarify tone or intent in nuanced dialogue.
- Specify the language & localize the voice - Set the `language` field for the most consistent results when generating cross-lingual audio. For the most consistent, native-sounding speech, localize the voice for your target language.
- Normalize complex text - If you find that the model is mispronouncing certain complex phrases like phone numbers or dollar amounts, it can help to normalize the text. This may be particularly helpful for non-English languages. Some examples of normalization include:
  - Phone numbers: “(123)456-7891” -> “one two three, four five six, seven eight nine one”
  - Dates: “5/6/2025” -> “may sixth twenty twenty five” (helpful since date formats may vary)
  - Times: “12:55 PM” -> “twelve fifty-five PM”
  - Emails: “test@example.com” -> “test at example dot com”
  - Monetary values: “$5,342.29” -> “five thousand three hundred and forty two dollars and twenty nine cents”
  - Symbols: “2+2=4” -> “two plus two equals four”
- Tune the Delivery Mode - For Realtime TTS-2, use the `deliveryMode` field to control the trade-off between stability and variability. `STABLE` produces more consistent, predictable output (best when the output must closely match the input). `CREATIVE` produces more varied speech with greater emotional range (useful for creative use cases like barks or demo clips). `BALANCED` (the default) sits in between.
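The normalization examples in the list above can be sketched in a few lines of Python. This is a minimal illustration, not part of the Inworld API: the function names are hypothetical, and the patterns only cover the US-style phone format and simple email addresses shown in the examples. Dates, times, and monetary values need locale-aware handling and are left out.

```python
import re

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

# Matches phone numbers written like "(123)456-7891" (assumed format).
PHONE_RE = re.compile(r"\((\d{3})\)\s*(\d{3})-(\d{4})")
# Matches simple email addresses like "test@example.com".
EMAIL_RE = re.compile(r"\b([\w.+-]+)@([\w-]+(?:\.[\w-]+)+)\b")


def spell_digits(digits: str) -> str:
    """Spell a run of digits one word per digit."""
    return " ".join(DIGIT_WORDS[int(d)] for d in digits)


def _phone_to_words(match: re.Match) -> str:
    # Commas between the digit groups give the voice short pauses.
    return ", ".join(spell_digits(group) for group in match.groups())


def _email_to_words(match: re.Match) -> str:
    local = match.group(1).replace(".", " dot ")
    domain = match.group(2).replace(".", " dot ")
    return f"{local} at {domain}"


def normalize_for_tts(text: str) -> str:
    """Rewrite phone numbers and emails into their spoken form."""
    text = PHONE_RE.sub(_phone_to_words, text)
    text = EMAIL_RE.sub(_email_to_words, text)
    return text


print(normalize_for_tts("Call (123)456-7891 or write to test@example.com"))
```

Running the normalizer before the text reaches the TTS request keeps the spoken form under your control instead of leaving it to the model.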
Latency
For realtime use cases, minimizing latency is critical. Here are some tips and techniques you can use:
- Stream TTS output - Instead of waiting for the entire generation (which may take some time if it is long), you can start playback as soon as the first chunk arrives so that the user doesn’t have to wait. Inworld’s websocket streaming should be the lowest-latency option, but streaming over HTTP will also be superior to a non-streaming setup.
- Chunk TTS input - Instead of sending a large request to the TTS model (whether it’s pre-written or generated by an LLM), consider breaking it into sentence chunks and sending them one by one. The Inworld Agent Runtime provides built-in tools to handle this in a performant manner. For synthesizing text longer than 2,000 characters, see our ready-to-run scripts in the Long Text Input guide.
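As a rough illustration of input chunking, the sketch below splits text at sentence boundaries and packs the sentences into chunks under a character budget. The helper names are hypothetical, the sentence splitter is deliberately naive (it will mis-split abbreviations such as "Dr."), and the 2,000-character default simply mirrors the threshold mentioned above. It is not the Inworld Agent Runtime's implementation.

```python
import re

MAX_CHARS = 2000  # threshold mentioned in this guide for long text input


def split_sentences(text: str) -> list[str]:
    """Naively split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def chunk_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than
    being cut mid-sentence.
    """
    chunks: list[str] = []
    current = ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


for chunk in chunk_text("Hello there. How are you? Great!", max_chars=15):
    print(chunk)  # each chunk would be sent as its own TTS request
```

A small `max_chars` yields one sentence per request, which lets playback start sooner. A larger budget reduces the number of requests at the cost of a longer wait for the first audio.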
Advanced Tips
Natural, Conversational Speech
Natural human conversation is not perfect. It’s full of filler words, pauses, and other natural speech patterns that make it sound more human. Our TTS models are trained to generate the requested text as is, in order to produce the most accurate and consistent output that can be used for a wide range of applications. After all, not all applications want to have a bunch of filler words inserted into the speech! To generate natural, conversational speech, you can use the following techniques:
- Insert filler words like `uh`, `um`, `well`, `like`, and `you know` in the text. For example, instead of:
  change it to:
  If the text is already being generated using an LLM, you can add instructions in the prompt to insert filler words in the response. Alternatively, you can use a small LLM to insert filler words given a piece of text.
- Use pause controls to add pauses with SSML break tags like `<break time="1s" />`. These can help control pacing and make the speech sound more natural.
- Use steering in Realtime TTS-2 to direct delivery in natural language (e.g., `[speak conversationally with a relaxed pace]`) and to insert non-verbals like `[laugh]`, `[sigh]`, or `[breathe]` inline where they would naturally occur.
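The pause and steering techniques above amount to plain string construction, which the sketch below illustrates. The helper names are hypothetical. Placing the steering instruction at the start of the text is an assumption made for illustration, and this guide only shows a whole-second break value, so check the API reference before relying on other durations.

```python
def break_tag(seconds: float) -> str:
    """Build an SSML break tag such as <break time="1s" />."""
    return f'<break time="{seconds:g}s" />'


def join_with_pauses(sentences: list[str], seconds: float = 1) -> str:
    """Join sentences with an explicit pause between each pair."""
    return f" {break_tag(seconds)} ".join(sentences)


def with_steering(text: str, instruction: str) -> str:
    """Prefix text with a bracketed steering instruction.

    Assumption: the instruction goes before the text it applies to.
    """
    return f"[{instruction}] {text}"


paused = join_with_pauses(["Well, let me think.", "Okay, I have an idea."])
print(with_steering(paused, "speak conversationally with a relaxed pace"))
```

Building the markup in small helpers like these keeps the tag syntax in one place, which makes it easier to adjust pacing across many lines of dialogue.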