Generating speech

This guide covers techniques and best practices for generating high-quality, natural-sounding speech for your applications.

If you’re using an LLM to generate text for TTS, see our dedicated guide on Prompting for TTS for prompt templates and techniques.

General Best Practices

Pick a suitable voice - Different voices will be better suited for different applications. Choose a voice that matches the emotional range and expression you’re looking for. For example, for a meditation app, select a more steady and calm voice. For an encouraging fitness coach, select a more expressive and excited voice.
Pay attention to punctuation - Punctuation matters! Use exclamation points (!) to make the voice more emphatic and excited. Use periods to insert natural pauses. Where possible, end every sentence with punctuation.
Capitalize for emphasis - You can emphasize specific words by capitalizing them. For example, writing “We NEED a real vacation” will cause the voice to stress the word “need” when speaking, whereas “We need a REAL vacation” will emphasize the word “real”. This can help clarify tone or intent in nuanced dialogue.
Specify the language & localize the voice - Set the language field for the most consistent results when generating cross-lingual audio. For the most consistent, native-sounding speech, localize the voice for your target language.
Normalize complex text - If the model mispronounces complex phrases such as phone numbers or dollar amounts, normalizing the text can help. Inworld can handle this automatically with server-side text normalization: just enable the applyTextNormalization field. It is supported for a subset of languages; for others (e.g., Arabic, Hebrew, Korean, Russian), normalize the text yourself before sending it. Some examples of normalization include:
- Phone numbers: “(123)456-7891” -> “one two three, four five six, seven eight nine one”
- Dates: “5/6/2025” -> “may sixth twenty twenty five” (helpful since date formats may vary)
- Times: “12:55 PM” -> “twelve fifty-five PM”
- Emails: “test@example.com” -> “test at example dot com”
- Monetary values: “$5,342.29” -> “five thousand three hundred and forty two dollars and twenty nine cents”
- Symbols: “2+2=4” -> “two plus two equals four”
Tune the Delivery Mode - For Realtime TTS-2, use the deliveryMode field to control the trade-off between stability and variability. STABLE produces more consistent, predictable output (best when the output must closely match the input). CREATIVE produces more varied speech with greater emotional range (useful for creative use cases like barks or demo clips). BALANCED (the default) sits in between.
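
For languages where server-side normalization is not available, the client-side normalization described under "Normalize complex text" could look something like the sketch below. It is a minimal illustration covering only two of the listed cases (phone numbers in the "(123)456-7891" format and email addresses). The helper names (spell_digits, normalize_text, and so on) are hypothetical and not part of any Inworld API, and a production normalizer would need many more patterns.

```python
import re

# Hypothetical helpers for illustration only; not part of any Inworld SDK.
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

# Matches phone numbers written exactly like "(123)456-7891".
PHONE_RE = re.compile(r"\((\d{3})\)\s*(\d{3})-(\d{4})")
# Matches simple email addresses such as "test@example.com".
EMAIL_RE = re.compile(r"\b([\w.+-]+)@([\w-]+(?:\.[\w-]+)+)\b")


def spell_digits(digits: str) -> str:
    """Spell a run of digits one by one: "123" -> "one two three"."""
    return " ".join(DIGIT_WORDS[int(d)] for d in digits)


def _phone_to_words(match: re.Match) -> str:
    # Commas between digit groups give the voice a short pause.
    return ", ".join(spell_digits(group) for group in match.groups())


def _email_to_words(match: re.Match) -> str:
    local, domain = match.groups()
    return f"{local.replace('.', ' dot ')} at {domain.replace('.', ' dot ')}"


def normalize_text(text: str) -> str:
    """Rewrite phone numbers and emails into a spoken form."""
    text = PHONE_RE.sub(_phone_to_words, text)
    text = EMAIL_RE.sub(_email_to_words, text)
    return text


if __name__ == "__main__":
    print(normalize_text("Call (123)456-7891 or write to test@example.com."))
```

Running the normalizer before sending text to the TTS model keeps the rules in one place, so new patterns (dates, times, monetary values) can be added without touching the request code.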

Latency

For realtime use cases, minimizing latency is critical. Here are some tips and techniques you can use:

Stream TTS output - Instead of waiting for the entire generation to finish (which can take a while for long inputs), start playback as soon as the first chunk arrives so that the user doesn’t have to wait. Inworld’s WebSocket streaming should be the lowest-latency option, but streaming over HTTP is still faster than a non-streaming setup.
Chunk TTS input - Instead of sending a large request to the TTS model (whether it’s pre-written or generated by an LLM), consider breaking it into sentence chunks and sending them one by one. The Inworld Agent Runtime provides built-in tools to handle this in a performant manner. For synthesizing text longer than 2,000 characters, see our ready-to-run scripts in the Long Text Input guide.
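
As a rough illustration of the input chunking described above, the sketch below buffers text fragments (for example, tokens streamed from an LLM) and yields each sentence as soon as it is complete, so it can be sent to TTS without waiting for the full response. The function name sentences_from_stream is hypothetical, the sentence splitting is deliberately naive (it will also split after abbreviations such as "Dr."), and the actual TTS request is left out because it depends on your integration.

```python
import re
from typing import Iterable, Iterator

# Naive sentence boundary: whitespace that follows ".", "!" or "?".
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(fragments: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a fragment stream.

    Hypothetical helper for illustration; not part of any Inworld SDK.
    """
    buffer = ""
    for fragment in fragments:
        buffer += fragment
        parts = SENTENCE_END.split(buffer)
        # Everything except the last part ends at a sentence boundary.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        # The last part may still be incomplete, so keep buffering it.
        buffer = parts[-1]
    # Flush whatever remains once the stream ends.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    llm_fragments = ["Hello th", "ere. How are", " you? Fine"]
    for sentence in sentences_from_stream(llm_fragments):
        # Replace this print with a call to your TTS client.
        print(sentence)
```

Sending one sentence per request keeps each generation short, which pairs well with streaming the output.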

Advanced Tips

Natural, Conversational Speech

Natural human conversation is not perfect. It’s full of filler words, pauses, and other natural speech patterns that make it sound more human. Our TTS models are trained to generate the requested text as is, in order to produce the most accurate and consistent output that can be used for a wide range of applications. After all, not all applications want to have a bunch of filler words inserted into the speech! To generate natural, conversational speech, you can use the following techniques:

Insert filler words like uh, um, well, like, and you know in the text. For example, instead of:
I'm not too sure about that.
change it to:
Uh, I'm not uh too sure about that.
If the text is already being generated using an LLM, you can add instructions in the prompt to insert filler words in the response. Alternatively, you can use a small LLM to insert filler words given a piece of text.
Use pause controls to add pauses with SSML break tags like <break time="1s" />. These can help control pacing and make the speech sound more natural.
Use steering in Realtime TTS-2 to direct delivery in natural language (e.g., [speak conversationally with a relaxed pace]) and to insert non-verbals like [laugh], [sigh], or [breathe] inline where they would naturally occur.
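
To illustrate the pause controls mentioned above, here is a minimal sketch that joins sentences with SSML break tags of the form shown in this guide. The add_pauses helper and its default pause length are assumptions made for the example, not part of any Inworld API; tune the duration by ear for your voice and content.

```python
from typing import Iterable


def add_pauses(sentences: Iterable[str], pause: str = "500ms") -> str:
    """Join sentences with an SSML break tag between each pair.

    Hypothetical helper for illustration; the default pause is an assumption.
    """
    tag = f'<break time="{pause}" />'
    cleaned = [s.strip() for s in sentences if s.strip()]
    return f" {tag} ".join(cleaned)


if __name__ == "__main__":
    print(add_pauses(["Well.", "Let me think about that."], pause="1s"))
```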
