To celebrate the launch of Inworld TTS, for a limited time, all usage in TTS Playground is free.
Inworld’s text-to-speech (TTS) models offer ultra-realistic, context-aware speech synthesis and precise voice cloning capabilities, enabling developers to build natural and engaging experiences with human-like speech quality at an accessible price point. Our models can be accessed via API or the TTS Playground.

Models

Inworld TTS

Our flagship model, offering cost-efficient, ultra-realistic speech

  • Preview, available on Portal and API
  • Rich, expressive speech and low latency
  • Support for 11 languages
  • Experimental support for audio markups

Inworld TTS Max

Our more powerful and capable model, available experimentally

  • Experimental, not optimized for real-time use
  • More expressive, contextually aware speech
  • Stronger multilingual capabilities
  • Experimental support for audio markups

Features

 Available in Preview
Radically accessible pricing                See Pricing
State-of-the-art quality                
Real-time latency                
Free instant (zero-shot) voice cloning                
Professional voice cloning                
Multilingual                
Crosslingual (using the same voice across multiple languages)                
Audio markups for emotion, style and non-verbals                
Multiple model sizes for every use case                
Embedded safeguards               
SOC2 Type II                
On-premise deployments                
Open-Source training & modeling code                

Voices

Inworld offers a variety of built-in voices across available languages that showcase a range of vocal characteristics and styles. These voices can be immediately tried out in TTS Playground and used in your applications.

For greater customization, we recommend voice cloning. Create distinct, personalized voices tailored to your experience, with as little as 5 seconds of audio.

Voices perform optimally when synthesizing text in the same language as the original voice. While cross-language synthesis is possible, you’ll achieve the best quality, pronunciation, and naturalness by matching the voice’s native language to your text content.

Language Support

Inworld’s models offer support for the following languages:
  • English (en)
  • Spanish (es)
  • French (fr)
  • Korean (ko)
  • Dutch (nl)
  • Chinese (zh)
  • German (de)
  • Italian (it)
  • Japanese (ja)
  • Polish (pl)
  • Portuguese (pt)
Languages marked with are experimental.
As a larger and more capable model, Inworld TTS Max is better suited for multilingual applications, offering better pronunciation, more accurate intonation, and more natural-sounding speech.

Audio Markups

This feature is currently experimental and only supports English.
Audio markups give you a new level of control over how the model speaks, not just what it says. These markups can be used to control emotional expression, delivery style, and non-verbal vocalizations.

Emotion and Delivery Style

Emotion and delivery style markups control the way a given text is spoken. These work best when used at the beginning of a text and apply to the text that follows.
  • Emotion: [happy], [sad], [angry], [surprised], [fearful], [disgusted]
  • Delivery Style: [laughing], [whispering]
For example:
[happy] I can't believe this is happening.
For best results, use only one emotion or delivery style markup at the beginning of your text. Using multiple emotion and delivery style markups or placing them mid-text may produce mixed results. Instead, we recommend splitting up the text into separate requests, with all markups placed at the start of the text. See our Best Practices guide for more details.
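The recommendation to split text into separate requests can be sketched as a small preprocessing step. This is a minimal illustration, not part of the Inworld API: the helper name split_by_style_markup is hypothetical, and it only handles the emotion and delivery style markups listed above. Each returned chunk could then be sent as its own synthesis request.

```python
import re
from typing import List

# Emotion and delivery style markups listed in this section.
STYLE_MARKUPS = {
    "happy", "sad", "angry", "surprised", "fearful", "disgusted",
    "laughing", "whispering",
}

_STYLE_RE = re.compile(r"\[(" + "|".join(sorted(STYLE_MARKUPS)) + r")\]")


def split_by_style_markup(text: str) -> List[str]:
    """Split text so that each chunk contains at most one style markup,
    placed at the start of the chunk. (Hypothetical helper.)"""
    chunks = []
    start = 0
    for match in _STYLE_RE.finditer(text):
        # A style markup found mid-text closes the current chunk.
        if match.start() > start:
            piece = text[start:match.start()].strip()
            if piece:
                chunks.append(piece)
        start = match.start()
    tail = text[start:].strip()
    if tail:
        chunks.append(tail)
    return chunks


if __name__ == "__main__":
    for chunk in split_by_style_markup(
        "[happy] I can't believe this is happening. [sad] But it ends tomorrow."
    ):
        print(chunk)
```

Non-verbal markups such as [sigh] are deliberately left in place, since they are positional and do not need to lead the text.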

Non-verbal Vocalization

Non-verbal vocalization markups add in non-verbal sounds based on where they are placed in the text.
  • [breathe], [clear_throat], [cough], [laugh], [sigh], [yawn]
For example:
[clear_throat] Did you hear what I said? [sigh] You never listen to me!
Multiple non-verbal vocalizations can be used within a single piece of text to add in the appropriate vocal effects throughout the speech.
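Because only the markups listed in this section are documented, it may help to check text for unrecognized tags before sending it. The sketch below is an assumption about how such a check could look; the helper name unknown_markups is hypothetical, and how the service treats unlisted tags is not stated here.

```python
import re
from typing import List

# Markups documented in this section.
STYLE_MARKUPS = {
    "happy", "sad", "angry", "surprised", "fearful", "disgusted",
    "laughing", "whispering",
}
NON_VERBAL_MARKUPS = {"breathe", "clear_throat", "cough", "laugh", "sigh", "yawn"}
KNOWN_MARKUPS = STYLE_MARKUPS | NON_VERBAL_MARKUPS


def unknown_markups(text: str) -> List[str]:
    """Return bracketed tags in the text that are not in the documented lists,
    in order of appearance. (Hypothetical helper.)"""
    tags = re.findall(r"\[([a-z_]+)\]", text)
    return [tag for tag in tags if tag not in KNOWN_MARKUPS]


if __name__ == "__main__":
    sample = "[clear_throat] Did you hear what I said? [sigh] You never listen to me!"
    print(unknown_markups(sample))               # no unknown tags in the documented example
    print(unknown_markups("[giggle] Hi [sigh]"))  # reports the unlisted tag
```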

Supported Formats

Multiple audio formats are available via API to support different application requirements. The default is MP3.
  • MP3: Popular compressed format with broad device and platform compatibility.
    • Sample rate: 16kHz - 48kHz
    • Bit rates: 32kbps - 320kbps
  • Linear PCM: Uncompressed linear audio with WAV header, ideal for low-latency real-time applications to avoid encoding/decoding overhead.
    • Sample rate: 8kHz - 48kHz
    • Bit depth: 16-bit
  • Opus: High-quality compressed format optimized for low-latency web and mobile applications.
    • Sample rate: 8kHz - 48kHz
    • Bit rates: 32kbps - 192kbps
  • μ-law: Compressed telephony format ideal for voice applications with bandwidth constraints.
    • Sample rate: 8kHz
  • A-law: Compressed telephony format ideal for voice applications with bandwidth constraints.
    • Sample rate: 8kHz
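The ranges above can be encoded as a lookup table to catch out-of-range settings before making a request. This is a sketch under assumptions: the format keys (MP3, LINEAR_PCM, OPUS, MULAW, ALAW) and the function name are illustrative labels, not the API's actual enum values, and the API may accept only specific values within each range.

```python
from typing import Dict, List, Optional, Tuple

Range = Tuple[int, int]

# Sample-rate (Hz) and bit-rate (kbps) ranges copied from the list above.
# The dictionary keys are illustrative labels, not confirmed API values.
AUDIO_LIMITS: Dict[str, Dict[str, Optional[Range]]] = {
    "MP3": {"sample_rate_hz": (16_000, 48_000), "bit_rate_kbps": (32, 320)},
    "LINEAR_PCM": {"sample_rate_hz": (8_000, 48_000), "bit_rate_kbps": None},
    "OPUS": {"sample_rate_hz": (8_000, 48_000), "bit_rate_kbps": (32, 192)},
    "MULAW": {"sample_rate_hz": (8_000, 8_000), "bit_rate_kbps": None},
    "ALAW": {"sample_rate_hz": (8_000, 8_000), "bit_rate_kbps": None},
}


def check_audio_settings(
    fmt: str, sample_rate_hz: int, bit_rate_kbps: Optional[int] = None
) -> List[str]:
    """Return a list of problems; an empty list means the settings fit the ranges."""
    limits = AUDIO_LIMITS.get(fmt)
    if limits is None:
        return [f"unknown format: {fmt}"]
    problems = []
    lo, hi = limits["sample_rate_hz"]
    if not lo <= sample_rate_hz <= hi:
        problems.append(f"sample rate {sample_rate_hz} Hz outside {lo}-{hi} Hz")
    rate_limits = limits["bit_rate_kbps"]
    # Bit rate is only checked for formats that document a bit-rate range.
    if bit_rate_kbps is not None and rate_limits is not None:
        lo, hi = rate_limits
        if not lo <= bit_rate_kbps <= hi:
            problems.append(f"bit rate {bit_rate_kbps} kbps outside {lo}-{hi} kbps")
    return problems


if __name__ == "__main__":
    print(check_audio_settings("MP3", 44_100, 128))
    print(check_audio_settings("MULAW", 16_000))
```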

Additional Configurations

The following optional configurations can also be adjusted as needed when synthesizing audio:
  • Temperature: Controls the randomness of the output. Higher values will make the output more random and can lead to more expressive results. Lower values will make it more deterministic. For the most stable results, we recommend keeping the temperature between 0.6 and 1.
  • Pitch: Adjusts how high or low the voice sounds. Negative values make the voice deeper/lower, while positive values make it higher/squeakier.
  • Talking Speed: Controls how fast the voice speaks. 1.0 is the normal native speed, while 0.5 is half the normal speed and 1.5 is 1.5x the normal speed.
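As a rough worked example of these settings, the sketch below estimates how talking speed changes clip length and checks a temperature against the recommended range. It assumes duration scales inversely with talking speed, which is an approximation and not a documented guarantee; both function names are hypothetical.

```python
def estimated_duration(base_seconds: float, talking_speed: float) -> float:
    """Estimate clip length at a given talking speed, assuming duration
    scales inversely with speed (an approximation)."""
    if talking_speed <= 0:
        raise ValueError("talking_speed must be positive")
    return base_seconds / talking_speed


def temperature_is_recommended(temperature: float) -> bool:
    """True if the temperature lies in the 0.6 to 1 range recommended above."""
    return 0.6 <= temperature <= 1.0


if __name__ == "__main__":
    # A clip that takes 10 s at normal speed (1.0):
    print(estimated_duration(10.0, 0.5))  # half speed, about twice as long
    print(estimated_duration(10.0, 2.0))  # double speed, about half as long
    print(temperature_is_recommended(0.8))
```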

Next Steps

Ready to start exploring? Check out the links below to get started.