Voices

Inworld offers a variety of built-in voices across the available languages, covering a range of vocal characteristics and styles. You can try these voices immediately in TTS Playground and use them in your applications. For greater customization, we recommend voice cloning, which lets you create distinct, personalized voices tailored to your experience from as little as 5 seconds of audio.

Voices perform best when synthesizing text in the same language as the original voice. Cross-language synthesis is possible, but you'll get the best quality, pronunciation, and naturalness by matching the voice's native language to your text content.

Language Support

Inworld’s models offer support for the following languages:
  • English (en)
  • Spanish (es)
  • French (fr)
  • Korean (ko)
  • Dutch (nl)
  • Chinese (zh)
  • German (de)
  • Italian (it)
  • Japanese (ja)
  • Polish (pl)
  • Portuguese (pt)
  • Russian (ru)
Languages marked with are experimental.
As a larger and more capable model, Inworld TTS Max is better suited for multilingual applications, offering better pronunciation, more accurate intonation, and more natural-sounding speech.
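The advice to match a voice's native language to the text can be enforced with a small pre-flight check. The following is a minimal sketch, not part of any Inworld SDK: the names `SUPPORTED_LANGUAGES` and `check_language_match` are hypothetical, and the language codes are taken from the list above.

```python
# Language codes as listed in this section. The dict and function names
# are hypothetical helpers, not part of an official SDK.
SUPPORTED_LANGUAGES = {
    "en": "English", "es": "Spanish", "fr": "French", "ko": "Korean",
    "nl": "Dutch", "zh": "Chinese", "de": "German", "it": "Italian",
    "ja": "Japanese", "pl": "Polish", "pt": "Portuguese", "ru": "Russian",
}


def check_language_match(voice_lang: str, text_lang: str) -> str:
    """Classify a voice/text language pairing.

    Returns "match" when both codes are equal and "cross-language"
    otherwise. Raises ValueError for a code that is not in the list.
    """
    for code in (voice_lang, text_lang):
        if code not in SUPPORTED_LANGUAGES:
            raise ValueError(f"unsupported language code: {code!r}")
    return "match" if voice_lang == text_lang else "cross-language"


if __name__ == "__main__":
    print(check_language_match("ja", "ja"))
    print(check_language_match("en", "fr"))
```

An application might log or warn on "cross-language" pairings rather than block them, since cross-language synthesis is supported but may sound less natural.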

Supported Formats

Multiple audio formats are available via API to support different application requirements. The default is MP3.
  • MP3: Popular compressed format with broad device and platform compatibility.
    • Sample rate: 16kHz - 48kHz
    • Bit rates: 32kbps - 320kbps
  • Linear PCM: Uncompressed linear audio with WAV header, ideal for low-latency real-time applications to avoid encoding/decoding overhead.
    • Sample rate: 8kHz - 48kHz
    • Bit depth: 16-bit
  • Opus: High-quality compressed format optimized for low-latency web and mobile applications.
    • Sample rate: 8kHz - 48kHz
    • Bit rates: 32kbps - 192kbps
  • μ-law: Compressed telephony format ideal for voice applications with bandwidth constraints.
    • Sample rate: 8kHz
  • A-law: Compressed telephony format ideal for voice applications with bandwidth constraints.
    • Sample rate: 8kHz
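The limits listed above can be checked client-side before a request is sent. The sketch below encodes them in a lookup table. It assumes that every value inside a listed range is acceptable (the API may in practice accept only specific discrete values), and the format keys such as `"linear_pcm"` and `"mulaw"` are hypothetical labels, not confirmed API identifiers.

```python
from typing import Dict, List, Optional, Tuple

Range = Tuple[int, int]

# format label -> (sample-rate range in Hz, bit-rate range in bps or None).
# Ranges come from the list above; the labels themselves are hypothetical.
FORMAT_LIMITS: Dict[str, Tuple[Range, Optional[Range]]] = {
    "mp3": ((16_000, 48_000), (32_000, 320_000)),
    "linear_pcm": ((8_000, 48_000), None),  # fixed 16-bit depth, no bit rate
    "opus": ((8_000, 48_000), (32_000, 192_000)),
    "mulaw": ((8_000, 8_000), None),
    "alaw": ((8_000, 8_000), None),
}


def validate_audio_config(
    fmt: str = "mp3",  # MP3 is the documented default
    sample_rate_hz: Optional[int] = None,
    bit_rate_bps: Optional[int] = None,
) -> List[str]:
    """Return a list of problems; an empty list means the config fits."""
    if fmt not in FORMAT_LIMITS:
        return [f"unknown format: {fmt!r}"]

    rate_range, bit_range = FORMAT_LIMITS[fmt]
    problems: List[str] = []

    if sample_rate_hz is not None and not rate_range[0] <= sample_rate_hz <= rate_range[1]:
        problems.append(
            f"{fmt}: sample rate {sample_rate_hz} Hz is outside "
            f"{rate_range[0]}-{rate_range[1]} Hz"
        )

    if bit_rate_bps is not None:
        if bit_range is None:
            problems.append(f"{fmt}: no bit-rate setting is listed for this format")
        elif not bit_range[0] <= bit_rate_bps <= bit_range[1]:
            problems.append(
                f"{fmt}: bit rate {bit_rate_bps} bps is outside "
                f"{bit_range[0]}-{bit_range[1]} bps"
            )

    return problems


if __name__ == "__main__":
    print(validate_audio_config("mp3", 44_100, 128_000))
    print(validate_audio_config("mulaw", 16_000))
```

Returning a list of problems instead of raising on the first one lets a caller report every issue with a configuration at once.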

Additional Configurations

The following optional configurations can also be adjusted as needed when synthesizing audio:
  • Temperature: Controls the randomness of the output. Higher values will make the output more random and can lead to more expressive results. Lower values will make it more deterministic. For the most stable results, we recommend keeping the temperature between 0.6 and 1.
  • Talking Speed: Controls how fast the voice speaks. 1.0 is the normal native speed, 0.5 is half the normal speed, and 1.5 is 1.5x the normal speed.
  • Emphasis Markers: Asterisks around a word (e.g. *really*) can be used to signal emphasis, prompting the voice to stress that word more strongly. This helps convey tone, intent, or emotion more clearly in spoken output.
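The three settings above can be prepared with a few small helpers. This is a sketch under stated assumptions: the dictionary keys `temperature` and `talking_speed`, and the function names, are hypothetical and not taken from the API reference. Only the recommended temperature range (0.6 to 1), the meaning of the speed multiplier, and the asterisk syntax come from the text.

```python
import re
from typing import Dict, List, Tuple


def emphasize(text: str, word: str) -> str:
    """Wrap whole-word occurrences of `word` in asterisks.

    Occurrences that already touch an asterisk are left alone, so the
    function can be applied twice without producing **word**.
    """
    pattern = r"(?<![\w*])" + re.escape(word) + r"(?![\w*])"
    return re.sub(pattern, lambda m: f"*{m.group(0)}*", text)


def build_options(temperature: float, talking_speed: float) -> Tuple[Dict[str, float], List[str]]:
    """Build an options dict (hypothetical key names) plus any warnings."""
    if talking_speed <= 0:
        raise ValueError("talking_speed must be positive")
    warnings: List[str] = []
    # 0.6 to 1 is the range recommended above for the most stable results.
    if not 0.6 <= temperature <= 1.0:
        warnings.append(
            f"temperature {temperature} is outside the recommended 0.6-1 range"
        )
    return {"temperature": temperature, "talking_speed": talking_speed}, warnings


def scaled_duration(base_seconds: float, talking_speed: float) -> float:
    """Estimate clip length at a given speed: 0.5 doubles it, 2.0 halves it."""
    return base_seconds / talking_speed


if __name__ == "__main__":
    print(emphasize("I really mean it", "really"))
    print(build_options(0.8, 1.5))
    print(scaled_duration(10.0, 0.5))
```

The duration estimate is only an approximation that treats talking speed as a pure playback multiplier. Actual synthesized length may differ.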