Voices
Inworld offers a variety of built-in voices across available languages that showcase a range of vocal characteristics and styles. These voices can be immediately tried out in TTS Playground and used in your applications. For greater customization, we recommend voice cloning. Create distinct, personalized voices tailored to your experience, with as little as 5 seconds of audio. Voices perform optimally when synthesizing text in the same language as the original voice. While cross-language synthesis is possible, you’ll achieve the best quality, pronunciation, and naturalness by matching the voice’s native language to your text content.
Language Support
Inworld’s models offer support for the following languages:
- English (en)
- Spanish (es)
- French (fr)
- Korean (ko)
- Dutch (nl)
- Chinese (zh)
- German (de)
- Italian (it)
- Japanese (ja)
- Polish (pl)
- Portuguese (pt)
- Russian (ru)
As a larger and more capable model, Inworld TTS Max is better suited for multilingual applications, offering better pronunciation, more accurate intonation, and more natural-sounding speech.
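An application that accepts locale tags such as pt-BR may want to check them against the language list above before picking a voice. The following is a minimal sketch in Python; the names SUPPORTED_LANGUAGES, normalize_language, and is_supported are hypothetical helpers for illustration and are not part of Inworld's API.

```python
# Language codes copied from the supported-language list above.
# The helper names are illustrative, not part of any Inworld SDK.
SUPPORTED_LANGUAGES = {
    "en": "English",
    "es": "Spanish",
    "fr": "French",
    "ko": "Korean",
    "nl": "Dutch",
    "zh": "Chinese",
    "de": "German",
    "it": "Italian",
    "ja": "Japanese",
    "pl": "Polish",
    "pt": "Portuguese",
    "ru": "Russian",
}


def normalize_language(tag: str) -> str:
    """Reduce a locale tag such as 'pt-BR' or 'en_US' to its primary subtag."""
    return tag.replace("_", "-").split("-")[0].lower()


def is_supported(tag: str) -> bool:
    """Return True if the tag's primary language is in the list above."""
    return normalize_language(tag) in SUPPORTED_LANGUAGES
```

A check like is_supported("pt-BR") can then gate voice selection, so that text is paired with a voice whose native language matches it.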
Supported Formats
Multiple audio formats are available via API to support different application requirements. The default is MP3.
- MP3: Popular compressed format with broad device and platform compatibility.
  - Sample rate: 16kHz - 48kHz
  - Bit rates: 32kbps - 320kbps
- Linear PCM: Uncompressed linear audio with WAV header, ideal for low-latency real-time applications to avoid encoding/decoding overhead.
  - Sample rate: 8kHz - 48kHz
  - Bit depth: 16-bit
- Opus: High-quality compressed format optimized for low-latency web and mobile applications.
  - Sample rate: 8kHz - 48kHz
  - Bit rates: 32kbps - 192kbps
- μ-law: Compressed telephony format ideal for voice applications with bandwidth constraints.
  - Sample rate: 8kHz
- A-law: Compressed telephony format ideal for voice applications with bandwidth constraints.
  - Sample rate: 8kHz
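The ranges above can be checked on the client before a request is sent. The sketch below encodes them in a small Python table; the format keys ("mp3", "linear_pcm", "opus", "mulaw", "alaw") and the function validate_audio_settings are assumptions made for illustration, and the actual API may use different identifiers or accept only specific values within each range.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class FormatLimits:
    sample_rate_hz: Tuple[int, int]                 # inclusive (min, max)
    bit_rate_bps: Optional[Tuple[int, int]] = None  # None: no bit rate listed


# Ranges copied from the format list above; the keys are hypothetical names.
FORMAT_LIMITS: Dict[str, FormatLimits] = {
    "mp3": FormatLimits((16_000, 48_000), (32_000, 320_000)),
    "linear_pcm": FormatLimits((8_000, 48_000)),
    "opus": FormatLimits((8_000, 48_000), (32_000, 192_000)),
    "mulaw": FormatLimits((8_000, 8_000)),
    "alaw": FormatLimits((8_000, 8_000)),
}


def validate_audio_settings(
    fmt: str, sample_rate_hz: int, bit_rate_bps: Optional[int] = None
) -> List[str]:
    """Return a list of problems; an empty list means the settings fit the documented ranges."""
    limits = FORMAT_LIMITS.get(fmt)
    if limits is None:
        return [f"unknown format: {fmt}"]

    problems: List[str] = []
    low, high = limits.sample_rate_hz
    if not low <= sample_rate_hz <= high:
        problems.append(f"sample rate {sample_rate_hz} Hz outside {low}-{high} Hz")

    if bit_rate_bps is not None:
        if limits.bit_rate_bps is None:
            problems.append(f"no bit rate range is listed for {fmt}")
        else:
            low, high = limits.bit_rate_bps
            if not low <= bit_rate_bps <= high:
                problems.append(f"bit rate {bit_rate_bps} bps outside {low}-{high} bps")
    return problems
```

For example, validate_audio_settings("mp3", 44_100, 128_000) returns an empty list, while requesting 16 kHz for "mulaw" reports a sample-rate problem.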
Additional Configurations
The following optional configurations can also be adjusted as needed when synthesizing audio:
- Temperature: Controls the randomness of the output. Higher values will make the output more random and can lead to more expressive results. Lower values will make it more deterministic. For the most stable results, we recommend keeping the temperature between 0.6 and 1.
- Talking Speed: Controls how fast the voice speaks. 1.0 is the normal native speed, while 0.5 is half the normal speed and 1.5 is 1.5 times the normal speed.
- Emphasis Markers: Asterisks around a word (e.g. *really*) can be used to signal emphasis, prompting the voice to stress that word more strongly. This helps convey tone, intent, or emotion more clearly in spoken output.
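These settings can be prepared with a few small helpers before text is sent for synthesis. The Python sketch below is illustrative only: the function names are hypothetical, and the duration estimate assumes that audio length scales inversely with talking speed, which follows from the description above but is not a documented guarantee.

```python
import re

# Range recommended above for the most stable results.
RECOMMENDED_TEMPERATURE = (0.6, 1.0)


def temperature_in_recommended_range(temperature: float) -> bool:
    """Check a temperature against the recommended 0.6 to 1 range."""
    low, high = RECOMMENDED_TEMPERATURE
    return low <= temperature <= high


def emphasize(text: str, word: str) -> str:
    """Wrap the first whole-word occurrence of `word` in asterisks."""
    pattern = re.compile(rf"\b{re.escape(word)}\b")
    return pattern.sub(lambda match: f"*{match.group(0)}*", text, count=1)


def estimated_duration(base_seconds: float, talking_speed: float) -> float:
    """Rough estimate: a clip at speed s lasts about base / s seconds (assumption)."""
    if talking_speed <= 0:
        raise ValueError("talking_speed must be positive")
    return base_seconds / talking_speed
```

For instance, emphasize("I really mean it", "really") yields "I *really* mean it", and a 10-second clip at a talking speed of 0.5 would be estimated at roughly 20 seconds.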