Inworld’s text-to-speech (TTS) models offer realistic, context-aware speech synthesis, precise voice cloning, and zero data retention, so developers can build natural, engaging experiences with human-like speech quality at an accessible price point. The models are available through the API or the TTS Playground.

Developer quickstart

Learn how to make your first API call with a guided tutorial.

TTS Playground

Try different TTS models and voice cloning in TTS Playground.

Code Examples

Browse ready-to-use GitHub samples for common use cases.

Models

Inworld TTS 1.5 Max

Our flagship model, delivering the best balance of quality and speed.

  • Rich, expressive, contextually aware speech
  • Support for 15 languages
  • Optimized for real-time use (<200ms median latency)
  • High-quality instant voice cloning

Inworld TTS 1.5 Mini

Our fastest, most cost-efficient model, for when latency is the top priority.

  • Ultra-low latency (~120ms median latency)
  • Support for 15 languages
  • Radically affordable pricing
  • High-quality instant voice cloning
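
One way to read the two model descriptions is as a trade-off between quality and median latency. The sketch below is illustrative only: it assumes roughly 200 ms and 120 ms median latency for the two models, as quoted on this page, and it uses the labels TTS-1.5-Max and TTS-1.5-Mini as plain display names, not as confirmed API model identifiers. The `pick_model` helper is hypothetical.

```python
from typing import Optional

# Median (P50) latencies in milliseconds, as quoted on this page.
# The keys are display labels, not confirmed API model identifiers.
P50_LATENCY_MS = {"TTS-1.5-Max": 200, "TTS-1.5-Mini": 120}


def pick_model(latency_budget_ms: int) -> Optional[str]:
    """Return the first model, in quality order, whose P50 latency fits the budget.

    Hypothetical helper: a median says nothing about tail latency, so treat
    the result as a starting point rather than a guarantee.
    """
    # Max is described as the flagship, so it is tried first.
    for model in ("TTS-1.5-Max", "TTS-1.5-Mini"):
        if P50_LATENCY_MS[model] <= latency_budget_ms:
            return model
    return None  # Neither quoted median fits the budget.


if __name__ == "__main__":
    for budget in (250, 150, 100):
        print(budget, "ms budget:", pick_model(budget))
```

Because the quoted figures are medians, roughly half of all requests can be expected to take longer, so a real latency budget should leave headroom.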

Features

Feature                                            TTS-1.5-Max                    TTS-1.5-Mini
Radically accessible pricing                       $10/1M characters              $5/1M characters
Quality                                            #1 ranked, maximum stability   #1 ranked
P50 latency                                        200 ms                         120 ms
Free instant voice cloning
Professional voice cloning
Custom pronunciation
Multilingual                                       15 languages                   15 languages
Audio markups for emotion, style and non-verbals
Timestamp alignment
On-premises deployments
Zero data retention
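
The per-character prices in the table make rough budgeting straightforward. The following is a minimal sketch, assuming billing scales linearly with character count and that a character corresponds to one Python string element; the exact counting rules are not stated on this page, so check them before relying on the numbers. The `estimate_cost_usd` helper is hypothetical, and the dictionary keys are the table's column labels, not confirmed API model identifiers.

```python
# Prices in USD per 1,000,000 characters, taken from the feature table.
PRICE_PER_MILLION_USD = {
    "TTS-1.5-Max": 10.00,
    "TTS-1.5-Mini": 5.00,
}


def estimate_cost_usd(text: str, model: str) -> float:
    """Estimate synthesis cost, assuming price is linear in len(text).

    Assumption: one billed character equals one element of the Python
    string. The provider's actual counting rules may differ.
    """
    return len(text) / 1_000_000 * PRICE_PER_MILLION_USD[model]


if __name__ == "__main__":
    script = "Hello, world! " * 10_000  # 140,000 characters
    for model in PRICE_PER_MILLION_USD:
        print(f"{model}: ${estimate_cost_usd(script, model):.2f}")
```

Under these assumptions, a 140,000-character script comes to about $1.40 on TTS-1.5-Max and $0.70 on TTS-1.5-Mini.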