Inworld’s text-to-speech (TTS) models offer ultra-realistic, context-aware speech synthesis and precise voice cloning capabilities, enabling developers to build natural and engaging experiences with human-like speech quality at an accessible price point. Our models can be accessed via API or the TTS Playground.

Models

Inworld TTS 1.5 Max

Our flagship model, delivering the best balance of quality and speed.

  • Rich, expressive, contextually aware speech
  • Support for 15 languages
  • Optimized for real-time use (<200 ms median latency)
  • High-quality instant voice cloning

Inworld TTS 1.5 Mini

Our fastest and most cost-efficient model, for when latency is the top priority.

  • Ultra-low latency (~120 ms median)
  • Support for 15 languages
  • Radically affordable pricing
  • High-quality instant voice cloning

Features

Feature                                             TTS-1.5-Max                     TTS-1.5-Mini
Radically accessible pricing                        $10/1M characters               $5/1M characters
Quality                                             #1 ranked, maximum stability    #1 ranked
P50 latency                                         200 ms                          120 ms
Free instant voice cloning
Professional voice cloning
Custom pronunciation
Multilingual                                        15 languages                    15 languages
Audio markups for emotion, style and non-verbals
Timestamp alignment
On-premise deployments
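
To make the pricing row concrete, the following is a minimal sketch that estimates synthesis cost from character count. The rates come from the table above. The helper name `estimate_cost_usd` is hypothetical, the dictionary keys are the table's column labels rather than confirmed API model identifiers, and the assumption that every character (including spaces and punctuation) is billed is not confirmed by this page.

```python
from decimal import Decimal

# USD per 1,000,000 characters, copied from the feature table.
# Keys are the table's column labels, NOT confirmed API model identifiers.
PRICE_PER_MILLION_CHARS = {
    "TTS-1.5-Max": Decimal("10"),
    "TTS-1.5-Mini": Decimal("5"),
}


def estimate_cost_usd(text: str, model: str) -> Decimal:
    """Hypothetical helper: estimate the cost of synthesizing `text`.

    Assumption: every character, including whitespace and punctuation,
    counts toward billing. Check the actual billing rules before relying
    on this figure.
    """
    rate = PRICE_PER_MILLION_CHARS[model]
    return rate * len(text) / Decimal(1_000_000)


if __name__ == "__main__":
    # A 130,000-character script (13 characters repeated 10,000 times).
    script = "Hello there! " * 10_000
    for model in PRICE_PER_MILLION_CHARS:
        print(f"{model}: ${estimate_cost_usd(script, model)}")
```

`Decimal` is used instead of floating point so that small per-request amounts add up exactly when summed over many requests.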