Use the inworld-tts package to add speech synthesis, voice cloning, and voice design to your Python app
The inworld-tts Python SDK wraps the Inworld TTS REST API with a clean, Pythonic interface. It handles chunking for long text, retries with exponential backoff, and connection management automatically — reducing typical integrations from 30+ lines of raw HTTP to just a few lines of code.
    from inworld_tts import InworldTTS

    tts = InworldTTS()  # reads INWORLD_API_KEY from env

    tts.generate(
        text="What a wonderful day to be a text-to-speech model!",
        voice="Ashley",
        output_file="output.mp3",
    )
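The introduction mentions retries with exponential backoff. The SDK's actual retry policy (attempt count, delays, which errors are retried) is not described here, so the following is only a generic sketch of the technique. The helper name `call_with_backoff` and its defaults are illustrative assumptions, not part of inworld-tts.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with exponentially growing delays.

    Hypothetical helper for illustration; not the SDK's implementation.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles on each attempt: base, 2*base, 4*base, ...
            delay = base_delay * (2 ** attempt)
            # Add up to 10% random jitter so parallel clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")
```

Because the SDK is described as retrying automatically, a wrapper like this should only be needed around your own network calls.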
Stream audio chunks over HTTP as they are generated. Lower time-to-first-audio than generate(). Text must be 2,000 characters or fewer.
    chunks = []
    async for chunk in tts.stream(
        text="Streaming is great for real-time playback!",
        voice="Ashley",
    ):
        chunks.append(chunk)
    audio = b"".join(chunks)

Parameters are the same as generate(), except text must be ≤2,000 characters and the default model is "inworld-tts-1.5-mini".

Yields: bytes (audio chunks as they arrive).
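The streaming loop above uses `async for`, which Python only accepts inside a coroutine. The sketch below shows one way to drive such a loop from synchronous code with `asyncio.run`. To stay runnable without an API key, it uses `fake_stream`, a stand-in async generator that is an assumption of this example. In real use you would pass the iterator returned by `tts.stream(...)` instead.

```python
import asyncio
from typing import AsyncIterator


async def fake_stream(text: str) -> AsyncIterator[bytes]:
    """Stand-in for tts.stream(): yields bytes chunks one at a time."""
    for word in text.split():
        await asyncio.sleep(0)  # simulate waiting on the network
        yield word.encode("utf-8")


async def collect(stream: AsyncIterator[bytes]) -> bytes:
    """Drain an async byte stream and return the joined result."""
    chunks = []
    async for chunk in stream:
        chunks.append(chunk)
    return b"".join(chunks)


# asyncio.run starts an event loop, runs the coroutine, and returns its result.
audio = asyncio.run(collect(fake_stream("streamed audio bytes")))
```

For real-time playback you would hand each chunk to the audio player inside the loop rather than collecting everything first, since collecting gives up the time-to-first-audio advantage.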
Stream audio chunks, each paired with optional timestamp data. Text must be ≤2,000 characters.
    async for chunk in tts.stream_with_timestamps(
        text="Streaming with timestamps!",
        voice="Ashley",
        timestamp_type="WORD",
    ):
        # chunk.audio: bytes
        # chunk.timestamps: TimestampInfo | None
        pass

Takes all the same parameters as stream(), plus timestamp_type (required). Default model is "inworld-tts-1.5-mini".

Yields: objects with audio: bytes and optional timestamps: TimestampInfo.
Clone a voice from one or more audio recordings. Only 5–15 seconds of audio is needed.
    result = tts.clone_voice(
        audio_samples=["./recording.wav"],
        display_name="My Cloned Voice",
        lang="EN_US",
    )
    print(result.voice.voice_id)  # use this ID in generate()
| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| audio_samples | list[bytes \| str] | Yes | - | Audio files as bytes, or file paths. WAV or MP3. |
| display_name | str | No | "Cloned Voice" | Display name for the cloned voice. |
| lang | str | No | "EN_US" | Language code of the recordings. |
| transcriptions | list[str] | No | - | Transcriptions aligned with each audio sample. Improves clone quality. |
| description | str | No | - | Voice description. |
| tags | list[str] | No | - | Tags for filtering. |
| remove_background_noise | bool | No | False | Apply noise reduction before cloning. |
Returns: CloneVoiceResult. The cloned voice ID is at result.voice.voice_id.
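Since cloning is described as needing 5–15 seconds of audio, it can help to check a recording's length locally before uploading it. This is a minimal sketch using only the standard-library `wave` module, so it covers WAV input only. The helper names are hypothetical, and whether the API rejects recordings outside this range is not stated here.

```python
import io
import wave


def wav_duration_seconds(data: bytes) -> float:
    """Return the duration of an in-memory WAV recording in seconds."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_suitable_length(data: bytes, min_s: float = 5.0, max_s: float = 15.0) -> bool:
    """True if the recording falls within the 5-15 second range from the docs."""
    return min_s <= wav_duration_seconds(data) <= max_s


def make_silent_wav(seconds: float, rate: int = 8000) -> bytes:
    """Build a silent 16-bit mono WAV, used here only to exercise the check."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()


sample = make_silent_wav(6)
print(is_suitable_length(sample))
```

Bytes that pass the check can be placed directly in `audio_samples`, which is documented as accepting either bytes or file paths.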
Design a new voice from a text description — no audio recording needed.
    result = tts.design_voice(
        design_prompt="A warm, friendly female voice with a slight British accent",
        preview_text="Hello! Welcome to our application.",
        number_of_samples=3,
    )
    # Listen to previews, then publish the one you like
    chosen_voice = result.preview_voices[0]
| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| design_prompt | str | Yes | - | Natural-language description of the voice (30–250 characters). |
| preview_text | str | Yes | - | Text the generated voice will speak in the preview. |
| lang | str | No | "EN_US" | Language code. |
| number_of_samples | int | No | 1 | Number of preview candidates (1–3). |
Returns: DesignVoiceResult. Preview voices are at result.preview_voices.
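The documented limits on `design_prompt` (30–250 characters) and `number_of_samples` (1–3) can be checked before making a request. The function below is a hypothetical client-side helper, not part of the SDK. How the SDK itself reports out-of-range values is not covered here.

```python
def check_design_args(design_prompt: str, number_of_samples: int = 1) -> list[str]:
    """Return a list of problems; an empty list means the arguments look valid."""
    problems = []
    if not 30 <= len(design_prompt) <= 250:
        problems.append(
            f"design_prompt is {len(design_prompt)} characters; expected 30-250"
        )
    if not 1 <= number_of_samples <= 3:
        problems.append(
            f"number_of_samples is {number_of_samples}; expected 1-3"
        )
    return problems


prompt = "A warm, friendly female voice with a slight British accent"
for problem in check_design_args(prompt, number_of_samples=3):
    print(problem)  # prints nothing: both arguments are within the limits
```

Returning a list rather than raising on the first failure lets a caller show every problem at once, which suits form-style input.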
Migrate a voice from ElevenLabs to your Inworld workspace. Fetches the voice’s audio samples directly from ElevenLabs and clones them into Inworld. No ElevenLabs SDK required.
generate() and generate_with_timestamps() automatically chunk text longer than 2,000 characters and send chunks in parallel (controlled by max_concurrent_requests). The resulting audio is seamlessly concatenated, and timestamp offsets are merged correctly.

stream() and stream_with_timestamps() require text of 2,000 characters or fewer. For longer text with streaming, split the text yourself and call stream() for each segment.
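For the manual splitting that streaming requires, one possible approach is to break at sentence boundaries and pack sentences into segments that stay under the limit. The `split_text` helper below is a sketch and an assumption of this example. It is not the algorithm generate() uses internally, which is not documented here.

```python
import re


def split_text(text: str, max_chars: int = 2000) -> list[str]:
    """Split text into segments of at most max_chars, preferring sentence ends."""
    # Break after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    segments: list[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at max_chars,
        # which may fall in the middle of a word.
        while len(sentence) > max_chars:
            if current:
                segments.append(current)
                current = ""
            segments.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate  # sentence still fits in the current segment
        else:
            segments.append(current)  # close the full segment, start a new one
            current = sentence
    if current:
        segments.append(current)
    return segments


parts = split_text("One. Two! Three? Four.", max_chars=10)
print(parts)
```

Each returned segment can then be passed as the `text` argument of a separate stream() call, in order, with the resulting audio played or joined sequentially.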