Python SDK - Inworld AI Documentation

The inworld-tts Python SDK wraps the Inworld TTS REST API with a clean, Pythonic interface. It handles chunking for long text, retries with exponential backoff, and connection management automatically — reducing typical integrations from 30+ lines of raw HTTP to just a few lines of code.

pip install inworld-tts

Requires Python 3.10+.

Quick Start

from inworld_tts import InworldTTS

tts = InworldTTS()  # reads INWORLD_API_KEY from env

tts.generate(
    text="What a wonderful day to be a text-to-speech model!",
    voice="Ashley",
    output_file="output.mp3",
)

Speech Synthesis

`generate(options)`

Synthesize speech and return the complete audio as bytes. Text longer than 2,000 characters is automatically chunked and sent in parallel.

audio = tts.generate(
    text="Hello, world!",
    voice="Ashley",
    model="inworld-tts-2",
    encoding="MP3",
    output_file="output.mp3",  # optional — also writes to disk
)

Parameter	Type	Required	Default	Description
`text`	`str`	Yes	—	Text to synthesize. Any length. Supports `<break time="Xs"/>` SSML.
`voice`	`str`	Yes	—	Voice ID (e.g. `"Ashley"`, `"Dennis"`, or a custom voice ID).
`model`	`str`	No	`"inworld-tts-2"`	Model ID.
`encoding`	`str`	No	`"MP3"`	Audio format: `MP3`, `OGG_OPUS`, `FLAC`, `LINEAR16`, `WAV`, `PCM`, `ALAW`, `MULAW`.
`sample_rate`	`int`	No	`48000`	Sample rate in Hz.
`bit_rate`	`int`	No	`128000`	Bit rate in bps (MP3 / OGG_OPUS only).
`speaking_rate`	`float`	No	`1.0`	Speed multiplier (0.5–1.5).
`language`	`str`	No	—	BCP-47 language tag (e.g. `"en-US"`, `"fr-FR"`) telling the model which language the voice should speak. Auto-detected from the input text when omitted.
`delivery_mode`	`str`	No	`"BALANCED"`	Trade-off between stability and expressiveness on `inworld-tts-2`: `"STABLE"`, `"BALANCED"`, or `"CREATIVE"`. Ignored on other models.
`temperature`	`float`	No	`1.0`	Expressiveness (0.0–2.0). Higher = more expressive. Ignored on `inworld-tts-2` — use `delivery_mode` instead.
`output_file`	`str`	No	—	Write audio to this file path.
`play`	`bool`	No	`False`	Play audio immediately after synthesis.

Returns: bytes — raw audio bytes in the requested encoding.
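
The ranges and model rules in the table can be checked client-side before a request is sent. The sketch below is a hypothetical helper, not part of the SDK: the name `check_generate_options` and its messages are assumptions for illustration, and only the numeric ranges and the `inworld-tts-2` rule come from the table above.

```python
from typing import List, Optional

DELIVERY_MODES = {"STABLE", "BALANCED", "CREATIVE"}


def check_generate_options(
    model: str = "inworld-tts-2",
    speaking_rate: float = 1.0,
    temperature: Optional[float] = None,
    delivery_mode: Optional[str] = None,
) -> List[str]:
    """Return a list of problems; an empty list means the options look consistent."""
    problems: List[str] = []
    if not 0.5 <= speaking_rate <= 1.5:
        problems.append("speaking_rate must be between 0.5 and 1.5")
    if temperature is not None and not 0.0 <= temperature <= 2.0:
        problems.append("temperature must be between 0.0 and 2.0")
    if delivery_mode is not None and delivery_mode not in DELIVERY_MODES:
        problems.append("delivery_mode must be STABLE, BALANCED, or CREATIVE")
    # Per the table: temperature is ignored on inworld-tts-2,
    # and delivery_mode is ignored on every other model.
    if model == "inworld-tts-2" and temperature is not None:
        problems.append("temperature is ignored on inworld-tts-2; use delivery_mode")
    if model != "inworld-tts-2" and delivery_mode is not None:
        problems.append("delivery_mode is ignored on models other than inworld-tts-2")
    return problems
```

Running a check like this before calling generate() surfaces settings that would otherwise be silently ignored.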

`stream(options)`

Stream audio chunks over HTTP as they are generated. Lower time-to-first-audio than generate(). Text must be 2,000 characters or fewer.

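import asyncio

async def main() -> bytes:
    chunks = []
    # async for is only valid inside a coroutine
    async for chunk in tts.stream(
        text="Streaming is great for real-time playback!",
        voice="Ashley",
    ):
        chunks.append(chunk)
    return b"".join(chunks)

audio = asyncio.run(main())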

Parameters are the same as generate(), except text must be ≤2,000 characters and the default model is "inworld-tts-1.5-mini".

Yields: bytes, audio chunks as they arrive.
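
Because chunks arrive incrementally, they can be written to disk as they come in rather than collected in memory first. The following is a minimal sketch that works with any async iterable of bytes; `save_stream` is a hypothetical helper name, and it assumes stream() returns an async iterable, as the `async for` example above suggests.

```python
import asyncio
from typing import AsyncIterable


async def save_stream(chunks: AsyncIterable[bytes], path: str) -> int:
    """Write each chunk to `path` as it arrives and return the total byte count."""
    total = 0
    with open(path, "wb") as fh:
        async for chunk in chunks:
            fh.write(chunk)
            total += len(chunk)
    return total
```

Inside a coroutine, this could be called as `await save_stream(tts.stream(text="...", voice="Ashley"), "output.mp3")`.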

`generate_with_timestamps(options)`

Same as generate() but also returns word- or character-level timing data. Useful for lip-sync, karaoke, and subtitle alignment.

result = tts.generate_with_timestamps(
    text="Timestamps are useful for lip sync.",
    voice="Ashley",
    timestamp_type="WORD",
)

# result.audio → bytes
# result.timestamps.word_alignment.words → ["Timestamps", "are", "useful", ...]
# result.timestamps.word_alignment.word_start_time_seconds → [0.0, 0.42, 0.61, ...]

Takes all the same parameters as generate(), plus:

Parameter	Type	Required	Description
`timestamp_type`	`"WORD"` \| `"CHARACTER"`	Yes	`"WORD"` returns word timing, phonemes, and visemes. `"CHARACTER"` returns per-character timing.

Returns: an object with audio: bytes and timestamps: TimestampInfo.
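
For subtitles or karaoke, the parallel word and start-time lists are usually turned into per-word cues. The sketch below assumes only the two fields shown in the example (`words` and `word_start_time_seconds`); since no end-time field is shown here, each word is treated as ending where the next begins, and the `final_duration` for the last word is an arbitrary assumption. `word_cues` is a hypothetical helper, not an SDK function.

```python
from typing import List, Sequence, Tuple


def word_cues(
    words: Sequence[str],
    start_times: Sequence[float],
    final_duration: float = 0.5,
) -> List[Tuple[str, float, float]]:
    """Pair each word with (start, end) in seconds."""
    if len(words) != len(start_times):
        raise ValueError("words and start_times must have the same length")
    cues = []
    for i, (word, start) in enumerate(zip(words, start_times)):
        # A word ends where the next one starts; the last word gets a fixed length.
        end = start_times[i + 1] if i + 1 < len(words) else start + final_duration
        cues.append((word, start, end))
    return cues
```

With the result above, this could be called as `word_cues(alignment.words, alignment.word_start_time_seconds)`, where `alignment` is `result.timestamps.word_alignment`.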

`stream_with_timestamps(options)`

Stream audio chunks, each paired with optional timestamp data. Text must be ≤2,000 characters.

import asyncio

async def main() -> None:
    async for chunk in tts.stream_with_timestamps(
        text="Streaming with timestamps!",
        voice="Ashley",
        timestamp_type="WORD",
    ):
        # chunk.audio: bytes
        # chunk.timestamps: TimestampInfo | None
        pass

asyncio.run(main())

Takes all the same parameters as stream(), plus timestamp_type (required). Default model is "inworld-tts-1.5-mini".

Yields: objects with audio: bytes and optional timestamps: TimestampInfo.

`play(audio, options)`

Play audio from bytes or a file path. Encoding is auto-detected from magic bytes unless overridden.

audio = tts.generate(text="Listen to this!", voice="Ashley")
tts.play(audio)

# Or play from a file
tts.play("output.mp3")

Parameter	Type	Required	Default	Description
`audio`	`bytes` \| `str`	Yes	—	Raw audio bytes or a file path.
`encoding`	`str`	No	auto-detected	Format hint (`"MP3"`, `"WAV"`, etc.). Inferred from extension for file paths.
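
To illustrate what detection from magic bytes means, here is a rough sketch based on the well-known file signatures for WAV, FLAC, Ogg, and MP3. It is not the SDK's actual detection logic; `sniff_encoding` is a hypothetical name, and treating any Ogg container as `OGG_OPUS` is an assumption. Headerless formats such as raw PCM, ALAW, and MULAW cannot be identified this way, which is one reason an explicit `encoding` hint can be needed.

```python
from typing import Optional


def sniff_encoding(data: bytes) -> Optional[str]:
    """Guess an audio format from its leading bytes, or return None if unknown."""
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "WAV"
    if data.startswith(b"fLaC"):
        return "FLAC"
    if data.startswith(b"OggS"):
        # Ogg is a container; the Opus payload is assumed here, not verified.
        return "OGG_OPUS"
    if data.startswith(b"ID3"):
        return "MP3"  # MP3 file with an ID3v2 tag
    if len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0:
        return "MP3"  # MPEG audio frame sync (11 set bits)
    return None
```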

Voice Management

`list_voices(options)`

List available voices, optionally filtered by language.

voices = tts.list_voices()

# Filter by language
en_voices = tts.list_voices(lang="EN_US")
multi_lang = tts.list_voices(lang=["EN_US", "ES_ES"])

Parameter	Type	Required	Description
`lang`	`str` \| `list[str]`	No	Filter by language code(s). Returns all voices when omitted.

Returns: list[VoiceInfo]

`get_voice(voice)`

Get details for a single voice. Works with custom voices in your workspace (cloned or designed voices).

voice = tts.get_voice("my-custom-voice-id")
# voice.voice_id, voice.display_name, voice.lang_code, ...

Returns: VoiceInfo

`clone_voice(options)`

Clone a voice from one or more audio recordings. Only 5–15 seconds of audio is needed.

result = tts.clone_voice(
    audio_samples=["./recording.wav"],
    display_name="My Cloned Voice",
    lang="EN_US",
)

print(result.voice.voice_id)  # use this ID in generate()

Parameter	Type	Required	Default	Description
`audio_samples`	`list[bytes \| str]`	Yes	—	Audio files as `bytes`, or file paths. WAV or MP3.
`display_name`	`str`	No	`"Cloned Voice"`	Display name for the cloned voice.
`lang`	`str`	No	`"EN_US"`	Language code of the recordings.
`transcriptions`	`list[str]`	No	—	Transcriptions aligned with each audio sample. Improves clone quality.
`description`	`str`	No	—	Voice description.
`tags`	`list[str]`	No	—	Tags for filtering.
`remove_background_noise`	`bool`	No	`False`	Apply noise reduction before cloning.

Returns: CloneVoiceResult — the cloned voice ID is at result.voice.voice_id.
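
Before uploading, it can help to confirm that a recording falls in the 5–15 second range mentioned above. The sketch below uses Python's standard `wave` module, so it only handles WAV files; the helper names are hypothetical, and treating the range as a check to enforce rather than loose guidance is an assumption.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_good_clone_length(
    path: str, min_seconds: float = 5.0, max_seconds: float = 15.0
) -> bool:
    """True when the recording length falls inside the suggested range."""
    return min_seconds <= wav_duration_seconds(path) <= max_seconds
```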

`design_voice(options)`

Design a new voice from a text description — no audio recording needed.

result = tts.design_voice(
    design_prompt="A warm, friendly female voice with a slight British accent",
    preview_text="Hello! Welcome to our application.",
    number_of_samples=3,
)

# Listen to previews, then publish the one you like
chosen_voice = result.preview_voices[0]

Parameter	Type	Required	Default	Description
`design_prompt`	`str`	Yes	—	Natural-language description of the voice (30–250 characters).
`preview_text`	`str`	Yes	—	Text the generated voice will speak in the preview.
`lang`	`str`	No	`"EN_US"`	Language code.
`number_of_samples`	`int`	No	`1`	Number of preview candidates (1–3).

Returns: DesignVoiceResult — preview voices at result.preview_voices.

`publish_voice(options)`

Publish a designed or cloned voice preview to your library so it can be used in generate() and stream().

voice = tts.publish_voice(
    voice=chosen_voice.voice_id,
    display_name="My Designed Voice",
)

Parameter	Type	Required	Description
`voice`	`str`	Yes	Voice ID from `design_voice()` or `clone_voice()`.
`display_name`	`str`	No	Display name for the published voice.
`description`	`str`	No	Description.
`tags`	`list[str]`	No	Tags for filtering.

Returns: VoiceInfo

`migrate_from_elevenlabs(options)`

Migrate a voice from ElevenLabs to your Inworld workspace. Fetches the voice’s audio samples directly from ElevenLabs and clones them into Inworld. No ElevenLabs SDK required.

import os

result = tts.migrate_from_elevenlabs(
    eleven_labs_api_key=os.environ["ELEVEN_LABS_API_KEY"],
    eleven_labs_voice_id="abc123",
)

print(f'Migrated "{result.eleven_labs_name}" → {result.inworld_voice_id}')

Parameter	Type	Required	Description
`eleven_labs_api_key`	`str`	Yes	Your ElevenLabs API key.
`eleven_labs_voice_id`	`str`	Yes	ElevenLabs voice ID to migrate.

Returns: an object with eleven_labs_voice_id, eleven_labs_name, and inworld_voice_id.

Configuration

Create a client with InworldTTS():

from inworld_tts import InworldTTS

tts = InworldTTS()                   # reads INWORLD_API_KEY from env
tts = InworldTTS(api_key="your_key") # or pass explicitly

Option	Type	Required	Default	Description
`api_key`	`str`	—	`INWORLD_API_KEY` env var	Inworld API key.
`base_url`	`str`	No	`https://api.inworld.ai`	Override the API base URL.
`timeout`	`int`	No	`120`	Global HTTP timeout in seconds.
`max_retries`	`int`	No	`2`	Retry attempts on `NetworkError` or 5xx. Uses exponential backoff (1s, 2s, 4s… capped at 16s). `0` disables retries.
`max_concurrent_requests`	`int`	No	`4`	Max parallel chunk requests for long-text `generate()`.
`debug`	`bool`	No	`False`	Enable debug logging. Also activated by `DEBUG=inworld-tts` env var.

api_key must be provided directly or through the INWORLD_API_KEY environment variable. If neither is set, a MissingApiKeyError is raised.
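
To make the retry timing concrete, the sketch below computes the wait before each retry from the schedule given for max_retries (1s, 2s, 4s, capped at 16s). It is an illustration of that schedule only; `backoff_delays` is a hypothetical function, and whether the SDK adds jitter or other adjustments is not stated here.

```python
from typing import List


def backoff_delays(max_retries: int = 2, base: float = 1.0, cap: float = 16.0) -> List[float]:
    """Seconds to wait before each retry: base doubles each attempt, up to cap."""
    return [min(base * 2 ** attempt, cap) for attempt in range(max_retries)]
```

With the default of two retries, a failing request would wait 1 second and then 2 seconds before giving up, so the worst case adds roughly 3 seconds on top of the request timeouts.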

Long Text

generate() and generate_with_timestamps() automatically chunk text longer than 2,000 characters and send chunks in parallel (controlled by max_concurrent_requests). The resulting audio is seamlessly concatenated, and timestamp offsets are merged correctly.

stream() and stream_with_timestamps() require text of 2,000 characters or fewer. For longer text with streaming, split the text yourself and call stream() for each segment.
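
One way to split long text for streaming is to pack whole sentences into segments of at most 2,000 characters. The sketch below is an assumption about a reasonable strategy, not the chunking that generate() uses internally; `split_for_streaming` is a hypothetical helper, and a single sentence longer than the limit is simply cut at the limit.

```python
import re
from typing import List


def split_for_streaming(text: str, max_chars: int = 2000) -> List[str]:
    """Split text into segments of at most max_chars, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    segments: List[str] = []
    current = ""
    for sentence in sentences:
        # Hard-cut any single sentence that is itself over the limit.
        while len(sentence) > max_chars:
            if current:
                segments.append(current)
                current = ""
            segments.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # Adding this sentence would overflow, so close the current segment.
            segments.append(current)
            current = sentence
    if current:
        segments.append(current)
    return segments
```

Each returned segment can then be passed to stream() in order, so playback of the first segment can begin while later segments are still waiting to be requested.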

Error Handling

The SDK exports three error classes, all extending InworldTTSError:

from inworld_tts import (
    InworldTTS,
    InworldTTSError,
    ApiError,
    NetworkError,
    MissingApiKeyError,
)

tts = InworldTTS()  # reads INWORLD_API_KEY from env

try:
    audio = tts.generate(text="Hello!", voice="Ashley")
except MissingApiKeyError:
    # No API key provided
    pass
except ApiError as err:
    print(f"HTTP {err.code}: {err.message}", err.details)
except NetworkError as err:
    print(f"Network error: {err}")

Error	When
`MissingApiKeyError`	No `api_key` was provided and `INWORLD_API_KEY` is not set.
`ApiError`	The API returned a 4xx or 5xx response. Includes `.code` (HTTP status) and `.details`.
`NetworkError`	Connection failure or timeout. Automatically retried up to `max_retries` times before being raised.

Next Steps

Voice Cloning

Create a personalized voice clone with just 5 seconds of audio.

Best Practices

Learn tips and tricks for synthesizing high-quality speech.

API Reference

View the complete TTS API specification.
