You send text, and the server returns audio chunks over HTTP as they are generated. Playback can begin before the full synthesis is complete, significantly reducing time-to-first-audio. This is best for real-time applications, conversational AI, and long-form content: anywhere you want low-latency playback without managing a persistent connection.
For even lower latency with multiple requests in a session, consider the WebSocket API. For tips on optimizing latency, see the latency best practices guide.
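
As a rough illustration of the flow described above, the sketch below sends a synthesis request and consumes the response body incrementally, so audio could be handed to a player before synthesis finishes. The endpoint URL, request fields, and header shown here are placeholders, not the actual API surface; see the API reference below for the real names.

```python
# Minimal HTTP streaming sketch (endpoint and field names are placeholders).
import requests

API_URL = "https://api.example.com/tts/v1/stream"  # placeholder endpoint
headers = {"Authorization": "Bearer YOUR_API_KEY"}
payload = {
    "text": "Streaming lets playback start before synthesis finishes.",
    "voice": "default",  # placeholder voice id
}

# stream=True makes requests yield the body incrementally instead of buffering it.
with requests.post(API_URL, json=payload, headers=headers, stream=True) as resp:
    resp.raise_for_status()
    with open("output.mp3", "wb") as f:
        for chunk in resp.iter_content(chunk_size=4096):
            if chunk:
                # Each chunk is audio bytes; hand them to an audio player here
                # for true low-latency playback instead of writing to disk.
                f.write(chunk)
```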

Timestamp Transport Strategy

When using timestamp alignment, you can choose how timestamps are delivered alongside audio using timestampTransportStrategy:
  • SYNC (default): Each chunk contains both audio and timestamps together.
  • ASYNC: Audio chunks arrive first, with timestamps following in separate trailing messages. This reduces time-to-first-audio with TTS 1.5 models.
See Timestamps for details on how each mode works.
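
To make the ASYNC mode concrete, here is a sketch of consuming a stream with timestampTransportStrategy set to ASYNC. It assumes, purely for illustration, that the stream is newline-delimited JSON in which each message carries either base64-encoded audio or a timestamps payload; the actual message format is documented in the Timestamps guide. The key point is that under ASYNC the consumer must keep reading after the audio ends, because the timestamp messages trail the audio chunks.

```python
# Sketch of consuming an ASYNC-timestamp stream.
# Assumes a newline-delimited JSON stream where each message holds either
# base64 audio or timestamps; endpoint and field names are placeholders.
import base64
import json
import requests

API_URL = "https://api.example.com/tts/v1/stream"  # placeholder endpoint
headers = {"Authorization": "Bearer YOUR_API_KEY"}
payload = {
    "text": "Hello there.",
    "timestampTransportStrategy": "ASYNC",  # audio first, timestamps trail
}

audio = bytearray()
timestamps = []

with requests.post(API_URL, json=payload, headers=headers, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue
        msg = json.loads(line)
        if "audio" in msg:
            # Audio arrives first under ASYNC; playback can start immediately.
            audio.extend(base64.b64decode(msg["audio"]))
        elif "timestamps" in msg:
            # Trailing messages carry the alignment data.
            timestamps.extend(msg["timestamps"])

print(f"received {len(audio)} audio bytes and {len(timestamps)} timestamp entries")
```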

API Reference

Synthesize Speech Stream: view the complete API specification.

Next Steps