You open a persistent WebSocket connection and send text messages. The server streams audio chunks back over the same connection — no per-request overhead, no repeated handshakes — which minimizes per-request latency. Best for voice agents and interactive applications that send many synthesis requests in a session, where avoiding connection setup on every call makes a measurable difference.
If you only need a single request-response with chunked audio, the Streaming API is simpler to integrate. For tips on optimizing latency, see the latency best practices guide.
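To make the flow above concrete, here is a minimal sketch of client-side message framing and chunk accumulation. The field names (`requestId`, `text`, `audioHex`) and the hex audio encoding are illustrative assumptions, not the actual wire format — see the API reference for the real message schema.

```python
import json

def build_synthesis_request(text: str, request_id: int) -> str:
    """Frame one text message to send over the already-open WebSocket.

    Field names here are assumptions for illustration only.
    """
    return json.dumps({"requestId": request_id, "text": text})

def accumulate_audio(messages: list) -> dict:
    """Group streamed audio chunks by request, in arrival order.

    Assumes each server message is JSON with a requestId and
    hex-encoded audio under 'audioHex' (hypothetical fields).
    """
    audio = {}
    for raw in messages:
        msg = json.loads(raw)
        chunk = bytes.fromhex(msg["audioHex"])
        audio[msg["requestId"]] = audio.get(msg["requestId"], b"") + chunk
    return audio

# Multiple requests reuse the single connection: just send more frames.
frames = [build_synthesis_request(t, i) for i, t in enumerate(["Hi.", "Bye."])]
```

Because the connection stays open, the second and later requests pay no handshake cost — only the cost of sending one more frame.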

Timestamp Transport Strategy

When using timestamp alignment, you can choose how timestamps are delivered alongside audio using timestampTransportStrategy:
  • SYNC (default): Each chunk contains both audio and timestamps together.
  • ASYNC: Audio chunks arrive first, with timestamps following in separate trailing messages. This reduces time-to-first-audio with TTS 1.5 models.
See Timestamps for details on how each mode works.
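One way to picture the difference: in SYNC mode every message already carries both audio and timestamps, while in ASYNC mode the consumer must pair trailing timestamp messages with audio chunks that arrived earlier. A sketch of that pairing, assuming a hypothetical message shape (`chunkIndex`, `audio`, `timestamps` keys) rather than the actual wire format:

```python
import json

def pair_async_messages(messages):
    """Pair ASYNC-mode audio chunks with their trailing timestamp messages.

    Assumes each message is JSON carrying a chunkIndex plus either an
    'audio' field or a 'timestamps' field (hypothetical shape).
    """
    chunks = {}
    for raw in messages:
        msg = json.loads(raw)
        entry = chunks.setdefault(msg["chunkIndex"], {"audio": None, "timestamps": None})
        if "audio" in msg:
            entry["audio"] = msg["audio"]
        if "timestamps" in msg:
            entry["timestamps"] = msg["timestamps"]
    return chunks

# ASYNC ordering: audio for chunks 0 and 1 arrives first, timestamps trail.
stream = [
    json.dumps({"chunkIndex": 0, "audio": "AAA="}),
    json.dumps({"chunkIndex": 1, "audio": "BBB="}),
    json.dumps({"chunkIndex": 0, "timestamps": [{"word": "Hi", "start": 0.0}]}),
    json.dumps({"chunkIndex": 1, "timestamps": [{"word": "there", "start": 0.4}]}),
]
paired = pair_async_messages(stream)
```

With ASYNC, playback can begin as soon as a chunk's audio arrives, and the timestamps are attached when their trailing message shows up — which is where the time-to-first-audio improvement comes from.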

Code Examples
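A minimal end-to-end session sketch. The URL, auth header, and message shapes are assumptions for illustration — consult the API reference for the real specification. The third-party `websockets` import is deferred into the coroutine so the parsing helper can be used without the package installed.

```python
import asyncio
import json

def parse_server_message(raw: str):
    """Split one server message into (audio bytes, optional timestamps).

    Assumes hex-encoded audio under 'audioHex' and an optional
    'timestamps' list (hypothetical field names).
    """
    msg = json.loads(raw)
    return bytes.fromhex(msg.get("audioHex", "")), msg.get("timestamps")

async def synthesize_session(texts, url="wss://example.invalid/tts", api_key="YOUR_KEY"):
    """Send several synthesis requests over one persistent connection."""
    # Deferred import: requires the third-party 'websockets' package.
    import websockets
    audio = b""
    async with websockets.connect(url, extra_headers={"Authorization": api_key}) as ws:
        for i, text in enumerate(texts):
            # Every request reuses the same connection -- no new handshake.
            await ws.send(json.dumps({"requestId": i, "text": text}))
        async for raw in ws:
            chunk, timestamps = parse_server_message(raw)
            audio += chunk
            # A real client would stop on a completion message; this
            # sketch simply reads until the server closes the connection.
    return audio

if __name__ == "__main__":
    asyncio.run(synthesize_session(["Hello, world."]))
```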

API Reference

Synthesize Speech WebSocket

View the complete API specification

Next Steps