You open a persistent WebSocket connection and send text messages. The server streams audio chunks back over the same connection — no per-request overhead, no repeated handshakes. This gives you the lowest possible latency. Best for voice agents and interactive applications that send multiple synthesis requests in a session, where avoiding connection setup on every call makes a measurable difference.
If you only need a single request-response with chunked audio, the Streaming API is simpler to integrate. For tips on optimizing latency, see the latency best practices guide.
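The value of the persistent connection is that every synthesis request in a session reuses the same socket. As a rough sketch (the message field names such as "text" and "voice" are assumptions for illustration, not the documented schema):

```python
import json

def make_synthesis_request(text, voice="default"):
    """Build one text message to send over the already-open WebSocket.
    Field names here are hypothetical placeholders."""
    return json.dumps({"text": text, "voice": voice})

def session_messages(utterances, voice="default"):
    """Multiple synthesis requests share one persistent connection:
    each message is simply sent in turn, with no reconnect per call."""
    return [make_synthesis_request(t, voice) for t in utterances]

# In a real client these would be sent over a single connection, e.g.
# with the third-party `websockets` package:
#   async with websockets.connect(WS_URL) as ws:
#       for msg in session_messages(["Hello.", "How can I help?"]):
#           await ws.send(msg)
#           ...  # read the streamed audio chunks for this request

msgs = session_messages(["Hello.", "How can I help?"])
```

The point of the sketch is the shape of the session: two requests, zero extra handshakes.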

Timestamp Transport Strategy

When using timestamp alignment, you can choose how timestamps are delivered alongside audio using timestampTransportStrategy:
  • SYNC (default): Each chunk contains both audio and timestamps together.
  • ASYNC: Audio chunks arrive first, with timestamps following in separate trailing messages. This reduces time-to-first-audio with TTS 1.5 models.
See Timestamps for details on how each mode works.
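With the ASYNC strategy the client has to reassociate trailing timestamp messages with the audio that already arrived. A minimal sketch of that demultiplexing, assuming hypothetical message fields ("type", "audio", "timestamps") since the exact wire schema is in the API reference:

```python
import json

def collect_async_result(messages):
    """Fold one request's stream of server messages into
    (audio_chunks, timestamps). Under the ASYNC strategy, audio-only
    chunks arrive first and timestamp messages trail behind them;
    the field names used here are assumptions for illustration."""
    audio, stamps = [], []
    for raw in messages:
        msg = json.loads(raw)
        if msg.get("type") == "audio":
            audio.append(msg["audio"])        # playable immediately
        elif msg.get("type") == "timestamps":
            stamps.extend(msg["timestamps"])  # arrives after the audio
    return audio, stamps

# Simulated ASYNC stream: two audio chunks, then one trailing
# timestamp message.
stream = [
    json.dumps({"type": "audio", "audio": "chunk-1"}),
    json.dumps({"type": "audio", "audio": "chunk-2"}),
    json.dumps({"type": "timestamps",
                "timestamps": [{"word": "hello", "start": 0.0, "end": 0.3}]}),
]
audio, stamps = collect_async_result(stream)
```

Because playback can begin as soon as the first audio-only chunk lands, this ordering is what lowers time-to-first-audio relative to SYNC mode.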

Code Examples

JavaScript

View our JavaScript implementation example

Python

View our Python implementation example

API Reference

Synthesize Speech WebSocket

View the complete API specification

Next Steps

Voice Cloning Best Practices

Learn best practices for producing high-quality voice clones.

Speech Generation Best Practices

Learn best practices for synthesizing high-quality speech.

API Examples

Explore Python and JavaScript code examples for TTS integration.