Concurrency Limits

Concurrency limits specify the guaranteed number of TTS generations that can be in progress at the same time for your account. They apply across both HTTP and WebSocket requests. Requests above these limits are served on a best-effort basis, depending on available capacity.

Default Limits

Plan	Concurrent generations
On-Demand	5
Creator	10
Builder	50
Developer	150
Growth	500
Enterprise	Custom

How Concurrency Is Counted

The way a generation counts toward your limit depends on the protocol:

HTTP requests: Each in-flight request counts as one concurrent generation. The slot is held for the duration of the request. Exceeding the limit returns a 429 error.
WebSocket contexts: Each active context counts as one concurrent generation. A context is considered active while packets are being exchanged — for example, send_text from the client or audio chunks from the server. Once a context goes idle, it stops counting toward your limit. See Errors for details on how limit errors are surfaced within a WebSocket connection.
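Because each in-flight HTTP request holds one slot until it completes, one way to stay within your guaranteed limit on the client side is to cap in-flight requests with a semaphore. The following is a minimal sketch using Python's asyncio. `fake_tts_request` is a hypothetical stand-in for your real HTTP call, and `CONCURRENCY_LIMIT` is set to the On-Demand value of 5 purely for illustration.

```python
import asyncio

# Assumed limit for illustration (On-Demand default); use your plan's value.
CONCURRENCY_LIMIT = 5


async def fake_tts_request(text: str, state: dict) -> bytes:
    """Hypothetical stand-in for an HTTP TTS call; tracks in-flight requests."""
    state["in_flight"] += 1
    state["peak"] = max(state["peak"], state["in_flight"])
    try:
        await asyncio.sleep(0.01)  # simulate generation time
        return text.encode()
    finally:
        state["in_flight"] -= 1


async def synthesize_all(texts: list[str], limit: int = CONCURRENCY_LIMIT) -> tuple[list[bytes], int]:
    """Run all requests, never allowing more than `limit` in flight at once."""
    semaphore = asyncio.Semaphore(limit)
    state = {"in_flight": 0, "peak": 0}

    async def guarded(text: str) -> bytes:
        # Each request holds one semaphore slot for its full duration,
        # mirroring how an in-flight HTTP request holds one generation slot.
        async with semaphore:
            return await fake_tts_request(text, state)

    results = await asyncio.gather(*(guarded(t) for t in texts))
    return list(results), state["peak"]


if __name__ == "__main__":
    audio, peak = asyncio.run(synthesize_all([f"line {i}" for i in range(20)]))
    print(len(audio), peak)
```

Capping at the guaranteed limit trades away any best-effort capacity above it in exchange for avoiding 429 errors; whether that trade is worthwhile depends on your traffic.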

Concurrent Connections

Concurrent connections refer to the number of WebSocket connections your account can maintain simultaneously. This is a separate limit from concurrent generations, set at 10× your concurrent generation limit. Exceeding it returns an HTTP 429 error.
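The 10× relationship between the two limits can be expressed as a small lookup. This is a minimal sketch using the default values from the table; the function and constant names are hypothetical, and Enterprise returns `None` because its limits are custom.

```python
from typing import Optional

# Default concurrent generation limits per plan; None means custom (Enterprise).
DEFAULT_GENERATION_LIMITS: dict[str, Optional[int]] = {
    "On-Demand": 5,
    "Creator": 10,
    "Builder": 50,
    "Developer": 150,
    "Growth": 500,
    "Enterprise": None,
}

# Concurrent WebSocket connections are capped at 10x the generation limit.
CONNECTION_MULTIPLIER = 10


def connection_limit(generation_limit: int) -> int:
    """Return the concurrent connection limit for a given generation limit."""
    return CONNECTION_MULTIPLIER * generation_limit


def plan_limits(plan: str) -> tuple[Optional[int], Optional[int]]:
    """Return (concurrent generations, concurrent connections) for a plan."""
    generations = DEFAULT_GENERATION_LIMITS[plan]
    if generations is None:
        return None, None  # custom limits; not derivable from the defaults
    return generations, connection_limit(generations)


if __name__ == "__main__":
    for plan in DEFAULT_GENERATION_LIMITS:
        print(plan, plan_limits(plan))
```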

How Many Concurrent Generations Do You Actually Need?

For conversational use cases such as voice agents, most applications need far fewer concurrent generations than expected. Because speech is generated much faster than it is spoken, a single generation slot can serve multiple conversations.

The diagram for this section shows three simultaneous conversations and how their generation windows interact. Most generation windows are brief and staggered, so they rarely overlap. The periods marked in red as "2 slots counted" are the moments when two generation windows overlap; only these count as 2 concurrent generations. The rest of the time, a single slot handles all three conversations.

We typically find that you can handle at least 4× as many simultaneous conversations as your concurrent generation limit. Exact numbers vary with your use case, conversation patterns, and response lengths.
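To make the overlap counting concrete, the sketch below computes the peak number of simultaneous generation windows with a sweep over start and end events. The window timings are invented for illustration and are not measurements from any real workload. Windows are treated as half-open intervals, so a generation that ends exactly when another starts does not add a second slot.

```python
def peak_concurrency(windows: list[tuple[float, float]]) -> int:
    """Return the maximum number of generation windows active at any instant.

    Each window is (start, end) in seconds, treated as half-open [start, end).
    """
    events = []
    for start, end in windows:
        events.append((start, 1))   # generation starts: takes a slot
        events.append((end, -1))    # generation ends: releases the slot
    # Tuples sort by time first; at equal times, -1 sorts before +1,
    # so releases are processed before new starts.
    events.sort()
    active = peak = 0
    for _, delta in events:
        active += delta
        peak = max(peak, active)
    return peak


# Illustrative (made-up) generation windows for three conversations.
conversation_a = [(0.0, 0.4), (3.0, 3.5), (7.0, 7.3)]
conversation_b = [(1.0, 1.5), (3.3, 3.8), (8.0, 8.4)]
conversation_c = [(2.0, 2.6), (5.0, 5.5), (9.0, 9.2)]

if __name__ == "__main__":
    print(peak_concurrency(conversation_a + conversation_b + conversation_c))
```

With these made-up timings, only conversations A and B overlap (between 3.3 s and 3.5 s), so three conversations never need more than 2 concurrent generations.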

Idle WebSocket Contexts

A WebSocket TTS context only counts toward your limit while it is active — that is, while packets are being exchanged (e.g., send_text from the client or audio chunks from the server). Closing a context releases its slot immediately once all audio chunks have been returned. If a context is left open, it becomes idle shortly after activity stops and no longer counts toward your limit.
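If you want a rough client-side estimate of how many of your contexts are currently counting toward the limit, you can model the active/idle rule locally. This is a minimal sketch: the class is hypothetical, and `idle_timeout` is an assumed value, because the exact server-side idle threshold is described here only as "shortly after activity stops". The server's own accounting remains authoritative.

```python
import time
from typing import Callable


class ContextActivityTracker:
    """Client-side model of which WebSocket contexts count toward the limit.

    A context counts as active if a packet was exchanged within the last
    `idle_timeout` seconds (an assumed threshold, not a documented one).
    """

    def __init__(self, idle_timeout: float, clock: Callable[[], float] = time.monotonic):
        self.idle_timeout = idle_timeout
        self.clock = clock
        self.last_activity: dict[str, float] = {}

    def record_packet(self, context_id: str) -> None:
        # Call for send_text from the client or audio chunks from the server.
        self.last_activity[context_id] = self.clock()

    def close(self, context_id: str) -> None:
        # Call after the final audio chunk for the context has arrived,
        # which is when closing actually releases the slot.
        self.last_activity.pop(context_id, None)

    def active_count(self) -> int:
        now = self.clock()
        return sum(1 for t in self.last_activity.values() if now - t < self.idle_timeout)
```

Injecting `clock` makes the model easy to test with a fake time source instead of waiting on real time.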

Errors

When your account exceeds its concurrent generation limit, the affected request is rejected with an error JSON object whose "code" field is 8:

{
  "error": {
    "code": 8,
    "message": "request failed: rpc error: code = ResourceExhausted desc = maximum allowed number of active WebSocket TTS contexts: 5 is reached",
    "details": []
  }
}

Your WebSocket connection and other active contexts are unaffected.
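Because the connection stays open, your client needs to recognize this error among the other messages it receives. The following is a minimal sketch that inspects a raw JSON string; how messages are read off the connection is left to your client and is not shown. Code 8 matches gRPC's RESOURCE_EXHAUSTED status, which also appears in the message text.

```python
import json

# Error code returned when the concurrent generation limit is exceeded.
RESOURCE_EXHAUSTED = 8


def is_concurrency_limit_error(raw: str) -> bool:
    """Return True if a raw JSON message is a concurrency-limit error (code 8)."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False
    error = payload.get("error") if isinstance(payload, dict) else None
    return isinstance(error, dict) and error.get("code") == RESOURCE_EXHAUSTED
```

Checking the numeric code rather than matching on the message text keeps the check independent of wording changes in the message.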

If you close and reopen contexts per agent turn, the error occurs on create_context, and the context is not created. Any messages sent after that create_context may have been dropped.
If you keep contexts open while idle, the error can occur on any message that resumes activity (e.g., send_text, flush). Any messages sent since the first send_text after the context went idle may have been dropped.
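Since messages sent after the failing one may have been dropped, one approach is to buffer each turn's messages (from create_context, or from the first send_text after idle) and resend the whole buffer after a backoff. The following is a minimal sketch using generic exponential backoff with full jitter, a common technique that is not necessarily the strategy in "Handling rate-limited requests". `ConcurrencyLimitError`, `send`, and the message dictionaries are hypothetical. The sketch also assumes that resending the buffered messages is safe for your client; that holds when create_context failed, because the context was never created.

```python
import random
import time
from typing import Callable


class ConcurrencyLimitError(Exception):
    """Hypothetical exception your client raises when it receives error code 8."""


def send_with_retry(
    messages: list[dict],
    send: Callable[[dict], None],
    max_attempts: int = 5,
    base_delay: float = 0.25,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send a turn's buffered messages, resending all of them after a code-8 error.

    Returns the number of attempts used; re-raises after `max_attempts` failures.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 0
    while True:
        attempt += 1
        try:
            for message in messages:
                send(message)
            return attempt
        except ConcurrencyLimitError:
            if attempt >= max_attempts:
                raise
            # Full jitter: wait a random time up to base_delay * 2^(attempt - 1).
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
```

Passing `sleep` as a parameter lets tests (or an async wrapper) substitute their own delay handling.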

For retry strategies when hitting rate limits, see Handling rate-limited requests.
