We enforce rate limits to ensure fair usage and stable performance for all users. Rate limits vary by subscription plan and are applied per account, shared across all API keys.

Rate limits by product

Rate limits and concurrency limits depend on your subscription plan. Please see the Pricing Page for detailed limits by tier:
  • TTS limits — Concurrent generations, WebSocket connections, voice design and cloning limits
  • STT limits — Streaming concurrency
  • Realtime API limits — Concurrent sessions
  • LLM Router limits — Concurrent generations, requests per second
Higher rate limits are available on higher-tier subscription plans — see the Pricing Page to compare tiers. If your use case requires limits beyond what any standard plan offers, please reach out to our team to discuss enterprise options.
Rate limits apply per account and are shared across your API keys.

Handling rate-limited requests

When you exceed your rate limit, the API returns an HTTP 429 Too Many Requests response. Your request is not processed — you need to wait and retry. Retrying immediately or in a tight loop will not help and can make the situation worse. If many clients retry at the same time (a “thundering herd”), they collectively sustain the overload and keep getting rejected. The standard solution is exponential backoff with jitter.

Exponential backoff with jitter

Exponential backoff doubles the delay after each failed attempt, up to a maximum: with a 1-second base delay, the first retry waits 1 second, the second waits 2 seconds, the third waits 4 seconds, and so on. Adding random jitter spreads out retries across clients so they don’t all hit the API at the same instant. The formula for each retry delay, where attempt starts at 0:
delay = min(base_delay × 2^attempt, max_delay) + random(0, jitter)
import time
import random
import requests

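def request_with_backoff(method, url, max_retries=5, **kwargs):
    """Send a request, retrying on HTTP 429 with exponential backoff and jitter."""
    base_delay = 1   # seconds to wait before the first retry
    max_delay = 30   # upper bound on the exponential part of the delay
    jitter = 1       # up to this many extra seconds, chosen at random

    for attempt in range(max_retries + 1):
        response = requests.request(method, url, **kwargs)

        # Anything other than 429 (success or a different error) goes back to the caller.
        if response.status_code != 429:
            return response

        # Out of retries: raise requests.HTTPError for the final 429.
        if attempt == max_retries:
            response.raise_for_status()

        delay = min(base_delay * (2 ** attempt), max_delay)
        delay += random.uniform(0, jitter)
        print(f"Rate limited. Retrying in {delay:.1f}s (attempt {attempt + 1}/{max_retries})")
        time.sleep(delay)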
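
To make the delay formula concrete, the sketch below computes the wait for each retry on its own, without sending any requests. The helper name backoff_delay is illustrative and not part of any API; its defaults mirror the values used in the example above.

```python
import random


def backoff_delay(attempt, base_delay=1.0, max_delay=30.0, jitter=1.0):
    """Return the delay in seconds before retry number `attempt` (0-based)."""
    # Exponential part, capped at max_delay.
    capped = min(base_delay * (2 ** attempt), max_delay)
    # Random part, between 0 and `jitter` seconds.
    return capped + random.uniform(0, jitter)


# With jitter disabled, the schedule shows the doubling and the cap.
schedule = [backoff_delay(a, jitter=0.0) for a in range(7)]
print(schedule)  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

With the default jitter of 1 second, each value in the schedule gains a random amount between 0 and 1 second, so two clients that were rejected at the same moment are unlikely to retry at the same moment.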

Best practices

  • Set a maximum retry count. Don’t retry forever: with a 1-second base delay, 5 retries with exponential backoff wait about 31 seconds in total (1 + 2 + 4 + 8 + 16), plus jitter. If the request still fails, surface the error to the caller.
  • Always add jitter. Without jitter, clients that hit the limit together will retry together, perpetuating the overload.
  • Log retry attempts. Include the attempt number and delay in your logs so you can identify rate-limiting patterns and adjust your request volume.
  • Reduce concurrent requests. If you’re consistently hitting limits, throttle your request rate or use a queue rather than relying on retries alone.
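
The last point can be sketched with a bounded worker pool, which acts as a queue: extra work waits until a worker is free instead of being sent and rejected. This is one possible approach, not a prescribed one. The names run_throttled and MAX_CONCURRENT are hypothetical, and the value 3 is a placeholder; use the concurrency limit of your own plan.

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder value (assumption): replace with your plan's concurrency limit.
MAX_CONCURRENT = 3


def run_throttled(fn, items, max_concurrent=MAX_CONCURRENT):
    """Call fn(item) for every item with at most `max_concurrent` calls in flight.

    Items beyond the limit wait in the pool's internal queue until a worker
    is free. Results are returned in the same order as `items`.
    """
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(fn, items))
```

Here fn could be a small wrapper around request_with_backoff, so that the pool keeps the number of in-flight requests below the limit and backoff handles any 429 responses that still occur.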