# Intro to Realtime STT

> Transcribe audio to text using leading STT providers through a single API.

The Realtime Speech-to-Text (STT) API provides a unified integration point for industry-leading transcription providers. You get consistent authentication, request formatting, and response handling across providers — without managing multiple SDKs or credentials.

The API supports both synchronous transcription for complete audio files and real-time bidirectional streaming over WebSocket for live audio.

<CardGroup cols={3}>
  <Card title="Developer Quickstart" icon="bolt" href="/stt/quickstart">
    Make your first STT API call and get a transcript.
  </Card>

  <Card title="API Reference" icon="code" href="/api-reference/sttAPI/speechtotext/transcribe">
    View the complete API specification.
  </Card>

  <Card title="Code Examples" icon="play" href="https://github.com/inworld-ai/inworld-api-examples/tree/main/stt">
    Browse ready-to-use GitHub samples for sync and real-time STT.
  </Card>
</CardGroup>

<Tip>
  **Using AI to code?** Paste `https://docs.inworld.ai/llms.txt` into your assistant so it knows every page on this site. Want live search? [Add the MCP server](https://docs.inworld.ai/tts/resources/vibe-coding).
</Tip>

## Supported Providers

### Inworld (first-party)

| **Model ID**            | **Endpoints**        | **Best for**                                                                                                                                                      |
| :---------------------- | :------------------- | :---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `inworld/inworld-stt-1` | Sync API + WebSocket | Voice agents and character-driven apps that benefit from transcription plus Voice Profile (age, pitch, emotion, vocal style, accent) and configurable turn-taking |

Supports 30 languages. See [Language Support](/stt/languages) for the full list.

### Groq

| **Model ID**            | **Endpoints** | **Best for**                                     |
| :---------------------- | :------------ | :----------------------------------------------- |
| `groq/whisper-large-v3` | Sync API only | General-purpose transcription for recorded audio |

### AssemblyAI

| **Model ID**                                  | **Endpoints**  | **Best for**                                                                                                     |
| :-------------------------------------------- | :------------- | :--------------------------------------------------------------------------------------------------------------- |
| `assemblyai/universal-streaming-multilingual` | WebSocket only | Multilingual streaming (English, Spanish, French, German, Italian, Portuguese)                                   |
| `assemblyai/universal-streaming-english`      | WebSocket only | English-optimized streaming                                                                                      |
| `assemblyai/u3-rt-pro`                        | WebSocket only | High-accuracy, sub-300ms latency, multilingual streaming (English, Spanish, French, German, Italian, Portuguese) |
| `assemblyai/whisper-rt`                       | WebSocket only | Real-time Whisper transcription                                                                                  |

<Note>
  AssemblyAI models currently support the WebSocket streaming endpoint only. Sync HTTP support is coming soon.
</Note>

### Soniox

| **Model ID**       | **Endpoints**  | **Best for**                                                                                                                                                                       |
| :----------------- | :------------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `soniox/stt-rt-v4` | WebSocket only | High-accuracy real-time streaming with semantic end-of-turn detection and multilingual support                                                                                     |
| `soniox/stt-rt-v5` | WebSocket only | Latest-generation real-time streaming with improved accuracy, speaker separation, faster semantic endpointing, and enhanced alphanumeric recognition (numbers, dates, IDs, emails) |

<Note>
  Soniox models currently support the WebSocket streaming endpoint only.
</Note>

### Deepgram

| **Model ID**                  | **Endpoints**  | **Best for**                                                                                                                                                                                             |
| :---------------------------- | :------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `deepgram/flux-general-en`    | WebSocket only | English conversational voice agents with built-in turn detection, interruption handling, and barge-in awareness                                                                                          |
| `deepgram/flux-general-multi` | WebSocket only | Multilingual conversational voice agents across 10 languages (English, Spanish, French, German, Hindi, Russian, Portuguese, Japanese, Italian, Dutch) with automatic language switching mid-conversation |

<Note>
  Deepgram models currently support the WebSocket streaming endpoint only.
</Note>

For pricing details, see [Billing](/portal/billing#stt) or [inworld.ai/pricing](https://inworld.ai/pricing).

## Model Comparison

| **Feature**         | **inworld/inworld-stt-1**                                    | **groq/whisper-large-v3**                        | **assemblyai/universal-streaming-multilingual**                                | **assemblyai/universal-streaming-english** | **assemblyai/u3-rt-pro**                                                                                | **assemblyai/whisper-rt**                 | **soniox/stt-rt-v4**                                                                           | **soniox/stt-rt-v5**                                                                                 | **deepgram/flux-general-en**                                                               | **deepgram/flux-general-multi**                                                           |
| :------------------ | :----------------------------------------------------------- | :----------------------------------------------- | :----------------------------------------------------------------------------- | :----------------------------------------- | :------------------------------------------------------------------------------------------------------ | :---------------------------------------- | :--------------------------------------------------------------------------------------------- | :--------------------------------------------------------------------------------------------------- | :----------------------------------------------------------------------------------------- | :---------------------------------------------------------------------------------------- |
| Pricing             | [See pricing](https://inworld.ai/pricing)                    | [See pricing](https://inworld.ai/pricing)        | [See pricing](https://inworld.ai/pricing)                                      | [See pricing](https://inworld.ai/pricing)  | [See pricing](https://inworld.ai/pricing)                                                               | [See pricing](https://inworld.ai/pricing) | [See pricing](https://inworld.ai/pricing)                                                      | [See pricing](https://inworld.ai/pricing)                                                            | [See pricing](https://inworld.ai/pricing)                                                  | [See pricing](https://inworld.ai/pricing)                                                 |
| Endpoint            | Sync API + WebSocket                                         | Sync API only                                    | WebSocket only                                                                 | WebSocket only                             | WebSocket only                                                                                          | WebSocket only                            | WebSocket only                                                                                 | WebSocket only                                                                                       | WebSocket only                                                                             | WebSocket only                                                                            |
| Real-time streaming | <Icon icon="check" size={18} />                              | <Icon icon="xmark" size={18} />                  | <Icon icon="check" size={18} />                                                | <Icon icon="check" size={18} />            | <Icon icon="check" size={18} />                                                                         | <Icon icon="check" size={18} />           | <Icon icon="check" size={18} />                                                                | <Icon icon="check" size={18} />                                                                      | <Icon icon="check" size={18} />                                                            | <Icon icon="check" size={18} />                                                           |
| Best for            | Voice agents with Voice Profile and configurable turn-taking | General-purpose transcription for recorded audio | Multilingual streaming (English, Spanish, French, German, Italian, Portuguese) | English-optimized streaming                | High-accuracy, sub-300ms multilingual streaming (English, Spanish, French, German, Italian, Portuguese) | Real-time Whisper transcription           | High-accuracy real-time streaming with semantic end-of-turn detection and multilingual support | Latest-generation streaming with improved accuracy, speaker separation, and alphanumeric recognition | English conversational voice agents with built-in turn detection and interruption handling | Multilingual conversational voice agents with automatic language switching (10 languages) |
| Languages           | 30 languages ([see list](/stt/languages))                    | 100+ (Whisper)                                   | 6 languages                                                                    | English                                    | 6 languages                                                                                             | 100+ (Whisper)                            | 60+ languages                                                                                  | 60+ languages                                                                                        | English                                                                                    | 10 languages                                                                              |
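The endpoint row is often the first filter when choosing a model. A minimal Python sketch of that lookup follows; the model IDs and endpoint support are transcribed from the tables on this page, while the `MODEL_ENDPOINTS` and `models_for` names are illustrative and not part of the API.

```python
# Endpoint support per model, transcribed from the tables on this page.
MODEL_ENDPOINTS = {
    "inworld/inworld-stt-1": {"sync", "websocket"},
    "groq/whisper-large-v3": {"sync"},
    "assemblyai/universal-streaming-multilingual": {"websocket"},
    "assemblyai/universal-streaming-english": {"websocket"},
    "assemblyai/u3-rt-pro": {"websocket"},
    "assemblyai/whisper-rt": {"websocket"},
    "soniox/stt-rt-v4": {"websocket"},
    "soniox/stt-rt-v5": {"websocket"},
    "deepgram/flux-general-en": {"websocket"},
    "deepgram/flux-general-multi": {"websocket"},
}


def models_for(endpoint: str) -> list[str]:
    """Return the model IDs that support the given endpoint, sorted by ID."""
    if endpoint not in ("sync", "websocket"):
        raise ValueError(f"unknown endpoint: {endpoint!r}")
    return sorted(m for m, eps in MODEL_ENDPOINTS.items() if endpoint in eps)


print(models_for("sync"))
# ['groq/whisper-large-v3', 'inworld/inworld-stt-1']
```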

## Supported Audio Formats

| **Format**       | **Sync API**                    | **WebSocket Streaming**         |
| :--------------- | :------------------------------ | :------------------------------ |
| `LINEAR16` (PCM) | <Icon icon="check" size={18} /> | <Icon icon="check" size={18} /> |
| `MP3`            | <Icon icon="check" size={18} /> | <Icon icon="xmark" size={18} /> |
| `OGG_OPUS`       | <Icon icon="check" size={18} /> | <Icon icon="xmark" size={18} /> |
| `FLAC`           | <Icon icon="check" size={18} /> | <Icon icon="xmark" size={18} /> |
| `AUTO_DETECT`    | <Icon icon="check" size={18} /> | <Icon icon="xmark" size={18} /> |

Recommended defaults: 16,000 Hz sample rate, 16-bit depth, mono. For container formats (MP3, FLAC, OGG\_OPUS, WAV), `sampleRateHertz` is optional — the API auto-detects it from the file header.
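For streaming, the recommended defaults translate into a fixed byte rate, which makes it easy to slice raw LINEAR16 audio into equal-duration chunks. The sketch below assumes a 100 ms chunk duration; that value, along with the `chunk_size_bytes` and `pcm_chunks` helpers, is hypothetical and not something this page specifies.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000   # recommended default from this page
BYTES_PER_SAMPLE = 2      # 16-bit depth
CHANNELS = 1              # mono
CHUNK_MS = 100            # hypothetical chunk duration, not from the API docs


def chunk_size_bytes(chunk_ms: int = CHUNK_MS) -> int:
    """Bytes of LINEAR16 audio covering chunk_ms milliseconds."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * chunk_ms // 1000


def pcm_chunks(pcm: bytes, chunk_ms: int = CHUNK_MS) -> Iterator[bytes]:
    """Yield consecutive fixed-duration slices; the last one may be shorter."""
    size = chunk_size_bytes(chunk_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)  # 1 s of silence
print(chunk_size_bytes(), len(list(pcm_chunks(one_second))))
# 3200 10
```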

<Note>
  Sync transcription accepts audio files up to **\~16 MB**. How much audio that covers depends on the encoding: for example, \~18 minutes of MP3 or \~8 minutes of 16 kHz 16-bit mono WAV. For larger files, split them into chunks or use the [WebSocket streaming endpoint](/api-reference/sttAPI/speechtotext/transcribe-stream-websocket).
</Note>
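The duration figures follow from simple arithmetic on the size limit. A rough Python sketch, assuming the limit means 16,000,000 bytes (the page gives only an approximate figure) and a hypothetical 128 kbps constant-bitrate MP3:

```python
SYNC_LIMIT_BYTES = 16_000_000  # assumes decimal megabytes; the page gives only "~16 MB"


def pcm_minutes(size_bytes: int, sample_rate_hz: int = 16_000,
                bit_depth: int = 16, channels: int = 1) -> float:
    """Minutes of uncompressed PCM that fit in size_bytes (file header ignored)."""
    bytes_per_second = sample_rate_hz * (bit_depth // 8) * channels
    return size_bytes / bytes_per_second / 60


def compressed_minutes(size_bytes: int, bitrate_kbps: float) -> float:
    """Minutes of constant-bitrate audio that fit in size_bytes."""
    return size_bytes * 8 / (bitrate_kbps * 1000) / 60


print(round(pcm_minutes(SYNC_LIMIT_BYTES), 1))              # 8.3
print(round(compressed_minutes(SYNC_LIMIT_BYTES, 128), 1))  # 16.7
```

The MP3 result varies with bitrate, so a lower-bitrate file fits more audio under the same limit.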

<Note>
  STT performs best with 16 kHz audio. Lower sample rates (such as 8 kHz telephony audio) contain fewer data points for the model to interpret, which reduces transcription accuracy. Upsampling low-sample-rate audio does not improve quality — it only interpolates between existing samples without adding new information.
</Note>
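To see why upsampling adds nothing, consider a deliberately simple 2x linear upsampler in Python. Production resamplers use band-limited filters rather than midpoints, so treat this as an illustration of the principle only: every inserted value is computed from samples that were already there.

```python
def upsample_2x_linear(samples: list[float]) -> list[float]:
    """Double the sample rate by inserting the midpoint between neighbours."""
    out: list[float] = []
    for current, following in zip(samples, samples[1:]):
        out.append(current)
        out.append((current + following) / 2)  # derived purely from existing data
    if samples:
        out.append(samples[-1])
    return out


low_rate = [0.0, 1.0, 0.0, -1.0]          # e.g. four samples captured at 8 kHz
high_rate = upsample_2x_linear(low_rate)  # seven samples at a nominal 16 kHz
print(high_rate)                   # [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0]
print(high_rate[::2] == low_rate)  # True: the original samples are unchanged
```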

## Endpoints

| **Endpoint**                                                                                               | **Method** | **Description**                                                                  |
| :--------------------------------------------------------------------------------------------------------- | :--------- | :------------------------------------------------------------------------------- |
| [`/stt/v1/transcribe`](/api-reference/sttAPI/speechtotext/transcribe)                                      | POST       | Send complete audio, receive full transcript                                     |
| [`/stt/v1/transcribe:streamBidirectional`](/api-reference/sttAPI/speechtotext/transcribe-stream-websocket) | WebSocket  | Stream audio in real time, receive transcription chunks as they become available |

## Supported Languages

Language support depends on the STT provider. See [Language Support](/stt/languages) for the full list of languages supported by the Inworld first-party model, and links to third-party provider language documentation.

## Error Handling

Errors follow the standard [gRPC status](https://grpc.io/docs/guides/status-codes/) format.

**Authentication error**

```json theme={"system"}
{
  "code": 16,
  "message": "Unauthenticated: invalid or missing API key.",
  "details": []
}
```

**Invalid request**

```json theme={"system"}
{
  "code": 3,
  "message": "Unsupported audio encoding.",
  "details": []
}
```

**Common gRPC status codes**

| **Code** | **Name**             | **Description**                                                   |
| :------- | :------------------- | :---------------------------------------------------------------- |
| `3`      | `INVALID_ARGUMENT`   | Invalid or missing request field (encoding, model ID, audio data) |
| `8`      | `RESOURCE_EXHAUSTED` | Too many concurrent requests (rate limit)                         |
| `16`     | `UNAUTHENTICATED`    | Invalid or missing API key                                        |
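A client can map these codes to names and decide whether a retry makes sense. The Python sketch below uses the error body shape shown in the examples above; treating only `RESOURCE_EXHAUSTED` as retryable is an assumption about sensible client behavior, not documented API guidance, and `classify_error` is an illustrative name.

```python
import json

# Codes and names from the table above.
STATUS_NAMES = {3: "INVALID_ARGUMENT", 8: "RESOURCE_EXHAUSTED", 16: "UNAUTHENTICATED"}
RETRYABLE = {8}  # assumption: only rate-limit errors are worth retrying


def classify_error(body: str) -> tuple[str, bool]:
    """Return (status name, should_retry) for a JSON error body."""
    payload = json.loads(body)
    code = payload.get("code")
    name = STATUS_NAMES.get(code, f"UNKNOWN({code})")
    return name, code in RETRYABLE


body = '{"code": 16, "message": "Unauthenticated: invalid or missing API key.", "details": []}'
print(classify_error(body))
# ('UNAUTHENTICATED', False)
```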

## Best Practices

* **Model choice:** Use `inworld/inworld-stt-1` when you want Voice Profile or Inworld-optimized turn-taking; use the Groq, AssemblyAI, Soniox, or Deepgram models for specific latency or accuracy needs.
* **Audio:** Use MP3 or OGG\_OPUS for file uploads to reduce size; use LINEAR16 for streaming, where it is the only supported format, and whenever you need the highest quality.
* **Streaming:** For the Inworld model with manual turn-taking, send `endTurn` at each turn boundary and `closeStream` when done.
* **Speech events:** Listen for `speechStarted` and `speechStopped` events in the streaming response to detect when a speaker starts and stops talking. Use these to build custom turn-taking logic or visualize voice activity.
* **Voice Profile:** Set `voiceProfileConfig.enableVoiceProfile` to `true` and optionally adjust `topN` (default: 10) to control how many labels per category are returned.
* **Testing:** Test with sample audio in your target language before going to production.
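As one way to use the speech events for voice-activity visualization, the Python sketch below pairs each `speechStarted` with the following `speechStopped`. The `(event_name, timestamp_seconds)` tuple shape is hypothetical; the actual streaming response may deliver these events in a different structure, so adapt the parsing to the API reference.

```python
from typing import Optional


def speech_segments(events: list[tuple[str, float]]) -> list[tuple[float, float]]:
    """Pair each speechStarted with the next speechStopped.

    `events` is a hypothetical (event_name, timestamp_seconds) list; the real
    streaming response may carry these events in a different shape.
    """
    segments: list[tuple[float, float]] = []
    started_at: Optional[float] = None
    for name, at in events:
        if name == "speechStarted" and started_at is None:
            started_at = at
        elif name == "speechStopped" and started_at is not None:
            segments.append((started_at, at))
            started_at = None
    return segments


events = [("speechStarted", 0.4), ("speechStopped", 2.1),
          ("speechStarted", 3.0), ("speechStopped", 4.5)]
print(speech_segments(events))
# [(0.4, 2.1), (3.0, 4.5)]
```

A segment that has started but not yet stopped is left out, which suits a display that draws only completed stretches of speech.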

## Troubleshooting

| **Issue**           | **What to check**                                                                                                                                             |
| :------------------ | :------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| No transcript       | API key, audio encoding matches request, valid audio file                                                                                                     |
| `UNAUTHENTICATED`   | `INWORLD_API_KEY` set correctly and not expired in Portal                                                                                                     |
| `INVALID_ARGUMENT`  | `audioEncoding` matches the actual format (LINEAR16 for raw PCM, MP3 for MP3, etc.)                                                                           |
| Poor quality        | Try a higher-accuracy model; use 16 kHz sample rate (8 kHz telephony audio has fewer data points and will produce lower-quality results); ensure clear speech |
| Large file failures | Split or compress (e.g. MP3/OGG\_OPUS) to stay under the \~16 MB sync upload limit                                                                           |
| No Voice Profile    | Ensure `voiceProfileConfig.enableVoiceProfile` is set to `true` in your request; response may also omit it if the selected model does not support it          |
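When chasing an `INVALID_ARGUMENT` caused by a mismatched encoding, it can help to check what a file actually contains before sending it. The Python sketch below inspects well-known file signatures (the FLAC and Ogg markers, the ID3 tag, and the MPEG frame sync). It is a local heuristic, not part of the API, and `sniff_encoding` is an illustrative name.

```python
from typing import Optional


def sniff_encoding(head: bytes) -> Optional[str]:
    """Guess the encoding name from the first bytes of a file.

    Heuristic only. Raw PCM has no header, so it cannot be detected this way,
    and PCM data may occasionally resemble an MPEG frame sync.
    """
    if head.startswith(b"fLaC"):
        return "FLAC"
    if head.startswith(b"OggS") and b"OpusHead" in head[:64]:
        return "OGG_OPUS"
    if head.startswith(b"ID3"):
        return "MP3"  # ID3v2 tag preceding the MPEG audio frames
    if len(head) >= 2 and head[0] == 0xFF and (head[1] & 0xE0) == 0xE0:
        return "MP3"  # MPEG frame sync: eleven set bits
    return None


print(sniff_encoding(b"fLaC\x00\x00\x00\x22"))
# FLAC
```

To use it on a file, read the first 64 bytes in binary mode and compare the result with the `audioEncoding` value in your request.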

For more help, see the [Inworld Discord community](https://discord.gg/inworld).