
# Intro to Realtime STT

> Transcribe audio to text using leading STT providers through a single API.

The Realtime Speech-to-Text (STT) API provides a unified integration point for industry-leading transcription providers. You get consistent authentication, request formatting, and response handling across providers — without managing multiple SDKs or credentials.

The API supports both synchronous transcription for complete audio files and real-time bidirectional streaming over WebSocket for live audio.

<CardGroup cols={3}>
  <Card title="Developer Quickstart" icon="bolt" href="/stt/quickstart">
    Make your first STT API call and get a transcript.
  </Card>

  <Card title="API Reference" icon="code" href="/api-reference/sttAPI/speechtotext/transcribe">
    View the complete API specification.
  </Card>

  <Card title="Code Examples" icon="play" href="https://github.com/inworld-ai/inworld-api-examples/tree/main/stt">
    Browse ready-to-use GitHub samples for sync and real-time STT.
  </Card>
</CardGroup>

## Supported Providers

### Inworld (first-party) — Experimental

| **Model ID**            | **Endpoints**        | **Best for**                                                                                                                                                      |
| :---------------------- | :------------------- | :---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `inworld/inworld-stt-1` | Sync API + WebSocket | Voice agents and character-driven apps that benefit from transcription plus Voice Profile (age, pitch, emotion, vocal style, accent) and configurable turn-taking |

<Note>
  The Inworld first-party model is currently Experimental. Features and pricing are subject to change.
</Note>

`inworld/inworld-stt-1` supports English, plus 29 additional languages in Experimental mode. See [Supported Languages](#supported-languages) for the full list.

### Groq

| **Model ID**            | **Endpoints** | **Best for**                                     |
| :---------------------- | :------------ | :----------------------------------------------- |
| `groq/whisper-large-v3` | Sync API only | General-purpose transcription for recorded audio |

### AssemblyAI

| **Model ID**                                  | **Endpoints**  | **Best for**                                                                                                     |
| :-------------------------------------------- | :------------- | :--------------------------------------------------------------------------------------------------------------- |
| `assemblyai/universal-streaming-multilingual` | WebSocket only | Multilingual streaming (English, Spanish, French, German, Italian, Portuguese)                                   |
| `assemblyai/universal-streaming-english`      | WebSocket only | English-optimized streaming                                                                                      |
| `assemblyai/u3-rt-pro`                        | WebSocket only | High-accuracy, sub-300ms latency, multilingual streaming (English, Spanish, French, German, Italian, Portuguese) |
| `assemblyai/whisper-rt`                       | WebSocket only | Real-time Whisper transcription                                                                                  |

<Note>
  AssemblyAI models currently support the WebSocket streaming endpoint only. Sync HTTP support is coming soon.
</Note>

### Soniox

| **Model ID**       | **Endpoints**  | **Best for**                                                                                   |
| :----------------- | :------------- | :--------------------------------------------------------------------------------------------- |
| `soniox/stt-rt-v4` | WebSocket only | High-accuracy real-time streaming with semantic end-of-turn detection and multilingual support |

<Note>
  Soniox models currently support the WebSocket streaming endpoint only.
</Note>

For pricing details, see [Billing](/stt/resources/billing) or [inworld.ai/pricing](https://inworld.ai/pricing).

## Model Comparison

| **Feature**         | **inworld/inworld-stt-1**                                    | **groq/whisper-large-v3**                        | **assemblyai/universal-streaming-multilingual**                                | **assemblyai/universal-streaming-english** | **assemblyai/u3-rt-pro**                                                                                | **assemblyai/whisper-rt**                 | **soniox/stt-rt-v4**                                                                           |
| :------------------ | :----------------------------------------------------------- | :----------------------------------------------- | :----------------------------------------------------------------------------- | :----------------------------------------- | :------------------------------------------------------------------------------------------------------ | :---------------------------------------- | :--------------------------------------------------------------------------------------------- |
| Pricing             | [See pricing](https://inworld.ai/pricing)                    | [See pricing](https://inworld.ai/pricing)        | [See pricing](https://inworld.ai/pricing)                                      | [See pricing](https://inworld.ai/pricing)  | [See pricing](https://inworld.ai/pricing)                                                               | [See pricing](https://inworld.ai/pricing) | [See pricing](https://inworld.ai/pricing)                                                      |
| Endpoint            | Sync API + WebSocket                                         | Sync API only                                    | WebSocket only                                                                 | WebSocket only                             | WebSocket only                                                                                          | WebSocket only                            | WebSocket only                                                                                 |
| Real-time streaming | <Icon icon="check" size={18} />                              | <Icon icon="xmark" size={18} />                  | <Icon icon="check" size={18} />                                                | <Icon icon="check" size={18} />            | <Icon icon="check" size={18} />                                                                         | <Icon icon="check" size={18} />           | <Icon icon="check" size={18} />                                                                |
| Best for            | Voice agents with Voice Profile and configurable turn-taking | General-purpose transcription for recorded audio | Multilingual streaming (English, Spanish, French, German, Italian, Portuguese) | English-optimized streaming                | High-accuracy, sub-300ms multilingual streaming (English, Spanish, French, German, Italian, Portuguese) | Real-time Whisper transcription           | High-accuracy real-time streaming with semantic end-of-turn detection and multilingual support |
| Languages           | English; 29 Experimental ([see below](#supported-languages)) | 100+ (Whisper)                                   | 6 languages                                                                    | English                                    | 6 languages                                                                                             | 100+ (Whisper)                            | Multilingual                                                                                   |
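The endpoint rows of the table can be captured in a small lookup, which helps fail fast when a model is paired with an endpoint it does not support. The Python sketch below is illustrative only: `SttModel` and `models_for` are hypothetical helper names, not part of any Inworld SDK. The model IDs and endpoint support are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SttModel:
    """Hypothetical record mirroring one column of the comparison table."""
    model_id: str
    sync: bool       # usable with the Sync API (POST)
    streaming: bool  # usable with the WebSocket endpoint


# Endpoint support copied from the comparison table.
MODELS = [
    SttModel("inworld/inworld-stt-1", sync=True, streaming=True),
    SttModel("groq/whisper-large-v3", sync=True, streaming=False),
    SttModel("assemblyai/universal-streaming-multilingual", sync=False, streaming=True),
    SttModel("assemblyai/universal-streaming-english", sync=False, streaming=True),
    SttModel("assemblyai/u3-rt-pro", sync=False, streaming=True),
    SttModel("assemblyai/whisper-rt", sync=False, streaming=True),
    SttModel("soniox/stt-rt-v4", sync=False, streaming=True),
]


def models_for(endpoint: str) -> list[str]:
    """Return the model IDs that support `endpoint` ("sync" or "streaming")."""
    if endpoint not in ("sync", "streaming"):
        raise ValueError(f"unknown endpoint kind: {endpoint!r}")
    return [m.model_id for m in MODELS if getattr(m, endpoint)]
```

For example, `models_for("sync")` returns only the Inworld and Groq model IDs, since every other model in the table is WebSocket only.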

## Supported Audio Formats

| **Format**       | **Sync API**                    | **WebSocket Streaming**         |
| :--------------- | :------------------------------ | :------------------------------ |
| `LINEAR16` (PCM) | <Icon icon="check" size={18} /> | <Icon icon="check" size={18} /> |
| `MP3`            | <Icon icon="check" size={18} /> | <Icon icon="xmark" size={18} /> |
| `OGG_OPUS`       | <Icon icon="check" size={18} /> | <Icon icon="xmark" size={18} /> |
| `FLAC`           | <Icon icon="check" size={18} /> | <Icon icon="xmark" size={18} /> |
| `AUTO_DETECT`    | <Icon icon="check" size={18} /> | <Icon icon="xmark" size={18} /> |

Recommended defaults: 16,000 Hz sample rate, 16-bit depth, mono. For container formats (MP3, FLAC, OGG\_OPUS, WAV), `sampleRateHertz` is optional — the API auto-detects it from the file header.

<Note>
  STT performs best with 16 kHz audio. Lower sample rates (such as 8 kHz telephony audio) contain fewer data points for the model to interpret, which reduces transcription accuracy. Upsampling low-sample-rate audio does not improve quality — it only interpolates between existing samples without adding new information.
</Note>
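Before uploading a WAV file, it can be useful to check it against the recommended defaults (16,000 Hz, 16-bit, mono). The sketch below uses only Python's standard `wave` module. `check_wav` and `make_silent_wav` are hypothetical helper names introduced for illustration, and the check runs locally without calling the API.

```python
import io
import wave

RECOMMENDED_RATE_HZ = 16_000
RECOMMENDED_SAMPLE_WIDTH_BYTES = 2  # 16-bit
RECOMMENDED_CHANNELS = 1            # mono


def check_wav(data: bytes) -> list[str]:
    """Return one warning per property that deviates from the recommended defaults."""
    warnings = []
    with wave.open(io.BytesIO(data), "rb") as wav:
        if wav.getframerate() != RECOMMENDED_RATE_HZ:
            warnings.append(f"sample rate is {wav.getframerate()} Hz, recommended is 16000 Hz")
        if wav.getsampwidth() != RECOMMENDED_SAMPLE_WIDTH_BYTES:
            warnings.append(f"sample width is {wav.getsampwidth() * 8}-bit, recommended is 16-bit")
        if wav.getnchannels() != RECOMMENDED_CHANNELS:
            warnings.append(f"{wav.getnchannels()} channels found, recommended is mono")
    return warnings


def make_silent_wav(rate_hz: int, channels: int = 1, seconds: float = 0.1) -> bytes:
    """Build a short silent 16-bit WAV in memory (used here only to exercise check_wav)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00\x00" * channels * int(rate_hz * seconds))
    return buf.getvalue()
```

A warning about sample rate is a signal to re-record or re-export at 16 kHz where possible. As noted above, upsampling an 8 kHz recording does not recover lost detail.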

## Endpoints

| **Endpoint**                                                                                               | **Method** | **Description**                                                                  |
| :--------------------------------------------------------------------------------------------------------- | :--------- | :------------------------------------------------------------------------------- |
| [`/stt/v1/transcribe`](/api-reference/sttAPI/speechtotext/transcribe)                                      | POST       | Send complete audio, receive full transcript                                     |
| [`/stt/v1/transcribe:streamBidirectional`](/api-reference/sttAPI/speechtotext/transcribe-stream-websocket) | WebSocket  | Stream audio in real time, receive transcription chunks as they become available |
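As a rough illustration of the synchronous endpoint, the Python sketch below builds (but does not send) a `POST` request to `/stt/v1/transcribe` using only the standard library. Several details are assumptions rather than confirmed API facts: the base URL, the `Basic` authorization scheme, and the body field names `modelId` and `audio` are hypothetical placeholders. Only `audioEncoding`, the path, and the model ID come from this page. Check the API Reference for the actual request schema.

```python
import base64
import json
import urllib.request

BASE_URL = "https://api.inworld.ai"  # assumption: confirm the host in the API Reference


def build_transcribe_request(
    audio: bytes, model_id: str, audio_encoding: str, api_key: str
) -> urllib.request.Request:
    """Build a POST request for the sync endpoint without sending it."""
    body = {
        "modelId": model_id,              # hypothetical field name
        "audioEncoding": audio_encoding,  # field name taken from this page
        "audio": base64.b64encode(audio).decode("ascii"),  # hypothetical field name
    }
    return urllib.request.Request(
        BASE_URL + "/stt/v1/transcribe",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Basic {api_key}",  # assumption: auth scheme may differ
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Once the field names are confirmed, the request could be sent with `urllib.request.urlopen(request)`, reading the key from the `INWORLD_API_KEY` environment variable rather than hard-coding it.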

## Supported Languages

Language support depends on the STT provider. See [Model comparison](#model-comparison) above for more details.

### Inworld first-party model (`inworld/inworld-stt-1`)

**Available:**

* English (en)

**Experimental:**

<Columns>
  <Column>
    * Spanish (es)
    * French (fr)
    * German (de)
    * Italian (it)
    * Portuguese (pt)
    * Dutch (nl)
    * Russian (ru)
    * Chinese (zh)
    * Japanese (ja)
    * Korean (ko)
    * Arabic (ar)
    * Hindi (hi)
    * Turkish (tr)
    * Polish (pl)
    * Swedish (sv)
  </Column>

  <Column>
    * Cantonese (yue)
    * Indonesian (id)
    * Thai (th)
    * Vietnamese (vi)
    * Malay (ms)
    * Danish (da)
    * Finnish (fi)
    * Czech (cs)
    * Filipino (fil)
    * Persian (fa)
    * Greek (el)
    * Hungarian (hu)
    * Macedonian (mk)
    * Romanian (ro)
  </Column>
</Columns>

Use `language` when you want to force recognition for a known language. Omit `language` to allow auto-detection when supported.
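One way to handle the optional `language` field is to add it to the request only when the caller supplies a code, and to validate it locally for the Inworld model. In the sketch below, `language_fields` is a hypothetical helper, the language codes are copied from the list above, and the assumption is that `language` takes one of those codes as its value.

```python
from typing import Optional

# Codes copied from the Inworld first-party language list (1 available + 29 experimental).
INWORLD_LANGUAGES = {
    "en", "es", "fr", "de", "it", "pt", "nl", "ru", "zh", "ja",
    "ko", "ar", "hi", "tr", "pl", "sv", "yue", "id", "th", "vi",
    "ms", "da", "fi", "cs", "fil", "fa", "el", "hu", "mk", "ro",
}


def language_fields(model_id: str, language: Optional[str] = None) -> dict:
    """Return the fields to merge into a request: empty when auto-detection is wanted."""
    if language is None:
        return {}  # omit `language` so the API can auto-detect when supported
    if model_id == "inworld/inworld-stt-1" and language not in INWORLD_LANGUAGES:
        raise ValueError(f"{language!r} is not listed for {model_id}")
    return {"language": language}
```

Validation is applied only to `inworld/inworld-stt-1` because this page lists exact codes for that model alone. Other providers are passed through unchanged.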

## Error Handling

Errors follow the standard [gRPC status](https://grpc.io/docs/guides/status-codes/) format.

**Authentication error**

```json theme={"system"}
{
  "code": 16,
  "message": "Unauthenticated: invalid or missing API key.",
  "details": []
}
```

**Invalid request**

```json theme={"system"}
{
  "code": 3,
  "message": "Unsupported audio encoding.",
  "details": []
}
```

**Common gRPC status codes**

| **Code** | **Name**             | **Description**                                                   |
| :------- | :------------------- | :---------------------------------------------------------------- |
| `3`      | `INVALID_ARGUMENT`   | Invalid or missing request field (encoding, model ID, audio data) |
| `8`      | `RESOURCE_EXHAUSTED` | Too many concurrent requests (rate limit)                         |
| `16`     | `UNAUTHENTICATED`    | Invalid or missing API key                                        |
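A client can map these codes to names and decide whether a retry makes sense. The sketch below parses an error body of the shape shown above. `describe_error` is a hypothetical helper, and treating only `RESOURCE_EXHAUSTED` as retryable is a design assumption, not documented API behavior.

```python
import json

# Codes and names from the table above.
GRPC_NAMES = {
    3: "INVALID_ARGUMENT",
    8: "RESOURCE_EXHAUSTED",
    16: "UNAUTHENTICATED",
}

# Assumption: rate-limit errors may succeed after a backoff; the others need a fix first.
RETRYABLE_CODES = {8}


def describe_error(payload: str) -> tuple[str, bool]:
    """Return (status name, whether a retry is worth attempting) for an error body."""
    error = json.loads(payload)
    code = error.get("code")
    name = GRPC_NAMES.get(code, f"UNKNOWN_CODE_{code}")
    return name, code in RETRYABLE_CODES
```

With the authentication example above, `describe_error` returns `("UNAUTHENTICATED", False)`: retrying with the same key would fail again, so the key itself needs attention.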

## Best Practices

* **Model choice** — Use `inworld/inworld-stt-1` when you want Voice Profile or Inworld-optimized turn-taking; use Groq for recorded audio over the Sync API, and AssemblyAI or Soniox for streaming with specific latency, accuracy, or language needs.
* **Audio** — Use MP3/OGG\_OPUS for file uploads to reduce size; use LINEAR16 for streaming (required) and when you need the highest quality.
* **Streaming** — For the Inworld model with manual turn-taking, send `endTurn` at each turn boundary and `closeStream` when done.
* **Speech events** — Listen for `speechStarted` and `speechStopped` events in the streaming response to detect when a speaker begins and stops talking. Use these to build custom turn-taking logic or visualize voice activity.
* **Voice Profile** — Set `voiceProfileConfig.enableVoiceProfile` to `true` and optionally adjust `topN` (default: 10) to control how many labels per category are returned.
* Test with sample audio and your target language before production.
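The streaming guidance above (LINEAR16 audio, `endTurn` at each turn boundary, `closeStream` when done) can be sketched as a message sequence. The JSON shapes below are hypothetical: `audioChunk`, `content`, and the empty-object bodies are placeholders, and only the names `endTurn` and `closeStream` come from this page. The 100 ms chunk size is likewise an arbitrary choice for illustration. Consult the WebSocket API reference for the real message schema.

```python
import base64
from typing import Iterator

SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2  # 16-bit LINEAR16, mono
CHUNK_MS = 100        # assumption: illustrative chunk duration
CHUNK_BYTES = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHUNK_MS // 1000  # 3200 bytes


def turn_messages(pcm: bytes, last_turn: bool = True) -> Iterator[dict]:
    """Yield the messages for one speaker turn, in the order they would be sent."""
    for start in range(0, len(pcm), CHUNK_BYTES):
        chunk = pcm[start:start + CHUNK_BYTES]
        # Hypothetical audio message shape.
        yield {"audioChunk": {"content": base64.b64encode(chunk).decode("ascii")}}
    yield {"endTurn": {}}          # marks the turn boundary (manual turn-taking)
    if last_turn:
        yield {"closeStream": {}}  # sent once, after the final turn
```

Each yielded dictionary would be serialized and sent over the WebSocket connection. Passing `last_turn=False` keeps the stream open so that further turns can follow.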

## Troubleshooting

| **Issue**           | **What to check**                                                                                                                                             |
| :------------------ | :------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| No transcript       | API key, audio encoding matches request, valid audio file                                                                                                     |
| `UNAUTHENTICATED`   | `INWORLD_API_KEY` set correctly and not expired in Portal                                                                                                     |
| `INVALID_ARGUMENT`  | `audioEncoding` matches the actual format (LINEAR16 for raw PCM, MP3 for MP3, etc.)                                                                           |
| Poor quality        | Try a higher-accuracy model; use 16 kHz sample rate (8 kHz telephony audio has fewer data points and will produce lower-quality results); ensure clear speech |
| Large file failures | Split or compress (e.g. MP3/OGG\_OPUS); respect upload size limits                                                                                            |
| No Voice Profile    | Ensure `voiceProfileConfig.enableVoiceProfile` is set to `true` in your request; response may also omit it if the selected model does not support it          |

For more help, see the [Inworld Discord community](https://discord.gg/inworld).
