# Developer Quickstart

> Make your first Realtime STT API request

In this quickstart, you'll send an audio file to the STT API and receive a transcript. The examples use Inworld STT (`inworld/inworld-stt-1`), which can also return a Voice Profile (age, pitch, emotion, vocal style, accent) and supports configurable turn-taking (automatic or manual).

## Make your first STT API request

<Steps titleSize="h3">
  <Step title="Create an API key">
    Create an [Inworld account](https://platform.inworld.ai/signup).

    In [Inworld Portal](https://platform.inworld.ai/), generate an API key by going to [**Settings** > **API Keys**](https://platform.inworld.ai/api-keys). Copy the Base64 credentials.

    Set your API key as an environment variable.

    <CodeGroup>
      ```shell macOS and Linux theme={"system"}
      export INWORLD_API_KEY='your-base64-api-key-here'
      ```

      ```shell Windows theme={"system"}
      setx INWORLD_API_KEY "your-base64-api-key-here"
      ```
    </CodeGroup>
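
    To catch a missing or mistyped key before making a request, a small check like the one below can help. It is a sketch rather than part of the Inworld API: `build_auth_header` is a hypothetical helper, and it assumes the credentials use the standard Base64 alphabet.

    ```python
    import base64
    import binascii


    def build_auth_header(api_key: str) -> dict:
        """Return the Basic auth header, rejecting empty or non-Base64 keys."""
        key = api_key.strip()
        if not key:
            raise ValueError("INWORLD_API_KEY is empty or not set")
        try:
            # validate=True rejects characters outside the Base64 alphabet.
            # ASSUMPTION: the portal issues standard (not URL-safe) Base64.
            base64.b64decode(key, validate=True)
        except binascii.Error as exc:
            raise ValueError("INWORLD_API_KEY does not look like Base64") from exc
        return {"Authorization": f"Basic {key}"}
    ```

    In a script, this could be called as `build_auth_header(os.getenv("INWORLD_API_KEY", ""))` before sending the request.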
  </Step>

  <Step title="Prepare an audio file">
    The STT API accepts audio in several formats (e.g. MP3, OGG\_OPUS, FLAC, LINEAR16). Audio bytes are sent in the request payload as a base64-encoded string — base64 is the transport encoding, not the audio format. Requirements vary by use case:

    | **Use case**       | **Format**                                   | **Notes**                                                                                |
    | :----------------- | :------------------------------------------- | :--------------------------------------------------------------------------------------- |
    | File upload (sync) | LINEAR16, MP3, OGG\_OPUS, FLAC, AUTO\_DETECT | Sample rate can be auto-detected from file headers when possible                         |
    | Streaming          | LINEAR16 (PCM)                               | Only LINEAR16 is supported for streaming, to minimize latency and preserve quality       |

    **Recommended settings:**

    * Sample rate: 16,000 Hz (STT performs best at this rate; lower rates such as 8 kHz capture less high-frequency speech detail, which reduces accuracy)
    * Bit depth: 16-bit (for LINEAR16)
    * Channels: Mono (1 channel)

    For file uploads (MP3, FLAC, OGG\_OPUS, WAV), `sampleRateHertz` is optional — the API can auto-detect it from the file header.
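
    For PCM WAV files, the recommended settings above can be checked locally before uploading. The sketch below uses Python's standard `wave` module; `check_wav` is a hypothetical helper name, and it only handles uncompressed PCM WAV input.

    ```python
    import wave


    def check_wav(path: str) -> list:
        """Return warnings for settings that differ from the recommendations."""
        with wave.open(path, "rb") as wav:
            rate = wav.getframerate()
            bits = wav.getsampwidth() * 8  # sample width is reported in bytes
            channels = wav.getnchannels()

        warnings = []
        if rate != 16000:
            warnings.append(f"sample rate is {rate} Hz (recommended: 16000 Hz)")
        if bits != 16:
            warnings.append(f"bit depth is {bits}-bit (recommended: 16-bit)")
        if channels != 1:
            warnings.append(f"{channels} channels (recommended: mono)")
        return warnings
    ```

    An empty list means the file already matches the recommended 16 kHz, 16-bit, mono settings.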

    <Note>
      Sync transcription accepts audio files up to **\~16 MB**. The actual duration depends on the encoding (e.g., \~18 minutes of MP3 or \~8 minutes of 16 kHz 16-bit WAV). For larger files, split them into chunks or use the WebSocket streaming endpoint.
    </Note>
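
    The duration figures in the note follow from simple arithmetic on uncompressed PCM: bytes per second equals sample rate times bytes per sample times channels. The sketch below works this out; the helper names are hypothetical, and it treats "\~16 MB" as 16,000,000 bytes, which is an assumption. Whether the limit applies before or after Base64 encoding (which grows the payload by about one third) is not stated here.

    ```python
    def pcm_bytes_per_second(sample_rate_hz: int = 16000,
                             bit_depth: int = 16,
                             channels: int = 1) -> int:
        """Raw PCM data rate in bytes per second."""
        return sample_rate_hz * (bit_depth // 8) * channels


    def max_duration_seconds(limit_bytes: int,
                             sample_rate_hz: int = 16000,
                             bit_depth: int = 16,
                             channels: int = 1) -> float:
        """Longest PCM clip (in seconds) that fits within limit_bytes."""
        return limit_bytes / pcm_bytes_per_second(sample_rate_hz, bit_depth, channels)


    # ASSUMPTION: "~16 MB" interpreted as 16,000,000 bytes.
    LIMIT_BYTES = 16_000_000
    seconds = max_duration_seconds(LIMIT_BYTES)  # 16 kHz, 16-bit, mono
    ```

    With these defaults, 16 kHz 16-bit mono audio uses 32,000 bytes per second, so 16,000,000 bytes holds 500 seconds, or a little over 8 minutes, which agrees with the estimate in the note.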
  </Step>

  <Step title="Send the request">
    Audio is sent as a JSON payload with base64-encoded audio content. The sync endpoint returns the complete transcript once processing finishes, along with a Voice Profile if one was requested and is available.

    Create a new file, `inworld_stt_quickstart.py` or `inworld_stt_quickstart.js`, and paste in the code below. Both examples request transcription from `inworld/inworld-stt-1` with Voice Profile enabled. The JavaScript example uses `import` and top-level `await`, so run it as an ES module (for example, by setting `"type": "module"` in `package.json`).

    <CodeGroup>
      ```python Python theme={"system"}
      import requests
      import os
      import base64

      # Sync endpoint
      URL = "https://api.inworld.ai/stt/v1/transcribe"

      # Use a 16-bit PCM WAV file (16 kHz, mono)
      with open("input.wav", "rb") as f:
          audio_content = base64.b64encode(f.read()).decode("utf-8")

      payload = {
          "transcribeConfig": {
              "modelId": "inworld/inworld-stt-1",
              "language": "en",
              "audioEncoding": "LINEAR16",
              "voiceProfileConfig": {
                  "enableVoiceProfile": True,
              },
          },
          "audioData": {"content": audio_content},
      }

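      # Stop early if the key is missing instead of sending "Basic None"
      api_key = os.getenv("INWORLD_API_KEY")
      if not api_key:
          raise SystemExit("Set the INWORLD_API_KEY environment variable first")

      headers = {
          "Authorization": f"Basic {api_key}",
          "Content-Type": "application/json",
      }

      # A timeout keeps the script from hanging indefinitely (60 s is an arbitrary choice)
      response = requests.post(URL, headers=headers, json=payload, timeout=60)
      response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body
      result = response.json()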

      print("Transcript:", result["transcription"]["transcript"])

      # Voice Profile (when returned by the API)
      vp = result.get("voiceProfile")
      if vp:
          if vp.get("age"):
              print("Age:", vp["age"].get("label"), vp["age"].get("confidence"))
          if vp.get("pitch"):
              print("Pitch:", vp["pitch"].get("label"), vp["pitch"].get("confidence"))
      ```

      ```javascript JavaScript theme={"system"}
      import fs from "fs";
      import fetch from "node-fetch";

      const URL = "https://api.inworld.ai/stt/v1/transcribe";
      // Use a 16-bit PCM WAV file (16 kHz, mono)
      const audioContent = fs.readFileSync("input.wav").toString("base64");

      const payload = {
        transcribeConfig: {
          modelId: "inworld/inworld-stt-1",
          language: "en",
          audioEncoding: "LINEAR16",
          voiceProfileConfig: {
            enableVoiceProfile: true,
          }
        },
        audioData: { content: audioContent }
      };

      const response = await fetch(URL, {
        method: "POST",
        headers: {
          Authorization: `Basic ${process.env.INWORLD_API_KEY}`,
          "Content-Type": "application/json"
        },
        body: JSON.stringify(payload)
      });

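      // Surface HTTP errors instead of reading fields from an error body
      if (!response.ok) {
        throw new Error(`STT request failed (${response.status}): ${await response.text()}`);
      }

      const result = await response.json();
      console.log("Transcript:", result.transcription.transcript);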

      if (result.voiceProfile) {
        const vp = result.voiceProfile;
        if (vp.age) console.log("Age:", vp.age.label, vp.age.confidence);
        if (vp.pitch) console.log("Pitch:", vp.pitch.label, vp.pitch.confidence);
      }
      ```
    </CodeGroup>
  </Step>

  <Step title="Review the response">
    The response includes the transcript and usage fields, plus optional `voiceProfile` when available.

    **Response (sync)**

    | Field                          | Description                                                                                                                                    |
    | :----------------------------- | :--------------------------------------------------------------------------------------------------------------------------------------------- |
    | `transcription.transcript`     | The transcribed text                                                                                                                           |
    | `transcription.isFinal`        | Whether the result is finalized                                                                                                                |
    | `transcription.wordTimestamps` | Per-word timing data (when available)                                                                                                          |
    | `usage`                        | Usage metrics for billing                                                                                                                      |
    | `voiceProfile`                 | (When returned) Age, pitch, emotion, vocalStyle, accent with `label` and `confidence`. Available with Inworld and supported third-party models |
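
    The quickstart code prints only age and pitch. A sketch that flattens every returned Voice Profile category might look like the following. `summarize` is a hypothetical helper, and it assumes each category is a single object with `label` and `confidence`, as in the quickstart code; categories in any other shape are skipped.

    ```python
    def summarize(result: dict) -> dict:
        """Collect the transcript plus one (label, confidence) pair per category."""
        transcription = result.get("transcription") or {}
        summary = {
            "transcript": transcription.get("transcript", ""),
            "is_final": transcription.get("isFinal"),
        }
        # ASSUMPTION: each voiceProfile category is {"label": ..., "confidence": ...}
        for category, value in (result.get("voiceProfile") or {}).items():
            if isinstance(value, dict) and "label" in value:
                summary[category] = (value["label"], value.get("confidence"))
        return summary
    ```

    Because `voiceProfile` is optional, the helper returns only the transcript fields when it is absent.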
  </Step>

  <Step title="Configuration parameters">
    **transcribeConfig**

    | Field                | Type    | Required | Description                                                                                                                                                                                                              |
    | :------------------- | :------ | :------- | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
    | `modelId`            | string  | Yes      | STT model ID. Use `inworld/inworld-stt-1` for WebSocket and HTTP                                                                                                                                                         |
    | `language`           | string  | No       | ISO 639 language code (e.g. `en`, `ja`). BCP-47 codes like `en-US` are also accepted and converted to the base language. If omitted, the model may auto-detect. See [Language Support](/stt/languages) for the full list |
    | `audioEncoding`      | string  | Yes      | One of: `LINEAR16`, `MP3`, `OGG_OPUS`, `FLAC`, `AUTO_DETECT`. For streaming, use `LINEAR16` only                                                                                                                         |
    | `sampleRateHertz`    | integer | No       | Sample rate in Hz. Default 16000. Can be omitted for formats with headers (MP3, FLAC, OGG\_OPUS, WAV)                                                                                                                    |
    | `numberOfChannels`   | integer | No       | Channel count. Default 1                                                                                                                                                                                                 |
    | `voiceProfileConfig` | object  | No       | Voice Profile configuration. See below                                                                                                                                                                                   |

    **voiceProfileConfig**

    | Field                | Type    | Required | Description                                              |
    | :------------------- | :------ | :------- | :------------------------------------------------------- |
    | `enableVoiceProfile` | boolean | Yes      | Set to `true` to enable Voice Profile analysis           |
    | `topN`               | integer | No       | Number of top labels per category to return. Default: 10 |

    **audioData**

    | Field     | Type   | Required | Description                |
    | :-------- | :----- | :------- | :------------------------- |
    | `content` | string | Yes      | Base64-encoded audio bytes |
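
    The tables above map directly onto the request body. As an illustration, the sketch below assembles a payload from raw audio bytes and leaves out optional fields that were not supplied. `build_payload` is a hypothetical helper, not an SDK function.

    ```python
    import base64

    ALLOWED_ENCODINGS = {"LINEAR16", "MP3", "OGG_OPUS", "FLAC", "AUTO_DETECT"}


    def build_payload(audio_bytes: bytes,
                      audio_encoding: str = "AUTO_DETECT",
                      language=None,
                      sample_rate_hertz=None,
                      enable_voice_profile: bool = False,
                      top_n=None) -> dict:
        """Build a sync transcription request body from the documented fields."""
        if audio_encoding not in ALLOWED_ENCODINGS:
            raise ValueError(f"unsupported audioEncoding: {audio_encoding}")

        config = {"modelId": "inworld/inworld-stt-1", "audioEncoding": audio_encoding}
        if language is not None:
            config["language"] = language
        if sample_rate_hertz is not None:
            config["sampleRateHertz"] = sample_rate_hertz
        if enable_voice_profile:
            profile = {"enableVoiceProfile": True}
            if top_n is not None:
                profile["topN"] = top_n
            config["voiceProfileConfig"] = profile

        content = base64.b64encode(audio_bytes).decode("ascii")
        return {"transcribeConfig": config, "audioData": {"content": content}}
    ```

    For an MP3 upload, for example, `build_payload(mp3_bytes, "MP3", language="en")` produces a body with no `sampleRateHertz`, relying on the header-based detection described above.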
  </Step>

  <Step title="Run the code">
    <CodeGroup>
      ```shell Python theme={"system"}
      pip install requests  # if needed
      python inworld_stt_quickstart.py
      ```

      ```shell JavaScript theme={"system"}
      npm install node-fetch  # if needed
      node inworld_stt_quickstart.js
      ```
    </CodeGroup>

    **Example output:**

    ```
    Transcript: Hey, I just wanted to check in on the delivery status for my order.
    ```
  </Step>
</Steps>

## Streaming (WebSocket)

For real-time microphone input or other live audio, stream over WebSocket:

1. **First message** must contain `transcribeConfig` (same fields as above, including `voiceProfileConfig` to enable Voice Profile).
2. **Later messages** send `audioChunk` with base64-encoded LINEAR16 (PCM) audio only.
3. **Turn and stream end:**
   * To signal end of a speaker turn, send `endTurn`.
   * Send `closeStream` when the client is done sending audio (required for WebSocket; gRPC clients can just close the send side).

**Example first WebSocket message:**

```json theme={"system"}
{
  "transcribeConfig": {
    "modelId": "inworld/inworld-stt-1",
    "audioEncoding": "LINEAR16"
  }
}
```
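
The message order described above can be sketched as a generator that yields each JSON message in sequence. This is illustrative only: the exact shape of `audioChunk`, `endTurn`, and `closeStream` is not specified on this page, so the shapes below are assumptions to confirm against the API reference. The 100 ms chunk duration and the `stream_messages` name are also arbitrary choices.

```python
import base64
import json

SAMPLE_RATE_HZ = 16000
BYTES_PER_SAMPLE = 2  # 16-bit LINEAR16, mono


def stream_messages(pcm: bytes, chunk_ms: int = 100):
    """Yield JSON strings: config first, then audio chunks, endTurn, closeStream."""
    yield json.dumps({
        "transcribeConfig": {
            "modelId": "inworld/inworld-stt-1",
            "audioEncoding": "LINEAR16",
        }
    })

    # 16000 samples/s * 2 bytes * 0.1 s = 3200 bytes per 100 ms chunk
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        # ASSUMPTION: audioChunk carries a base64 "content" field like audioData
        yield json.dumps(
            {"audioChunk": {"content": base64.b64encode(chunk).decode("ascii")}}
        )

    # ASSUMPTION: endTurn and closeStream are sent as empty objects
    yield json.dumps({"endTurn": {}})
    yield json.dumps({"closeStream": {}})
```

Each yielded string would be passed to the `send` method of whichever WebSocket client you use.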

Responses stream back as Transcription (interim and final), optional `voiceProfile`, speech events (`speechStarted` when voice activity is detected, `speechStopped` when silence is detected after speech), and finally Usage when the stream is closed.
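
A client typically branches on which kind of response arrived. The sketch below assumes each response is a JSON object keyed by `transcription`, `voiceProfile`, `speechStarted`, `speechStopped`, or `usage`; that envelope is an assumption, since this page does not show the wire format, and `describe_event` is a hypothetical helper.

```python
import json


def describe_event(raw: str) -> str:
    """Classify one streamed response message (assumed JSON envelope)."""
    message = json.loads(raw)
    if "transcription" in message:
        transcription = message["transcription"]
        kind = "final" if transcription.get("isFinal") else "interim"
        return f"{kind}: {transcription.get('transcript', '')}"
    # ASSUMPTION: other response types appear as top-level keys
    for key in ("speechStarted", "speechStopped", "voiceProfile", "usage"):
        if key in message:
            return key
    return "unknown"
```

Interim results may be revised by later messages, so a UI would usually display them provisionally and commit text only when a final result arrives.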

**Streaming endpoint (WebSocket):** `wss://api.inworld.ai/stt/v1/transcribe:streamBidirectional`

## Next Steps

<CardGroup cols={2}>
  <Card title="STT Overview" icon="microphone" href="/stt/overview">
    Learn about supported providers, audio formats, and endpoints.
  </Card>

  <Card title="API Reference" icon="code" href="/api-reference/sttAPI/speechtotext/transcribe">
    View the complete API specification.
  </Card>
</CardGroup>