Developer Quickstart

In this quickstart, you’ll send an audio file to the STT API and receive a transcript. It also highlights Inworld STT (inworld/inworld-stt-1), which adds Voice Profile (age, pitch, emotion, vocal style, accent) and configurable turn-taking (automatic or manual).

Make your first STT API request

Create an API key

Create an Inworld account.
In Inworld Portal, generate an API key by going to Settings > API Keys. Copy the Base64 credentials.
Set your API key as an environment variable.

export INWORLD_API_KEY='your-base64-api-key-here'
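Before sending requests, it can help to confirm that the variable is set and looks like Base64. The sketch below is a local sanity check only: the helper name load_api_key is hypothetical, and it assumes the credentials use the standard Base64 alphabet. It does not verify the key against the API.

```python
import base64
import binascii
import os


def load_api_key(env_var: str = "INWORLD_API_KEY") -> str:
    """Return the API key from the environment, or raise with a clear message."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set; export it before running.")
    try:
        # validate=True rejects characters outside the standard Base64 alphabet.
        # Assumption: the Portal credentials are standard (not URL-safe) Base64.
        base64.b64decode(key, validate=True)
    except binascii.Error as exc:
        raise RuntimeError(f"{env_var} does not look like Base64: {exc}") from exc
    return key
```

Calling load_api_key() at the top of a script fails early with a readable error instead of a rejected request later.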

Prepare an audio file

The STT API accepts base64-encoded audio and supports multiple audio formats. Requirements vary by use case:

Use case	Format	Notes
File upload (sync)	LINEAR16, MP3, OGG_OPUS, FLAC, AUTO_DETECT	Sample rate can be auto-detected from file headers when possible
Streaming	LINEAR16 (PCM)	Streaming accepts only LINEAR16; this keeps latency low and preserves audio quality

Recommended settings:

Sample rate: 16,000 Hz (STT performs best at this rate; lower sample rates like 8 kHz contain fewer data points, reducing accuracy)
Bit depth: 16-bit (for LINEAR16)
Channels: Mono (1 channel)

For file uploads (MP3, FLAC, OGG_OPUS, WAV), sampleRateHertz is optional — the API can auto-detect it from the file header.
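To see whether a WAV file matches the recommended settings, you can read its header with Python's standard wave module. This is a minimal sketch: the function names describe_wav and check_recommended are illustrative, and the checks only mirror the recommendations listed above (16,000 Hz, 16-bit, mono).

```python
import io
import wave


def describe_wav(data: bytes) -> dict:
    """Read the WAV header and return sample rate, bit depth, and channel count."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return {
            "sample_rate_hz": wav.getframerate(),
            "bit_depth": wav.getsampwidth() * 8,  # sample width is reported in bytes
            "channels": wav.getnchannels(),
        }


def check_recommended(props: dict) -> list:
    """Return one warning per property that differs from the recommended settings."""
    warnings = []
    if props["sample_rate_hz"] != 16000:
        warnings.append(f"sample rate is {props['sample_rate_hz']} Hz, recommended 16000 Hz")
    if props["bit_depth"] != 16:
        warnings.append(f"bit depth is {props['bit_depth']}, recommended 16")
    if props["channels"] != 1:
        warnings.append(f"{props['channels']} channels, recommended mono (1)")
    return warnings
```

For example, check_recommended(describe_wav(open("input.wav", "rb").read())) returns an empty list when the file already matches the recommendations.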

Send the request

Audio is sent as a JSON payload with base64-encoded audio content. The API returns the complete transcript once processing finishes, along with a Voice Profile when the API returns one.

Create a new file, inworld_stt_quickstart.py or inworld_stt_quickstart.js, and use the code below. The example uses the Inworld model (inworld/inworld-stt-1), which provides transcription plus optional Voice Profile (age, pitch, emotion, vocal style, accent) and configurable turn-taking for streaming.

import requests
import os
import base64

# Sync endpoint
URL = "https://api.inworld.ai/stt/v1/transcribe"

# Use a 16-bit PCM WAV file (16 kHz, mono)
with open("input.wav", "rb") as f:
    audio_content = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "transcribeConfig": {
        "modelId": "inworld/inworld-stt-1",
        "language": "en-US",
        "audioEncoding": "LINEAR16",
        "voiceProfileConfig": {
            "enableVoiceProfile": True,
        },
    },
    "audioData": {"content": audio_content},
}

headers = {
    # os.environ[...] raises KeyError if the variable is unset, instead of sending "Basic None"
    "Authorization": f"Basic {os.environ['INWORLD_API_KEY']}",
    "Content-Type": "application/json",
}

response = requests.post(URL, headers=headers, json=payload)
response.raise_for_status()
result = response.json()

print("Transcript:", result["transcription"]["transcript"])

# Voice Profile (when returned by the API)
if "voiceProfile" in result and result["voiceProfile"]:
    vp = result["voiceProfile"]
    if vp.get("age"):
        print("Age:", vp["age"].get("label"), vp["age"].get("confidence"))
    if vp.get("pitch"):
        print("Pitch:", vp["pitch"].get("label"), vp["pitch"].get("confidence"))

Review the response

The response includes the transcript and usage fields, plus an optional voiceProfile when available.

Response (sync)

Field	Description
`transcription.transcript`	The transcribed text
`transcription.isFinal`	Whether the result is finalized
`transcription.wordTimestamps`	Per-word timing data (when available)
`usage`	Usage metrics for billing
`voiceProfile`	(When returned) Age, pitch, emotion, vocalStyle, accent with `label` and `confidence`. Available with Inworld and supported third-party models
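Because several of these fields are optional, client code is easier to maintain when it reads them defensively. The sketch below uses a hypothetical helper, summarize_response, and rests on two assumptions: wordTimestamps is a list, and each Voice Profile category is an object with label and confidence, as in the quickstart code. The sample values are made up for illustration.

```python
VOICE_PROFILE_CATEGORIES = ("age", "pitch", "emotion", "vocalStyle", "accent")


def summarize_response(result: dict) -> dict:
    """Collect the documented response fields, tolerating any that are missing."""
    transcription = result.get("transcription") or {}
    summary = {
        "transcript": transcription.get("transcript", ""),
        "is_final": transcription.get("isFinal"),
        # Assumption: wordTimestamps, when present, is a list with one entry per word.
        "word_count": len(transcription.get("wordTimestamps") or []),
        "voice_profile": {},
    }
    voice_profile = result.get("voiceProfile") or {}
    for category in VOICE_PROFILE_CATEGORIES:
        entry = voice_profile.get(category)
        if isinstance(entry, dict):
            summary["voice_profile"][category] = (entry.get("label"), entry.get("confidence"))
    return summary


# Hypothetical response with made-up values, for illustration only.
sample = {
    "transcription": {"transcript": "Hello there.", "isFinal": True},
    "voiceProfile": {"age": {"label": "adult", "confidence": 0.9}},
}
print(summarize_response(sample))
```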

Configuration parameters

transcribeConfig

Field	Type	Required	Description
`modelId`	string	Yes	STT model ID. Use `inworld/inworld-stt-1` for WebSocket and HTTP
`language`	string	No	BCP-47 language code (e.g. `en-US`). If omitted, the model may auto-detect. See Supported Languages for the full list
`audioEncoding`	string	Yes	One of: `LINEAR16`, `MP3`, `OGG_OPUS`, `FLAC`, `AUTO_DETECT`. For streaming, use `LINEAR16` only
`sampleRateHertz`	integer	No	Sample rate in Hz. Default 16000. Can be omitted for formats with headers (MP3, FLAC, OGG_OPUS, WAV)
`numberOfChannels`	integer	No	Channel count. Default 1
`voiceProfileConfig`	object	No	Voice Profile configuration. See below

voiceProfileConfig

Field	Type	Required	Description
`enableVoiceProfile`	bool	Yes	Set to `true` to enable Voice Profile analysis
`topN`	integer	No	Number of top labels per category to return. Default: 10

audioData

Field	Type	Required	Description
`content`	string	Yes	Base64-encoded audio bytes
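The three objects above combine into a single request body. As a rough sketch, a small builder can assemble it and include optional fields only when they are set. The function name build_payload and its argument names are illustrative; the JSON field names and allowed encodings come from the tables above.

```python
import base64

SYNC_ENCODINGS = {"LINEAR16", "MP3", "OGG_OPUS", "FLAC", "AUTO_DETECT"}


def build_payload(audio, audio_encoding="AUTO_DETECT", language=None,
                  sample_rate_hertz=None, enable_voice_profile=False, top_n=None):
    """Build the JSON body for a sync transcription request."""
    if audio_encoding not in SYNC_ENCODINGS:
        raise ValueError(f"unsupported audioEncoding: {audio_encoding}")
    config = {"modelId": "inworld/inworld-stt-1", "audioEncoding": audio_encoding}
    if language:
        config["language"] = language
    if sample_rate_hertz:
        config["sampleRateHertz"] = sample_rate_hertz
    if enable_voice_profile:
        voice_profile_config = {"enableVoiceProfile": True}
        if top_n is not None:
            voice_profile_config["topN"] = top_n
        config["voiceProfileConfig"] = voice_profile_config
    return {
        "transcribeConfig": config,
        # audioData.content carries the base64-encoded audio bytes.
        "audioData": {"content": base64.b64encode(audio).decode("ascii")},
    }
```

For instance, build_payload(audio_bytes, "LINEAR16", language="en-US", enable_voice_profile=True) produces the same structure as the payload in the quickstart script.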

Run the code

pip install requests  # if needed
python inworld_stt_quickstart.py

Example output:

Transcript: Hey, I just wanted to check in on the delivery status for my order.

Streaming (WebSocket)

For real-time microphone or live audio:

First message must contain transcribeConfig (same fields as above, including voiceProfileConfig to enable Voice Profile).
Later messages send audioChunk with base64-encoded LINEAR16 (PCM) audio only.
Turn and stream end:
- To signal end of a speaker turn, send endTurn.
- Send closeStream when the client is done sending audio (required for WebSocket; gRPC clients can just close the send side).

Example first WebSocket message:

{
  "transcribeConfig": {
    "modelId": "inworld/inworld-stt-1",
    "audioEncoding": "LINEAR16"
  }
}
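The message order described above (config first, then audio chunks, then endTurn and closeStream) can be sketched as a generator that yields JSON strings ready to send over a WebSocket. Several details here are assumptions rather than documented shapes: the audioChunk payload is written as {"content": ...} by analogy with audioData, endTurn and closeStream are sent as empty objects, and the 3,200-byte chunk size (100 ms of 16 kHz, 16-bit mono PCM) is an arbitrary choice. Check the API Reference for the exact message schema.

```python
import base64
import json
from typing import Iterator


def stream_messages(pcm: bytes, chunk_bytes: int = 3200) -> Iterator[str]:
    """Yield the WebSocket messages for one turn of LINEAR16 audio, in order."""
    # 1. The first message carries transcribeConfig.
    yield json.dumps({
        "transcribeConfig": {
            "modelId": "inworld/inworld-stt-1",
            "audioEncoding": "LINEAR16",
        }
    })
    # 2. Later messages carry base64-encoded PCM.
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        # ASSUMPTION: audioChunk wraps the audio in a "content" field.
        yield json.dumps({"audioChunk": {"content": base64.b64encode(chunk).decode("ascii")}})
    # 3. Signal the end of the speaker turn, then the end of the stream.
    # ASSUMPTION: both control messages are empty objects.
    yield json.dumps({"endTurn": {}})
    yield json.dumps({"closeStream": {}})
```

Keeping message construction separate from the socket code makes the sequence easy to inspect or unit-test before connecting to the live endpoint.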

Streaming endpoint (WebSocket): wss://api.inworld.ai/stt/v1/transcribe:streamBidirectional

Responses stream back as Transcription (interim and final), optional voiceProfile, speech events (speechStarted when voice activity is detected, speechStopped when silence is detected after speech), and finally Usage when the stream is closed.
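Since interim results are superseded by later ones, a client usually keeps finalized text and shows only the latest interim result after it. The sketch below illustrates that bookkeeping with a hypothetical TranscriptAccumulator class. It assumes each streamed transcription message has been parsed into a dict with the same transcription.transcript and transcription.isFinal fields as the sync response; other message types are ignored.

```python
class TranscriptAccumulator:
    """Keep finalized segments and the most recent interim result."""

    def __init__(self):
        self.final_segments = []
        self.interim = ""

    def handle(self, message: dict) -> None:
        # ASSUMPTION: streamed transcription messages use the "transcription" key.
        transcription = message.get("transcription")
        if not transcription:
            return  # speech events, voiceProfile, and usage are ignored here
        text = transcription.get("transcript", "")
        if transcription.get("isFinal"):
            self.final_segments.append(text)
            self.interim = ""
        else:
            # A newer interim result replaces the previous one.
            self.interim = text

    def text(self) -> str:
        parts = self.final_segments + ([self.interim] if self.interim else [])
        return " ".join(parts)
```

Call handle() for each decoded message and text() whenever the display needs refreshing.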

Next Steps

STT Overview

API Reference
