{
  "modelId": "assemblyai/universal-streaming-multilingual",
  "audioEncoding": "LINEAR16",
  "sampleRateHertz": 16000,
  "language": "en-US"
}

{
  "content": "<YOUR_AUDIO>"
}

{}

{}

{
  "transcript": "Hello, this is a test transcription.",
  "isFinal": true,
  "wordTimestamps": []
}

{
  "transcribedAudioMs": 123,
  "modelId": "<string>"
}

{
  "startTimeMs": 1250,
  "confidence": 0.95
}

{
  "silenceDurationMs": 750
}

Transcribe audio (WebSocket)
Bidirectional streaming API for real-time speech-to-text transcription over WebSocket.
This method listens for streaming audio input and returns recognized text chunks one by one as soon as they are ready. Audio chunks are expected to be part of a single voice input. It is suitable for streaming live conversations, microphone input, or other streaming audio sources.
To use the API:
- Send a transcribeConfig message first to configure the session (model, language, audio encoding, etc.).
- Stream audioChunk messages containing raw audio bytes.
- Receive transcription results as they become available, including both interim (partial) and final results.
- Listen for speechStarted and speechStopped events to detect voice activity changes.
- Optionally send endTurn to signal the end of a speaker's turn.
- Send closeStream when done.
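The ordering in the list above can be sketched as a small function that builds the outgoing messages in sequence. This is a minimal illustration, not the official client. It assumes each payload is wrapped under a key named after its message type and that audio bytes are base64-encoded into `content`. This page confirms neither, and the helper name `build_session_messages` is hypothetical. The config values are taken from the example on this page.

```python
import base64
import json
from typing import Iterable, List


def build_session_messages(chunks: Iterable[bytes], end_turn: bool = False) -> List[str]:
    """Return JSON text frames in the order the list above describes.

    ASSUMPTIONS (not confirmed by this page): each payload is wrapped under a
    key named after its message type, and audio bytes are base64-encoded.
    """
    messages = [
        # 1. The config message must come first.
        {
            "transcribeConfig": {
                "modelId": "assemblyai/universal-streaming-multilingual",
                "audioEncoding": "LINEAR16",
                "sampleRateHertz": 16000,
                "language": "en-US",
            }
        }
    ]
    # 2. Audio chunks follow the config.
    for chunk in chunks:
        encoded = base64.b64encode(chunk).decode("ascii")
        messages.append({"audioChunk": {"content": encoded}})
    # 3. Optionally mark the end of the speaker's turn.
    if end_turn:
        messages.append({"endTurn": {}})
    # 4. Finish by signalling that no more audio will be sent.
    messages.append({"closeStream": {}})
    return [json.dumps(m) for m in messages]


frames = build_session_messages([b"\x00\x01" * 4, b"\x02\x03" * 4], end_turn=True)
```

Keeping the message construction separate from the socket code makes the ordering easy to check before wiring it to a WebSocket library of your choice.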
Your authentication credentials. For Basic authentication, please populate Basic $INWORLD_API_KEY
Configure the transcription session. Must be the first message sent. Contains model selection, audio format settings, and optional feature configurations.
Send a chunk of audio data for transcription. Must be sent after the initial transcribe config message.
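As a rough illustration of preparing audio chunks: with the LINEAR16 encoding at 16000 Hz from the example config, each sample is 2 bytes, so 100 ms of mono audio is 16000 × 2 × 0.1 = 3200 bytes. The sketch below splits a PCM buffer accordingly. The 100 ms chunk duration, the mono layout, and the helper name `split_pcm` are assumptions made for illustration. This page does not state a required chunk size.

```python
from typing import Iterator


def split_pcm(pcm: bytes, sample_rate_hz: int = 16000, chunk_ms: int = 100) -> Iterator[bytes]:
    """Yield fixed-duration slices of 16-bit mono PCM audio.

    ASSUMPTIONS: mono audio and a 100 ms chunk size; neither is stated on this page.
    """
    bytes_per_sample = 2  # LINEAR16 means 16-bit samples
    chunk_bytes = sample_rate_hz * bytes_per_sample * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        # The last slice may be shorter than chunk_bytes.
        yield pcm[start:start + chunk_bytes]


one_second_of_silence = bytes(16000 * 2)
chunks = list(split_pcm(one_second_of_silence))
```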
Signal the end of a speaker's turn. Some providers do not support manual turn-taking; for those providers, sending this message will have no effect.
Signal that the client is done sending audio data. Required for HTTP/WebSocket clients since there is no equivalent to gRPC stream close.
Transcription result streamed back as audio is processed. May be an interim (partial) result or a final result depending on the isFinal field.
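One possible way to consume these results is to keep final segments and overwrite the interim text as it is revised. The sketch below works on the transcription payload shape shown in the example on this page (`transcript`, `isFinal`). It assumes that each interim result supersedes the previous one, which this page does not state, and `apply_transcription` is a hypothetical helper name.

```python
from typing import Dict, List, Tuple


def apply_transcription(finals: List[str], result: Dict) -> Tuple[List[str], str]:
    """Fold one transcription payload into (final segments, current interim text).

    ASSUMPTION: an interim result replaces the previous interim text.
    """
    text = result.get("transcript", "")
    if result.get("isFinal", False):
        # A final result is committed and the interim text is cleared.
        return finals + [text], ""
    return finals, text


finals: List[str] = []
interim = ""
for payload in [
    {"transcript": "Hello, this", "isFinal": False, "wordTimestamps": []},
    {"transcript": "Hello, this is a test transcription.", "isFinal": True, "wordTimestamps": []},
]:
    finals, interim = apply_transcription(finals, payload)
```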
Usage metrics for billing and monitoring purposes. Coming soon — this field is not yet populated.
Signal to indicate the start of a speaker's speech. Sent when voice activity is detected in the audio stream.
Signal raised when STT detects silence after speech has stopped. Useful for tracking pauses and implementing custom turn-taking logic.
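A custom turn-taking rule built on this signal could be as simple as comparing the reported silence against a threshold. The following is a minimal sketch using the `silenceDurationMs` field from the example payload. The 700 ms default and the function name `should_end_turn` are arbitrary choices for illustration, not values recommended by this page.

```python
def should_end_turn(event: dict, threshold_ms: int = 700) -> bool:
    """Return True when the reported silence is long enough to treat the turn as over.

    ASSUMPTION: threshold_ms is an application-level choice, not an API parameter.
    """
    return event.get("silenceDurationMs", 0) >= threshold_ms


long_pause = should_end_turn({"silenceDurationMs": 750})
short_pause = should_end_turn({"silenceDurationMs": 300})
```

When the rule fires, a client might respond by sending endTurn, keeping in mind that some providers ignore that message.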