# Transcribe audio (WebSocket)

> Bidirectional streaming API for real-time speech-to-text transcription over WebSocket.

This method accepts streaming audio input and returns recognized text chunks one by one, as soon as they are ready. All audio chunks are expected to belong to a single voice input. The API is suitable for live conversations, microphone input, and other streaming audio sources.

To use the API:
- Send a `transcribeConfig` message first to configure the session (model, language, audio encoding, etc.).
- Stream `audioChunk` messages containing raw audio bytes.
- Receive `transcription` results as they become available, including both interim (partial) and final results.
- Listen for `speechStarted` and `speechStopped` events to detect voice activity changes.
- Optionally send `endTurn` to signal the end of a speaker's turn.
- Send `closeStream` when done.
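
The sequence above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a reference client: it assumes each JSON message is wrapped in an envelope keyed by its message name (for example `{"transcribeConfig": {...}}`), that `content` carries base64-encoded audio (suggested by the schema's `format: byte`), and that server events arrive in the same envelope shape. The helper names `build_client_messages` and `handle_server_message` are hypothetical, and the actual WebSocket transport is left out.

```python
import base64
import json


def build_client_messages(pcm_chunks, model_id="inworld/inworld-stt-1",
                          sample_rate_hz=16000):
    """Return the JSON text frames for one session, in the order they are sent."""
    # 1. The configuration message must come first.
    messages = [{
        "transcribeConfig": {
            "modelId": model_id,
            "audioEncoding": "LINEAR16",
            "sampleRateHertz": sample_rate_hz,
            "numberOfChannels": 1,
        }
    }]
    # 2. One audioChunk per slice of audio (base64 encoding is an assumption).
    for chunk in pcm_chunks:
        messages.append(
            {"audioChunk": {"content": base64.b64encode(chunk).decode("ascii")}}
        )
    # 3. Optionally mark the end of the speaker's turn, then
    # 4. tell the server that no more audio will follow.
    messages.append({"endTurn": {}})
    messages.append({"closeStream": {}})
    return [json.dumps(m) for m in messages]


def handle_server_message(raw, final_transcripts):
    """Dispatch one server frame and return the name of the event handled."""
    message = json.loads(raw)
    if "transcription" in message:
        result = message["transcription"]
        # Interim results may still change, so keep only finalized text.
        if result.get("isFinal"):
            final_transcripts.append(result.get("transcript", ""))
        return "transcription"
    for event in ("speechStarted", "speechStopped"):
        if event in message:
            return event
    return "unknown"
```

In a real client, each string returned by `build_client_messages` would be sent as one text frame, and every incoming frame would be passed to `handle_server_message`. If the service expects a different envelope or a binary audio framing, only the two helpers need to change.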


## AsyncAPI

````yaml api-reference/sttAPI/transcribe-stream-websocket.json sttStream
id: sttStream
title: Stt stream
description: Primary WebSocket channel for bidirectional speech-to-text streaming.
servers:
  - id: production
    protocol: wss
    host: api.inworld.ai
    bindings: []
    variables: []
address: /stt/v1/transcribe:streamBidirectional
parameters: []
bindings: []
operations:
  - &ref_1
    id: sendRequest
    title: Send request
    type: receive
    messages:
      - &ref_3
        id: transcribeConfig
        payload:
          - name: transcribeConfig
            description: >-
              Configure the transcription session. Must be the first message
              sent. Contains model selection, audio format settings, and
              optional feature configurations.
            type: object
            properties:
              - name: modelId
                type: string
                description: >-
                  The identifier of the model to use for transcription. Format:
                  "{provider}/{model-name}".


                  Available models:

                  - `inworld/inworld-stt-1` — Inworld first-party

                  - `assemblyai/universal-streaming-multilingual` — AssemblyAI
                  multilingual

                  - `assemblyai/universal-streaming-english` — AssemblyAI
                  English

                  - `assemblyai/u3-rt-pro` — AssemblyAI high-accuracy

                  - `assemblyai/whisper-rt` — AssemblyAI Whisper real-time

                  - `soniox/stt-rt-v4` — Soniox real-time

                  - `soniox/stt-rt-v5` — Soniox real-time (latest generation)

                  - `deepgram/flux-general-en` — Deepgram Flux English
                  conversational

                  - `deepgram/flux-general-multi` — Deepgram Flux multilingual
                  conversational


                  See [STT Introduction](/stt/overview) for the full model
                  catalogue.
                required: true
              - name: audioEncoding
                type: string
                description: |-
                  Supported audio encoding formats.

                   - `AUDIO_ENCODING_UNSPECIFIED`: Not specified. Will return an error.
                   - `AUTO_DETECT`: Automatically detect audio encoding from the audio header.
                   - `LINEAR16`: Uncompressed 16-bit signed little-endian samples (Linear PCM).
                   - `MP3`: MP3 audio. Not supported for streaming transcription.
                   - `OGG_OPUS`: Opus encoded audio wrapped in an OGG container. Not supported for streaming transcription.
                   - `FLAC`: FLAC encoded audio. Lossless format. Not supported for streaming transcription.
                enumValues:
                  - AUDIO_ENCODING_UNSPECIFIED
                  - AUTO_DETECT
                  - LINEAR16
                  - MP3
                  - OGG_OPUS
                  - FLAC
                required: true
              - name: language
                type: string
                description: >-
                  Language hint in ISO 639 format (e.g., "en", "ja"). Biases the
                  model toward the specified language during automatic language
                  detection. BCP-47 codes (e.g., "en-US") are also accepted and
                  converted to the base language code. For the Inworld
                  first-party model (`inworld/inworld-stt-1`), the hint
                  additionally constrains the output script for English,
                  Chinese, Cantonese, Japanese, Korean, Russian, and Hindi (e.g.
                  selecting `en` keeps output in Latin script). See [Language
                  Support](/stt/languages) for the full list of supported
                  languages.
                required: false
              - name: sampleRateHertz
                type: integer
                description: >-
                  Sample rate of the audio data in Hertz. Required when the
                  sample rate cannot be inferred from the audio header (e.g.,
                  raw PCM streams). Default: 16000.
                required: false
              - name: numberOfChannels
                type: integer
                description: >-
                  Number of channels in the audio data. Required when the number
                  of channels cannot be inferred from the audio header (e.g.,
                  raw PCM streams). Default: 1.
                required: false
              - name: inactivityTimeoutSeconds
                type: integer
                description: >-
                  Inactivity timeout in seconds. If the client is silent for
                  this duration, the transcription will be stopped.
                required: false
              - name: endOfTurnConfidenceThreshold
                type: number
                description: >-
                  Confidence threshold for end-of-turn prediction. Higher values
                  reduce false positives. Range: [0.0, 1.0]. Default: 0.5. See
                  the [Turn Detection guide](/stt/turn-detection) for tuning
                  guidance.
                required: false
              - name: prompts
                type: array
                description: >-
                  Custom vocabulary / key terms. An array of context strings
                  (names, jargon, acronyms) that bias the model toward
                  recognizing these terms. This is a soft bias that helps with
                  ambiguous or uncommon words; it is not a hard keyword lock and
                  does not force exact output. Supported across models — the
                  unified field maps to each provider's mechanism (Groq
                  `prompt`, AssemblyAI `keyterms_prompt`, Soniox `context`). Use
                  letters, digits, spaces, and basic punctuation; other
                  characters (such as #, /, @, or |) are rejected by the gateway
                  with INVALID_ARGUMENT (code 3).
                required: false
                properties:
                  - name: item
                    type: string
                    required: false
              - name: includeWordTimestamps
                type: boolean
                description: >-
                  If true, includes per-word timing information in the response.
                  Available for AssemblyAI and Soniox models. Not yet supported
                  for `inworld/inworld-stt-1`.
                required: false
              - name: groqConfig
                type: object
                description: Configuration for Groq streaming STT models.
                required: false
                properties:
                  - name: temperature
                    type: number
                    description: >-
                      Temperature for the model. Controls randomness in
                      predictions. Higher values produce more varied output.
                      Range: [0.0, 1.0].
                    required: false
              - name: assemblyaiConfig
                type: object
                description: Configuration for AssemblyAI streaming STT models.
                required: false
                properties:
                  - name: minEndOfTurnSilenceWhenConfident
                    type: integer
                    description: >-
                      Minimum silence duration when confidence is high
                      (milliseconds).
                    required: false
                  - name: maxTurnSilence
                    type: integer
                    description: >-
                      Maximum allowed silence before forcing a turn boundary
                      (milliseconds).
                    required: false
                  - name: vadThreshold
                    type: number
                    description: >-
                      Voice activity detection threshold. Range: [0.0, 1.0].
                      Default: 0.5.
                    required: false
                  - name: prompt
                    type: string
                    description: >-
                      Custom transcription instructions for the model. Works
                      only for Universal-3 Pro Streaming.
                    required: false
              - name: inworldSttV1Config
                type: object
                description: >-
                  Configuration for Inworld STT 1 models. Set `vadThreshold` to
                  0 to disable server-side turn detection and control turn
                  boundaries manually via `endTurn` — see the [Turn Detection
                  guide](/stt/turn-detection).
                required: false
                properties:
                  - name: minEndOfTurnSilenceWhenConfident
                    type: integer
                    description: >-
                      Minimum silence duration when confidence is high
                      (milliseconds).
                    required: false
                  - name: vadThreshold
                    type: number
                    description: >-
                      Voice activity detection threshold. Range: [0.0, 1.0].
                      Default: 0.5.
                    required: false
              - name: sonioxConfig
                type: object
                description: Configuration for Soniox streaming STT models.
                required: false
                properties:
                  - name: languageHints
                    type: array
                    description: >-
                      Language hints to guide the model. If set, they override
                      the `language` field from the main config.
                    required: false
                    properties:
                      - name: item
                        type: string
                        required: false
                  - name: languageHintsStrict
                    type: boolean
                    description: >-
                      If true, the model strongly prefers producing languages
                      only from the `languageHints` list.
                    required: false
                  - name: enableEndpointDetection
                    type: boolean
                    description: >-
                      If true, enables intelligent semantic-based end-of-turn
                      detection.
                    required: false
                  - name: maxEndpointDelayMs
                    type: integer
                    description: >-
                      Maximum allowed delay between the end of the previous turn
                      and the start of the next turn (milliseconds). Must be
                      between 500 and 5000 milliseconds. Default: 2000.
                    required: false
                  - name: context
                    type: object
                    description: Contextual information to guide the Soniox model.
                    required: false
                    properties:
                      - name: general
                        type: object
                        description: >-
                          Structured key-value information (domain, topic,
                          intent, etc.).
                        required: false
                      - name: text
                        type: string
                        description: Longer free-form background text or related documents.
                        required: false
                      - name: terms
                        type: array
                        description: Domain-specific or uncommon words.
                        required: false
                        properties:
                          - name: item
                            type: string
                            required: false
              - name: voiceProfileConfig
                type: object
                description: Configuration for voice profile detection.
                required: false
                properties:
                  - name: enableVoiceProfile
                    type: boolean
                    description: Enables the voice profile feature for this request or stream.
                    required: true
                  - name: topN
                    type: integer
                    description: >-
                      Number of top labels from each class to return. Default:
                      10.
                    required: false
        headers: []
        jsonPayloadSchema:
          type: object
          properties:
            modelId:
              type: string
              description: >-
                The identifier of the model to use for transcription. Format:
                "{provider}/{model-name}".


                Available models:

                - `inworld/inworld-stt-1` — Inworld first-party

                - `assemblyai/universal-streaming-multilingual` — AssemblyAI
                multilingual

                - `assemblyai/universal-streaming-english` — AssemblyAI English

                - `assemblyai/u3-rt-pro` — AssemblyAI high-accuracy

                - `assemblyai/whisper-rt` — AssemblyAI Whisper real-time

                - `soniox/stt-rt-v4` — Soniox real-time

                - `soniox/stt-rt-v5` — Soniox real-time (latest generation)

                - `deepgram/flux-general-en` — Deepgram Flux English
                conversational

                - `deepgram/flux-general-multi` — Deepgram Flux multilingual
                conversational


                See [STT Introduction](/stt/overview) for the full model
                catalogue.
              example: assemblyai/universal-streaming-multilingual
              x-parser-schema-id: <anonymous-schema-1>
            audioEncoding:
              type: string
              enum:
                - AUDIO_ENCODING_UNSPECIFIED
                - AUTO_DETECT
                - LINEAR16
                - MP3
                - OGG_OPUS
                - FLAC
              default: AUDIO_ENCODING_UNSPECIFIED
              description: |-
                Supported audio encoding formats.

                 - `AUDIO_ENCODING_UNSPECIFIED`: Not specified. Will return an error.
                 - `AUTO_DETECT`: Automatically detect audio encoding from the audio header.
                 - `LINEAR16`: Uncompressed 16-bit signed little-endian samples (Linear PCM).
                 - `MP3`: MP3 audio. Not supported for streaming transcription.
                 - `OGG_OPUS`: Opus encoded audio wrapped in an OGG container. Not supported for streaming transcription.
                 - `FLAC`: FLAC encoded audio. Lossless format. Not supported for streaming transcription.
              x-parser-schema-id: AudioEncoding
            language:
              type: string
              description: >-
                Language hint in ISO 639 format (e.g., "en", "ja"). Biases the
                model toward the specified language during automatic language
                detection. BCP-47 codes (e.g., "en-US") are also accepted and
                converted to the base language code. For the Inworld first-party
                model (`inworld/inworld-stt-1`), the hint additionally
                constrains the output script for English, Chinese, Cantonese,
                Japanese, Korean, Russian, and Hindi (e.g. selecting `en` keeps
                output in Latin script). See [Language Support](/stt/languages)
                for the full list of supported languages.
              example: en
              x-parser-schema-id: <anonymous-schema-2>
            sampleRateHertz:
              type: integer
              format: int32
              description: >-
                Sample rate of the audio data in Hertz. Required when the sample
                rate cannot be inferred from the audio header (e.g., raw PCM
                streams). Default: 16000.
              example: 16000
              x-parser-schema-id: <anonymous-schema-3>
            numberOfChannels:
              type: integer
              format: int32
              description: >-
                Number of channels in the audio data. Required when the number
                of channels cannot be inferred from the audio header (e.g., raw
                PCM streams). Default: 1.
              example: 1
              x-parser-schema-id: <anonymous-schema-4>
            inactivityTimeoutSeconds:
              type: integer
              format: int32
              description: >-
                Inactivity timeout in seconds. If the client is silent for this
                duration, the transcription will be stopped.
              x-parser-schema-id: <anonymous-schema-5>
            endOfTurnConfidenceThreshold:
              type: number
              format: float
              description: >-
                Confidence threshold for end-of-turn prediction. Higher values
                reduce false positives. Range: [0.0, 1.0]. Default: 0.5. See the
                [Turn Detection guide](/stt/turn-detection) for tuning guidance.
              x-parser-schema-id: <anonymous-schema-6>
            prompts:
              type: array
              items:
                type: string
                x-parser-schema-id: <anonymous-schema-8>
              description: >-
                Custom vocabulary / key terms. An array of context strings
                (names, jargon, acronyms) that bias the model toward recognizing
                these terms. This is a soft bias that helps with ambiguous or
                uncommon words; it is not a hard keyword lock and does not force
                exact output. Supported across models — the unified field maps
                to each provider's mechanism (Groq `prompt`, AssemblyAI
                `keyterms_prompt`, Soniox `context`). Use letters, digits,
                spaces, and basic punctuation; other characters (such as #, /,
                @, or |) are rejected by the gateway with INVALID_ARGUMENT (code
                3).
              x-parser-schema-id: <anonymous-schema-7>
            includeWordTimestamps:
              type: boolean
              description: >-
                If true, includes per-word timing information in the response.
                Available for AssemblyAI and Soniox models. Not yet supported
                for `inworld/inworld-stt-1`.
              x-parser-schema-id: <anonymous-schema-9>
            groqConfig:
              type: object
              properties:
                temperature:
                  type: number
                  format: float
                  description: >-
                    Temperature for the model. Controls randomness in
                    predictions. Higher values produce more varied output.
                    Range: [0.0, 1.0].
                  x-parser-schema-id: <anonymous-schema-10>
              description: Configuration for Groq streaming STT models.
              x-parser-schema-id: GroqConfig
            assemblyaiConfig:
              type: object
              properties:
                minEndOfTurnSilenceWhenConfident:
                  type: integer
                  format: int32
                  description: >-
                    Minimum silence duration when confidence is high
                    (milliseconds).
                  x-parser-schema-id: <anonymous-schema-11>
                maxTurnSilence:
                  type: integer
                  format: int32
                  description: >-
                    Maximum allowed silence before forcing a turn boundary
                    (milliseconds).
                  x-parser-schema-id: <anonymous-schema-12>
                vadThreshold:
                  type: number
                  format: float
                  description: >-
                    Voice activity detection threshold. Range: [0.0, 1.0].
                    Default: 0.5.
                  x-parser-schema-id: <anonymous-schema-13>
                prompt:
                  type: string
                  description: >-
                    Custom transcription instructions for the model. Works only
                    for Universal-3 Pro Streaming.
                  x-parser-schema-id: <anonymous-schema-14>
              description: Configuration for AssemblyAI streaming STT models.
              x-parser-schema-id: AssemblyAIConfig
            inworldSttV1Config:
              type: object
              properties:
                minEndOfTurnSilenceWhenConfident:
                  type: integer
                  format: int32
                  description: >-
                    Minimum silence duration when confidence is high
                    (milliseconds).
                  x-parser-schema-id: <anonymous-schema-15>
                vadThreshold:
                  type: number
                  format: float
                  description: >-
                    Voice activity detection threshold. Range: [0.0, 1.0].
                    Default: 0.5.
                  x-parser-schema-id: <anonymous-schema-16>
              description: >-
                Configuration for Inworld STT 1 models. Set `vadThreshold` to 0
                to disable server-side turn detection and control turn
                boundaries manually via `endTurn` — see the [Turn Detection
                guide](/stt/turn-detection).
              x-parser-schema-id: InworldSttV1Config
            sonioxConfig:
              type: object
              properties:
                languageHints:
                  type: array
                  items:
                    type: string
                    x-parser-schema-id: <anonymous-schema-18>
                  description: >-
                    Language hints to guide the model. If set, they override the
                    `language` field from the main config.
                  x-parser-schema-id: <anonymous-schema-17>
                languageHintsStrict:
                  type: boolean
                  description: >-
                    If true, the model strongly prefers producing languages only
                    from the `languageHints` list.
                  x-parser-schema-id: <anonymous-schema-19>
                enableEndpointDetection:
                  type: boolean
                  description: >-
                    If true, enables intelligent semantic-based end-of-turn
                    detection.
                  x-parser-schema-id: <anonymous-schema-20>
                maxEndpointDelayMs:
                  type: integer
                  format: int32
                  description: >-
                    Maximum allowed delay between the end of the previous turn
                    and the start of the next turn (milliseconds). Must be
                    between 500 and 5000 milliseconds. Default: 2000.
                  x-parser-schema-id: <anonymous-schema-21>
                context:
                  type: object
                  properties:
                    general:
                      type: object
                      additionalProperties:
                        type: string
                        x-parser-schema-id: <anonymous-schema-23>
                      description: >-
                        Structured key-value information (domain, topic, intent,
                        etc.).
                      x-parser-schema-id: <anonymous-schema-22>
                    text:
                      type: string
                      description: Longer free-form background text or related documents.
                      x-parser-schema-id: <anonymous-schema-24>
                    terms:
                      type: array
                      items:
                        type: string
                        x-parser-schema-id: <anonymous-schema-26>
                      description: Domain-specific or uncommon words.
                      x-parser-schema-id: <anonymous-schema-25>
                  description: Contextual information to guide the Soniox model.
                  x-parser-schema-id: SonioxConfigContext
              description: Configuration for Soniox streaming STT models.
              x-parser-schema-id: SonioxConfig
            voiceProfileConfig:
              type: object
              properties:
                enableVoiceProfile:
                  type: boolean
                  description: Enables the voice profile feature for this request or stream.
                  x-parser-schema-id: <anonymous-schema-27>
                topN:
                  type: integer
                  format: int32
                  description: 'Number of top labels from each class to return. Default: 10.'
                  x-parser-schema-id: <anonymous-schema-28>
              required:
                - enableVoiceProfile
              description: Configuration for voice profile detection.
              x-parser-schema-id: VoiceProfileConfig
          required:
            - modelId
            - audioEncoding
          examples:
            - modelId: assemblyai/universal-streaming-multilingual
              audioEncoding: LINEAR16
              sampleRateHertz: 16000
              language: en-US
          x-parser-schema-id: TranscribeConfigPayload
        title: Transcribe config
        description: >-
          Configure the transcription session. Must be the first message sent.
          Contains model selection, audio format settings, and optional feature
          configurations.
        example: |-
          {
            "modelId": "assemblyai/universal-streaming-multilingual",
            "audioEncoding": "LINEAR16",
            "sampleRateHertz": 16000,
            "language": "en-US"
          }
        bindings: []
        extensions:
          - id: x-parser-unique-object-id
            value: transcribeConfig
      - &ref_4
        id: audioChunk
        payload:
          - name: audioChunk
            description: >-
              Send a chunk of audio data for transcription. Must be sent after
              the initial transcribe config message.
            type: object
            properties:
              - name: content
                type: string
                description: >-
                  The raw audio bytes in the encoding specified by the
                  transcribe config's audioEncoding.
                required: true
        headers: []
        jsonPayloadSchema:
          type: object
          properties:
            content:
              type: string
              format: byte
              description: >-
                The raw audio bytes in the encoding specified by the transcribe
                config's audioEncoding.
              example: <YOUR_AUDIO>
              x-parser-schema-id: <anonymous-schema-29>
          required:
            - content
          examples:
            - content: <YOUR_AUDIO>
          x-parser-schema-id: AudioChunkPayload
        title: Audio chunk
        description: >-
          Send a chunk of audio data for transcription. Must be sent after the
          initial transcribe config message.
        example: |-
          {
            "content": "<YOUR_AUDIO>"
          }
        bindings: []
        extensions:
          - id: x-parser-unique-object-id
            value: audioChunk
      - &ref_5
        id: endTurn
        payload:
          - name: endTurn
            description: >-
              Signal the end of a speaker's turn. Some providers do not support
              manual turn-taking; for those providers, sending this message will
              have no effect. See the [Turn Detection
              guide](/stt/turn-detection) for automatic vs. manual turn control.
            type: object
            properties: []
        headers: []
        jsonPayloadSchema:
          type: object
          properties: {}
          examples:
            - {}
          x-parser-schema-id: EndTurnPayload
        title: End turn
        description: >-
          Signal the end of a speaker's turn. Some providers do not support
          manual turn-taking; for those providers, sending this message will
          have no effect. See the [Turn Detection guide](/stt/turn-detection)
          for automatic vs. manual turn control.
        example: '{}'
        bindings: []
        extensions:
          - id: x-parser-unique-object-id
            value: endTurn
      - &ref_6
        id: closeStream
        payload:
          - name: closeStream
            description: >-
              Signal that the client is done sending audio data. Required for
              HTTP/WebSocket clients since there is no equivalent to gRPC stream
              close.
            type: object
            properties: []
        headers: []
        jsonPayloadSchema:
          type: object
          properties: {}
          examples:
            - {}
          x-parser-schema-id: CloseStreamPayload
        title: Close stream
        description: >-
          Signal that the client is done sending audio data. Required for
          HTTP/WebSocket clients since there is no equivalent to gRPC stream
          close.
        example: '{}'
        bindings: []
        extensions:
          - id: x-parser-unique-object-id
            value: closeStream
    bindings: []
    extensions: &ref_0
      - id: x-parser-unique-object-id
        value: sttStream
  - &ref_2
    id: receiveResponse
    title: Receive response
    type: send
    messages:
      - &ref_7
        id: transcription
        payload:
          - name: transcription
            description: >-
              Transcription result streamed back as audio is processed. May be
              an interim (partial) result or a final result depending on the
              `isFinal` field.
            type: object
            properties:
              - name: transcript
                type: string
                description: Full transcribed text for this segment.
                required: false
              - name: isFinal
                type: boolean
                description: >-
                  Indicates whether this is a finalized result or an interim
                  (partial) result that may be updated as more audio is
                  processed.
                required: false
              - name: wordTimestamps
                type: array
                description: >-
                  Per-word timing and confidence data. Only populated when
                  `includeWordTimestamps` is enabled. Available for AssemblyAI
                  and Soniox models. Not yet supported for
                  `inworld/inworld-stt-1`.
                required: false
                properties:
                  - name: word
                    type: string
                    description: The transcribed word.
                    required: false
                  - name: confidence
                    type: number
                    description: >-
                      Recognition confidence score for this word. Range: [0.0,
                      1.0].
                    required: false
                  - name: startTimeMs
                    type: integer
                    description: >-
                      Offset from the beginning of the audio to the start of
                      this word, in milliseconds.
                    required: false
                  - name: endTimeMs
                    type: integer
                    description: >-
                      Offset from the beginning of the audio to the end of this
                      word, in milliseconds.
                    required: false
        headers: []
        jsonPayloadSchema:
          type: object
          properties:
            transcript:
              type: string
              description: Full transcribed text for this segment.
              x-parser-schema-id: <anonymous-schema-30>
            isFinal:
              type: boolean
              description: >-
                Indicates whether this is a finalized result or an interim
                (partial) result that may be updated as more audio is processed.
              x-parser-schema-id: <anonymous-schema-31>
            wordTimestamps:
              type: array
              items:
                type: object
                properties:
                  word:
                    type: string
                    description: The transcribed word.
                    x-parser-schema-id: <anonymous-schema-33>
                  confidence:
                    type: number
                    format: float
                    description: >-
                      Recognition confidence score for this word. Range: [0.0,
                      1.0].
                    x-parser-schema-id: <anonymous-schema-34>
                  startTimeMs:
                    type: integer
                    format: int32
                    description: >-
                      Offset from the beginning of the audio to the start of
                      this word, in milliseconds.
                    x-parser-schema-id: <anonymous-schema-35>
                  endTimeMs:
                    type: integer
                    format: int32
                    description: >-
                      Offset from the beginning of the audio to the end of this
                      word, in milliseconds.
                    x-parser-schema-id: <anonymous-schema-36>
                x-parser-schema-id: WordTimestamp
              description: >-
                Per-word timing and confidence data. Only populated when
                `includeWordTimestamps` is enabled. Available for AssemblyAI and
                Soniox models. Not yet supported for `inworld/inworld-stt-1`.
              x-parser-schema-id: <anonymous-schema-32>
          examples:
            - transcript: Hello, this is a test transcription.
              isFinal: true
              wordTimestamps: []
            - transcript: Hello, this is
              isFinal: false
              wordTimestamps: []
          x-parser-schema-id: TranscriptionResponsePayload
        title: Transcription
        description: >-
          Transcription result streamed back as audio is processed. May be an
          interim (partial) result or a final result depending on the `isFinal`
          field.
        example: |-
          {
            "transcript": "Hello, this is a test transcription.",
            "isFinal": true,
            "wordTimestamps": []
          }
        bindings: []
        extensions:
          - id: x-parser-unique-object-id
            value: transcription
      - &ref_8
        id: usage
        payload:
          - name: usage
            description: >-
              Usage metrics for billing and monitoring purposes. **Coming
              soon**: this field is not yet populated.
            type: object
            properties:
              - name: transcribedAudioMs
                type: integer
                description: The duration of the transcribed audio in milliseconds.
                required: false
              - name: modelId
                type: string
                description: The identifier of the model used for transcription.
                required: false
        headers: []
        jsonPayloadSchema:
          type: object
          properties:
            transcribedAudioMs:
              type: integer
              format: int32
              description: The duration of the transcribed audio in milliseconds.
              x-parser-schema-id: <anonymous-schema-37>
            modelId:
              type: string
              description: The identifier of the model used for transcription.
              x-parser-schema-id: <anonymous-schema-38>
          description: >-
            Usage metrics for billing and monitoring purposes. **Coming soon**:
            this field is not yet populated.
          x-parser-schema-id: UsageResponsePayload
        title: Usage
        description: >-
          Usage metrics for billing and monitoring purposes. **Coming soon**:
          this field is not yet populated.
        example: |-
          {
            "transcribedAudioMs": 123,
            "modelId": "<string>"
          }
        bindings: []
        extensions:
          - id: x-parser-unique-object-id
            value: usage
      - &ref_9
        id: speechStarted
        payload:
          - name: speechStarted
            description: >-
              Signals that a speaker has started talking. Sent when voice
              activity is detected in the audio stream.
            type: object
            properties:
              - name: startTimeMs
                type: integer
                description: The timestamp of the start of the speech in milliseconds.
                required: false
              - name: confidence
                type: number
                description: The confidence score of the speech detection.
                required: false
        headers: []
        jsonPayloadSchema:
          type: object
          properties:
            startTimeMs:
              type: integer
              format: int32
              description: The timestamp of the start of the speech in milliseconds.
              x-parser-schema-id: <anonymous-schema-39>
            confidence:
              type: number
              format: float
              description: The confidence score of the speech detection.
              x-parser-schema-id: <anonymous-schema-40>
          examples:
            - startTimeMs: 1250
              confidence: 0.95
          x-parser-schema-id: SpeechStartedResponsePayload
        title: Speech started
        description: >-
          Signals that a speaker has started talking. Sent when voice activity
          is detected in the audio stream.
        example: |-
          {
            "startTimeMs": 1250,
            "confidence": 0.95
          }
        bindings: []
        extensions:
          - id: x-parser-unique-object-id
            value: speechStarted
      - &ref_10
        id: speechStopped
        payload:
          - name: speechStopped
            description: >-
              Signal raised when STT detects silence after speech has stopped.
              Useful for tracking pauses and implementing custom turn-taking
              logic.
            type: object
            properties:
              - name: silenceDurationMs
                type: integer
                description: >-
                  The duration of silence detected after speech stopped, in
                  milliseconds.
                required: false
        headers: []
        jsonPayloadSchema:
          type: object
          properties:
            silenceDurationMs:
              type: integer
              format: int32
              description: >-
                The duration of silence detected after speech stopped, in
                milliseconds.
              x-parser-schema-id: <anonymous-schema-41>
          examples:
            - silenceDurationMs: 750
          x-parser-schema-id: SpeechStoppedResponsePayload
        title: Speech stopped
        description: >-
          Signal raised when STT detects silence after speech has stopped.
          Useful for tracking pauses and implementing custom turn-taking logic.
        example: |-
          {
            "silenceDurationMs": 750
          }
        bindings: []
        extensions:
          - id: x-parser-unique-object-id
            value: speechStopped
    bindings: []
    extensions: *ref_0
sendOperations:
  - *ref_1
receiveOperations:
  - *ref_2
sendMessages:
  - *ref_3
  - *ref_4
  - *ref_5
  - *ref_6
receiveMessages:
  - *ref_7
  - *ref_8
  - *ref_9
  - *ref_10
extensions:
  - id: x-parser-unique-object-id
    value: sttStream
securitySchemes:
  - id: auth
    name: authorization
    type: httpApiKey
    description: >-
      Your [authentication](../../../api-reference/introduction) credentials.
      For Basic authentication, please populate `Basic $INWORLD_API_KEY`. You
      can create a key in one command with the [Inworld
      CLI](../../../tts/resources/inworld-cli): `inworld workspace add-key`.
    in: query
    extensions: []

````
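
The response messages defined in the spec above (`transcription`, `usage`, `speechStarted`, `speechStopped`) could be handled on the client roughly as follows. This is a minimal sketch, not an official client: it assumes each message arrives as a JSON object keyed by its message id (for example `{"transcription": {...}}`), which the spec does not confirm since it only shows the inner payloads. `TranscriptState`, `handle_message`, and `close_stream_message` are hypothetical names introduced for illustration. The sketch performs no network I/O, so it can be wired to any WebSocket library.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TranscriptState:
    """Collects finalized segments and tracks the latest interim text."""
    final_segments: List[str] = field(default_factory=list)
    interim: str = ""
    speaking: bool = False
    last_silence_ms: Optional[int] = None

    def text(self) -> str:
        """Finalized text followed by the current interim text, if any."""
        parts = self.final_segments + ([self.interim] if self.interim else [])
        return " ".join(parts)


def handle_message(state: TranscriptState, raw: str) -> TranscriptState:
    """Apply one server message (a JSON string) to the state.

    ASSUMPTION: the message is an object keyed by its message id,
    e.g. {"transcription": {...}}. Adjust if the real envelope differs.
    """
    msg = json.loads(raw)
    if "transcription" in msg:
        payload = msg["transcription"]
        # All payload fields are marked `required: false`, so use defaults.
        transcript = payload.get("transcript", "")
        if payload.get("isFinal", False):
            if transcript:
                state.final_segments.append(transcript)
            state.interim = ""  # the final result supersedes interim text
        else:
            state.interim = transcript  # interim results may still change
    elif "speechStarted" in msg:
        state.speaking = True
    elif "speechStopped" in msg:
        state.speaking = False
        state.last_silence_ms = msg["speechStopped"].get("silenceDurationMs")
    # "usage" is documented as not yet populated, so it is ignored here.
    return state


def close_stream_message() -> str:
    """Build the closeStream message (empty payload, per the spec example).

    ASSUMPTION: same keyed envelope as the server messages above.
    """
    return json.dumps({"closeStream": {}})
```

Interim results replace one another, while each final result is appended, so `state.text()` always reflects the best transcript available so far. The `speaking` flag and `last_silence_ms` value are the kind of inputs custom turn-taking logic might use, for example to decide when to send `endTurn`.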