# Speech To Text (STT) Primitive Demo

The Speech To Text (STT) template demonstrates how to transcribe microphone audio using the STT primitive.

This demo also uses a Voice Activity Detection (VAD) module to detect when the player is speaking.

## Run the Template

1. Go to `Assets/InworldRuntime/Scenes/Primitives` and play the `STTTemplate` scene.
   <img src="https://mintcdn.com/inworldai/pDD5vvrZThONehMe/img/unity/framework/STT00.png?fit=max&auto=format&n=pDD5vvrZThONehMe&q=85&s=f71560a6bbfe8503c5bb8258ca4a3983" alt="STT00" width="938" height="537" data-path="img/unity/framework/STT00.png" />
2. When the game starts, stay quiet for a moment to let the microphone calibrate to background noise.
3. Once you see the `Calibrated` message, speak into the microphone.
4. You'll see the transcribed text appear on screen.
   <img src="https://mintcdn.com/inworldai/pDD5vvrZThONehMe/img/unity/framework/STT.gif?s=eb8e758e22c8d5c6344524e174e030b0" alt="STT" width="1920" height="1080" data-path="img/unity/framework/STT.gif" />

## Understanding the Template

### Structure

* This demo has two prefabs under `InworldController`: `STT` (contains `InworldSTTModule`) and `VAD` (contains `InworldVADModule`).
* When `InworldController` initializes, it calls `InitializeAsync()` on both modules (see [Primitives Overview](./overview)).
* These calls create an `STTFactory` and a `VADFactory`; each factory then creates its `STTInterface` or `VADInterface` from the current `STT/VADConfig`.

<img src="https://mintcdn.com/inworldai/pDD5vvrZThONehMe/img/unity/framework/STT01.png?fit=max&auto=format&n=pDD5vvrZThONehMe&q=85&s=07573897fa37736f3d0cb8c81dbebcc4" alt="STT01" width="1200" height="447" data-path="img/unity/framework/STT01.png" />

### InworldAudioManager

`InworldAudioManager` handles audio processing and is also modular. In this demo, it uses four components:

* **AudioCapturer**: Manages microphone on/off and input devices. Uses Unity's `Microphone` by default, and can be extended via third‑party plugins.
* **AudioCollector**: Collects raw samples from the microphone.
* **PlayerVoiceDetector**: Implements `IPlayerAudioEventHandler` and `ICalibrateAudioHandler` to emit player audio events and decide which timestamped segments to keep from the stream.

<Tip>
  For example, `TurnBasedVoiceDetector` automatically pauses capture while the character is speaking to prevent echo.

  In this demo, `VoiceActivityDetector` extends `PlayerVoiceDetector` and leverages an AI model to accurately detect when the player is speaking.
</Tip>

* **AudioDispatcher**: Sends the captured microphone data for downstream processing.

<img src="https://mintcdn.com/inworldai/PEMIBdkx0YyDrDSz/img/unity/framework/AudioManager.png?fit=max&auto=format&n=PEMIBdkx0YyDrDSz&q=85&s=d4513a92cab3cb9aaae2101bc80fba07" alt="AudioManager" width="1203" height="546" data-path="img/unity/framework/AudioManager.png" />

### Workflow

**Audio Thread:**\
At startup, the microphone calibrates to background noise.

The VAD (Voice Activity Detection) module listens for speech, and when speech is detected, the `AudioDispatcher` streams audio frames to the STT module.

Both partial and final transcriptions are produced and displayed in the UI.

Because this section focuses on STT, audio capture is covered in detail later.
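The calibrate-then-gate flow described above can be illustrated with a much simpler stand-in. The demo's VAD relies on an AI model; the sketch below swaps in a plain energy threshold instead, purely to show the idea of measuring background noise first and then forwarding only frames that rise above it. `EnergyGate` and all of its members are hypothetical names, not part of the Inworld SDK.

```c#
using System;

// Hypothetical helper, NOT part of the Inworld SDK.
// Energy-threshold stand-in for the model-based VAD used in the demo.
public sealed class EnergyGate
{
    readonly float m_Margin;   // how far above the noise floor counts as speech
    float m_RmsSum;            // running sum of RMS values seen during calibration
    int m_FrameCount;          // number of calibration frames seen

    public EnergyGate(float margin = 3f)
    {
        m_Margin = margin;
    }

    // Average RMS of all calibration frames (0 until calibrated).
    public float NoiseFloor => m_FrameCount == 0 ? 0f : m_RmsSum / m_FrameCount;

    // Root-mean-square level of one frame of samples.
    public static float Rms(float[] frame)
    {
        if (frame.Length == 0)
            return 0f;
        double sum = 0;
        foreach (float s in frame)
            sum += s * s;
        return (float)Math.Sqrt(sum / frame.Length);
    }

    // Call with frames captured while the player stays quiet.
    public void Calibrate(float[] silentFrame)
    {
        m_RmsSum += Rms(silentFrame);
        m_FrameCount++;
    }

    // True when the frame is clearly louder than the calibrated noise floor.
    public bool IsSpeech(float[] frame)
    {
        return m_FrameCount > 0 && Rms(frame) > NoiseFloor * m_Margin;
    }
}
```

In a flow like the one above, only frames for which `IsSpeech` returns true would be forwarded for transcription. A fixed energy margin is far less robust than a model-based detector, which is presumably why the demo uses one.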

**Main Thread:**\
In this demo's `STTCanvas`, a listener for each audio-thread event is registered in the `OnEnable` method.

Certain simple events, such as starting or stopping calibration, are handled directly (for example, updating on-screen text):

```c# STTCanvas.cs theme={"system"}
void OnEnable()
{
    if (!m_Audio)
        return;
    m_Audio.Event.onStartCalibrating.AddListener(() => Title("Calibrating"));
    m_Audio.Event.onStopCalibrating.AddListener(Calibrated);
    m_Audio.Event.onPlayerStartSpeaking.AddListener(() => Title("PlayerSpeaking"));
    m_Audio.Event.onPlayerStopSpeaking.AddListener(() =>
    {
        Title("");
        if (m_STTResult)
            m_STTResult.text = "";
    });
    m_Audio.Event.onAudioSent.AddListener((audioData) =>
    {
        AudioChunk chunk = new AudioChunk();
        InworldVector<float> floatArray = new InworldVector<float>();
        foreach (float data in audioData)
        {
            floatArray.Add(data);
        }
        chunk.SampleRate = 16000;
        chunk.Data = floatArray;
        _ = InworldController.STT.RecognizeSpeechAsync(chunk);
    });
    InworldController.Instance.OnFrameworkInitialized += OnFrameworkInitialized;
    InworldController.STT.OnTaskFinished += OnSpeechRecognized;
}
```

When the `onAudioSent` event fires, the handler assembles the audio data into an `AudioChunk` and calls `InworldController.STT.RecognizeSpeechAsync()`. The audio should be resampled to mono at a sample rate of 16,000 Hz, which is the value the handler sets in `chunk.SampleRate`.

This function checks whether the STT module exists and has been initialized (i.e., the `STTInterface` is valid).

If so, it calls `sttInterface.RecognizeSpeech` and returns the transcription string, which `STTCanvas` then displays (it subscribes `OnSpeechRecognized` to `InworldController.STT.OnTaskFinished` in `OnEnable`).

```c# InworldController.cs theme={"system"}
public async Awaitable<string> RecognizeSpeechAsync(AudioChunk audioChunk)
{
    string result = "";
    if (!Initialized || !(m_Interface is STTInterface sttInterface))
        return result;
    m_SpeechRecognitionConfig ??= new SpeechRecognitionConfig();
    if (m_InputStream != null)
    {
        m_InputStream.Dispose();
        m_InputStream = null;
    }
    m_InputStream ??= sttInterface.RecognizeSpeech(audioChunk, m_SpeechRecognitionConfig);
    ...
```
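The `onAudioSent` handler shown earlier expects mono audio at 16,000 Hz. If the input device delivers a different format, the samples need converting first. The following is a minimal sketch of one way to do that in plain C#, assuming interleaved float samples; `AudioPrep`, `ToMono`, and `Resample` are hypothetical names, not SDK APIs. It uses linear interpolation with no anti-aliasing filter, so treat it as an illustration rather than production-quality resampling.

```c#
using System;

// Hypothetical helper, NOT part of the Inworld SDK.
public static class AudioPrep
{
    // Averages interleaved channels into a single mono channel.
    public static float[] ToMono(float[] interleaved, int channels)
    {
        if (channels <= 0)
            throw new ArgumentOutOfRangeException(nameof(channels));
        if (channels == 1)
            return (float[])interleaved.Clone();

        int frames = interleaved.Length / channels;
        float[] mono = new float[frames];
        for (int f = 0; f < frames; f++)
        {
            float sum = 0f;
            for (int c = 0; c < channels; c++)
                sum += interleaved[f * channels + c];
            mono[f] = sum / channels;
        }
        return mono;
    }

    // Resamples mono audio by linear interpolation (no low-pass filtering).
    public static float[] Resample(float[] mono, int sourceRate, int targetRate = 16000)
    {
        if (sourceRate <= 0 || targetRate <= 0)
            throw new ArgumentOutOfRangeException(nameof(sourceRate));
        if (sourceRate == targetRate || mono.Length == 0)
            return (float[])mono.Clone();

        int outLength = (int)((long)mono.Length * targetRate / sourceRate);
        float[] output = new float[outLength];
        double step = (double)sourceRate / targetRate;
        for (int i = 0; i < outLength; i++)
        {
            double pos = i * step;                        // position in the source signal
            int i0 = Math.Min((int)pos, mono.Length - 1); // sample at or before pos
            int i1 = Math.Min(i0 + 1, mono.Length - 1);   // next sample, clamped at the end
            float t = (float)(pos - i0);                  // fractional distance between them
            output[i] = mono[i0] + (mono[i1] - mono[i0]) * t;
        }
        return output;
    }
}
```

The converted samples could then be copied into the `InworldVector<float>` assigned to `chunk.Data`, in the same way the handler copies `audioData`.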
