The node-stt template demonstrates how to convert speech to text using the STT (Speech-to-Text) node.
Architecture
  • Backend: Inworld Runtime
  • Frontend: N/A (CLI example)

Run the Template

  1. Clone the templates repository:
    git clone https://github.com/inworld-ai/inworld-runtime-templates-node
    cd inworld-runtime-templates-node
    
  2. Install the Runtime SDK in the cli directory:
    yarn add @inworld/runtime
    
  3. Set up your Base64 Runtime API key: copy the .env-sample file to a .env file in the cli folder and add your API key.
    .env
    # Inworld Runtime Base64 API key
    INWORLD_API_KEY=<your_api_key_here>
    
  4. Run the template from your console, providing the path to a WAV audio file:
    yarn node-stt --audioFilePath=path/to/your/audio.wav

Understanding the Template

The main functionality of the template is contained in the run function, which demonstrates how to use the Inworld Runtime to convert speech to text with the STT node. The sections below break it down step by step.
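
As a condensed overview, the run function looks roughly like this. It is assembled from the snippets covered in the steps below; the actual template also validates arguments and handles errors:
const run = async () => {
  // 1) Read and decode the input WAV file
  const { audioFilePath, apiKey } = parseArgs();
  const audioData = await WavDecoder.decode(fs.readFileSync(audioFilePath));

  // 2) + 3) Create the STT node and build a single-node graph around it
  const sttNode = new RemoteSTTNode();
  const graph = new GraphBuilder({ id: 'node_stt_graph', apiKey, enableRemoteConfig: false })
    .addNode(sttNode)
    .setStartNode(sttNode)
    .setEndNode(sttNode)
    .build();

  // 4) Execute the graph with the audio as input
  const { outputStream } = await graph.start(
    new GraphTypes.Audio({
      data: Array.from(audioData.channelData[0] || []),
      sampleRate: audioData.sampleRate,
    }),
  );

  // 5) Collect the transcription from the output stream (see step 5 below)
};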

1) Audio input preparation

First, we read and decode the WAV audio file to prepare it for processing:
const { audioFilePath, apiKey } = parseArgs();

const audioData = await WavDecoder.decode(fs.readFileSync(audioFilePath));
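
The parseArgs helper is defined elsewhere in the template and is not shown here. As a rough idea of what it provides, here is a minimal, hypothetical sketch that reads the --audioFilePath flag from the command line and the API key from the INWORLD_API_KEY environment variable (assuming .env has been loaded, e.g. via the dotenv package); the template's actual helper may differ:
import * as fs from 'fs';
import 'dotenv/config'; // assumption: .env is loaded via the dotenv package

// Hypothetical sketch of parseArgs; the template's actual helper may differ.
function parseArgs(): { audioFilePath: string; apiKey: string } {
  const flag = process.argv.find((arg) => arg.startsWith('--audioFilePath='));
  const audioFilePath = flag ? flag.split('=')[1] : '';
  const apiKey = process.env.INWORLD_API_KEY || '';

  if (!audioFilePath || !fs.existsSync(audioFilePath)) {
    throw new Error('Provide a WAV file via --audioFilePath=path/to/your/audio.wav');
  }
  if (!apiKey) {
    throw new Error('Set INWORLD_API_KEY in your .env file');
  }

  return { audioFilePath, apiKey };
}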

2) Node Initialization

Then, we create the STT node:
const sttNode = new RemoteSTTNode();

3) Graph initialization

Next, we create the graph using the GraphBuilder, adding the STT node and setting it as both the start and end node:
const graph = new GraphBuilder({
  id: 'node_stt_graph',
  apiKey,
  enableRemoteConfig: false,
})
  .addNode(sttNode)
  .setStartNode(sttNode)
  .setEndNode(sttNode)
  .build();
The GraphBuilder configuration includes:
  • id: A unique identifier for the graph
  • apiKey: Your Inworld API key for authentication
  • enableRemoteConfig: Whether to enable remote configuration (set to false for local execution)
In this example, there is only a single STT node, so it serves as both the start and end node. In more complex applications, you could connect multiple nodes to create a processing pipeline, as sketched below.
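
For illustration only, a more complex graph might route the STT output into a downstream node. The sketch below assumes GraphBuilder exposes an addEdge method and that some second node (here called nextNode) exists, so treat it as a hypothetical outline rather than part of this template:
// Hypothetical two-node pipeline (not part of this template).
// Assumes an addEdge(from, to) method and a downstream node instance.
const pipelineGraph = new GraphBuilder({
  id: 'stt_pipeline_graph',
  apiKey,
  enableRemoteConfig: false,
})
  .addNode(sttNode)
  .addNode(nextNode)          // e.g. a text-processing or LLM node
  .addEdge(sttNode, nextNode) // transcribed text flows to the next node
  .setStartNode(sttNode)
  .setEndNode(nextNode)
  .build();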

4) Graph execution

Now we execute the graph, passing the audio data directly as the input object:
const { outputStream } = await graph.start(
  new GraphTypes.Audio({
    data: Array.from(audioData.channelData[0] || []),
    sampleRate: audioData.sampleRate,
  }),
);
The audio input is wrapped in a GraphTypes.Audio object that contains:
  • data: The audio channel data converted to an array
  • sampleRate: The sample rate of the audio file
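
Note that the template uses channelData[0], i.e. only the first channel of the decoded WAV. If your input is stereo, one option is to downmix to mono before wrapping it; this is an optional sketch, not part of the template:
// Optional pre-processing: average left and right channels into mono
// before constructing the Audio input (sketch, not in the template).
const [left, right] = audioData.channelData;
const mono = right ? left.map((sample, i) => (sample + right[i]) / 2) : left;

const input = new GraphTypes.Audio({
  data: Array.from(mono),
  sampleRate: audioData.sampleRate,
});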

5) Response handling

The transcription results are handled using the processResponse method, which supports both streaming and non-streaming text responses:
let result = '';
let resultCount = 0;

for await (const resp of outputStream) {
  await resp.processResponse({
    string: (text: string) => {
      result += text;
      resultCount++;
    },
    TextStream: async (textStream: any) => {
      for await (const chunk of textStream) {
        if (chunk.text) {
          result += chunk.text;
          resultCount++;
        }
      }
    },
    default: (data: any) => {
      if (typeof data === 'string') {
        result += data;
        resultCount++;
      } else {
        console.log('Unprocessed response:', data);
      }
    },
  });
}

console.log(`Result count: ${resultCount}`);
console.log(`Result: ${result}`);
The response handler supports multiple response types:
  • string: Direct string responses containing transcribed text
  • TextStream: Streaming text responses for real-time transcription
  • default: Fallback handler for any other response types