The node-stt template demonstrates how to convert speech to text using the STT (Speech-to-Text) node.
Architecture
  • Backend: Inworld Runtime
  • Frontend: N/A (CLI example)

Run the Template

  1. Clone the templates repository:
    git clone https://github.com/inworld-ai/inworld-runtime-templates-node
    cd inworld-runtime-templates-node
    
  2. Install the Runtime SDK in the cli directory:
    yarn add @inworld/runtime
    
  3. Set up your Base64 Runtime API key: copy the .env-sample file to a .env file in the cli folder and add your API key.
    .env
    # Inworld Runtime Base64 API key
    INWORLD_API_KEY=<your_api_key_here>
    
  4. Run the template from your console, providing the path to a WAV audio file:
    yarn node-stt --audioFilePath=path/to/your/audio.wav

Understanding the Template

The main functionality of the template is contained in the run function, which demonstrates how to use the Inworld Runtime to convert speech to text with the STT node. The sections below break it down step by step.
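
As a condensed overview, the run function looks roughly like this. It is assembled from the snippets covered in the steps below; the actual template also validates arguments and handles errors:
const run = async () => {
  // 1) Read and decode the input WAV file
  const { audioFilePath, apiKey } = parseArgs();
  const audioData = await WavDecoder.decode(fs.readFileSync(audioFilePath));

  // 2) + 3) Create the STT node and build a single-node graph around it
  const sttNode = new RemoteSTTNode();
  const graph = new GraphBuilder({ id: 'node_stt_graph', apiKey, enableRemoteConfig: false })
    .addNode(sttNode)
    .setStartNode(sttNode)
    .setEndNode(sttNode)
    .build();

  // 4) Execute the graph with the audio as input
  const { outputStream } = await graph.start(
    new GraphTypes.Audio({
      data: Array.from(audioData.channelData[0] || []),
      sampleRate: audioData.sampleRate,
    }),
  );

  // 5) Collect the transcription from the output stream (see step 5 below)
};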

1) Audio input preparation

First, we read and decode the WAV audio file to prepare it for processing:
const { audioFilePath, apiKey } = parseArgs();

const audioData = await WavDecoder.decode(fs.readFileSync(audioFilePath));
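
The parseArgs helper is defined elsewhere in the template and is not shown here. As a rough idea of what it provides, here is a minimal, hypothetical sketch that reads the --audioFilePath flag from the command line and the API key from the INWORLD_API_KEY environment variable (assuming .env has been loaded, e.g. via the dotenv package); the template's actual helper may differ:
import * as fs from 'fs';
import 'dotenv/config'; // assumption: .env is loaded via the dotenv package

// Hypothetical sketch of parseArgs; the template's actual helper may differ.
function parseArgs(): { audioFilePath: string; apiKey: string } {
  const flag = process.argv.find((arg) => arg.startsWith('--audioFilePath='));
  const audioFilePath = flag ? flag.split('=')[1] : '';
  const apiKey = process.env.INWORLD_API_KEY || '';

  if (!audioFilePath || !fs.existsSync(audioFilePath)) {
    throw new Error('Provide a WAV file via --audioFilePath=path/to/your/audio.wav');
  }
  if (!apiKey) {
    throw new Error('Set INWORLD_API_KEY in your .env file');
  }

  return { audioFilePath, apiKey };
}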

2) Node Initialization

Then, we create the STT node:
const sttNode = new RemoteSTTNode();

3) Graph initialization

Next, we create the graph using the GraphBuilder, adding the STT node and setting it as both the start and end node:
const graph = new GraphBuilder({
  id: 'node_stt_graph',
  apiKey,
  enableRemoteConfig: false,
})
  .addNode(sttNode)
  .setStartNode(sttNode)
  .setEndNode(sttNode)
  .build();
The GraphBuilder configuration includes:
  • id: A unique identifier for the graph
  • apiKey: Your Inworld API key for authentication
  • enableRemoteConfig: Whether to enable remote configuration (set to false for local execution)
In this example, there is only a single STT node, so it serves as both the start and end node. In more complex applications, you could connect multiple nodes to create a processing pipeline, as sketched below.
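
For illustration only, a more complex graph might route the STT output into a downstream node. The sketch below assumes GraphBuilder exposes an addEdge method and that some second node (here called nextNode) exists, so treat it as a hypothetical outline rather than part of this template:
// Hypothetical two-node pipeline (not part of this template).
// Assumes an addEdge(from, to) method and a downstream node instance.
const pipelineGraph = new GraphBuilder({
  id: 'stt_pipeline_graph',
  apiKey,
  enableRemoteConfig: false,
})
  .addNode(sttNode)
  .addNode(nextNode)          // e.g. a text-processing or LLM node
  .addEdge(sttNode, nextNode) // transcribed text flows to the next node
  .setStartNode(sttNode)
  .setEndNode(nextNode)
  .build();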

4) Graph execution

Now we execute the graph, passing the audio data directly as the input object:
const { outputStream } = await graph.start(
  new GraphTypes.Audio({
    data: Array.from(audioData.channelData[0] || []),
    sampleRate: audioData.sampleRate,
  }),
);
The audio input is wrapped in a GraphTypes.Audio object that contains:
  • data: The audio channel data converted to an array
  • sampleRate: The sample rate of the audio file
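
Note that the template uses channelData[0], i.e. only the first channel of the decoded WAV. If your input is stereo, one option is to downmix to mono before wrapping it; this is an optional sketch, not part of the template:
// Optional pre-processing: average left and right channels into mono
// before constructing the Audio input (sketch, not in the template).
const [left, right] = audioData.channelData;
const mono = right ? left.map((sample, i) => (sample + right[i]) / 2) : left;

const input = new GraphTypes.Audio({
  data: Array.from(mono),
  sampleRate: audioData.sampleRate,
});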

5) Response handling

The transcription results are handled using the processResponse method, which supports both streaming and non-streaming text responses:
let result = '';
let resultCount = 0;

for await (const resp of outputStream) {
  await resp.processResponse({
    string: (text: string) => {
      result += text;
      resultCount++;
    },
    TextStream: async (textStream: any) => {
      for await (const chunk of textStream) {
        if (chunk.text) {
          result += chunk.text;
          resultCount++;
        }
      }
    },
    default: (data: any) => {
      if (typeof data === 'string') {
        result += data;
        resultCount++;
      } else {
        console.log('Unprocessed response:', data);
      }
    },
  });
}

console.log(`Result count: ${resultCount}`);
console.log(`Result: ${result}`);
The response handler supports multiple response types:
  • string: Direct string responses containing transcribed text
  • TextStream: Streaming text responses for real-time transcription
  • default: Fallback handler for any other response types