This integration relies on the Inworld Router API’s LLM + TTS streaming response. Because LLM inference and TTS synthesis happen in one call, the first sentence of audio starts playing before the LLM has finished writing the full response, which significantly reduces time to first audio.
Prerequisites
- Node.js v20 or later
- ngrok account with a reserved static domain (the free tier is sufficient)
- Twilio account with a phone number that has Voice capability
- Inworld account with a Router API key
Setup
The steps below walk through the reference implementation in inworld-ai/inworld-api-examples.
1. Clone the example repo
2. Get your Inworld API key
Sign in to the Inworld Portal, open your workspace, and create an API key. The same key works for Router (LLM) and TTS because the integration uses the combined LLM + TTS endpoint.
3. Get a Twilio phone number
In the Twilio Console, buy a phone number with Voice capability. This is the number callers will dial.
4. Reserve an ngrok static domain (for local development)
Install ngrok and reserve a free static domain in the ngrok dashboard. A static domain matters here because Twilio’s webhook URL needs to stay stable between restarts. Without one, every new ngrok session changes the tunnel URL and you have to update the Twilio webhook by hand.
5. Configure environment
Copy the example env file and fill in the required variables:
6. Install and run
Install dependencies:
7. Point your Twilio number at the webhook
In the Twilio Console, go to Phone Numbers → Manage → Active Numbers → your number. Under Voice Configuration, set A call comes in to Webhook, enter https://your-ngrok-domain.ngrok-free.app/voice, and choose HTTP POST. Save the configuration.
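Taken together, the local-setup steps might look like this in a shell. This is a sketch, not the repo’s documented commands: the repo subdirectory, env variable names, npm script name, and port 3000 are all assumptions to check against the example’s README.

```shell
# Clone the example repo
git clone https://github.com/inworld-ai/inworld-api-examples.git
cd inworld-api-examples

# Configure environment (variable names are assumptions)
cp .env.example .env
#   INWORLD_API_KEY=...   key from the Inworld Portal
#   NGROK_DOMAIN=your-ngrok-domain.ngrok-free.app

# Install dependencies and start the server
npm install
npm start

# In a second terminal: tunnel the local port (3000 is an assumption)
# through your reserved static domain
ngrok http --domain=your-ngrok-domain.ngrok-free.app 3000
```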
ngrok is only needed for local development so Twilio can reach a server running on your machine. Once you deploy the server to production, update the Twilio webhook to point at your server’s public URL (for example, https://voice.yourdomain.com/voice) and you can drop ngrok entirely.
How it works
- An inbound call hits /voice, and the server returns TwiML that hands the call off to ConversationRelay.
- ConversationRelay handles the call audio and runs speech-to-text (Deepgram by default), then opens a WebSocket to your server and streams user transcripts as they arrive.
- For each user turn, the server calls the Inworld Router API’s chat completions endpoint with an audio block, which returns an SSE stream of both text deltas and base64-encoded PCM audio chunks. This is the LLM + TTS feature.
- Inworld’s chunking engine groups the response into text segments at natural sentence boundaries, and each segment’s audio may arrive as multiple base64-encoded PCM chunks. The server assembles the audio for each segment, wraps it in a WAV header, hosts it at a short-lived HTTP URL, and sends a play message to ConversationRelay so the caller hears the first sentence before the LLM has finished generating the rest.
- On barge-in, ConversationRelay sends an interrupt message and the server aborts the in-flight Inworld stream via AbortController. Partial responses already spoken are kept in conversation history so context is preserved.
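The per-segment audio handling above can be sketched in Node.js. This is a minimal illustration, not the reference implementation: the SSE field names (choices[0].delta.content and choices[0].delta.audio.data) and the 24 kHz, 16-bit mono PCM format are assumptions to verify against the LLM + TTS documentation.

```javascript
// Parse a chunk of SSE text from the Inworld stream into text deltas and
// raw PCM bytes. Field paths are assumptions for illustration.
function parseSseChunk(sseText) {
  const textDeltas = [];
  const pcmChunks = [];
  for (const line of sseText.split("\n")) {
    if (!line.startsWith("data: ") || line === "data: [DONE]") continue;
    const event = JSON.parse(line.slice("data: ".length));
    const delta = event.choices?.[0]?.delta ?? {};
    if (delta.content) textDeltas.push(delta.content);
    if (delta.audio?.data) pcmChunks.push(Buffer.from(delta.audio.data, "base64"));
  }
  return { text: textDeltas.join(""), pcm: Buffer.concat(pcmChunks) };
}

// Wrap raw 16-bit mono PCM in a minimal 44-byte RIFF/WAV header so the
// segment can be served over HTTP and referenced in a play message.
function pcmToWav(pcm, sampleRate = 24000) {
  const header = Buffer.alloc(44);
  header.write("RIFF", 0);
  header.writeUInt32LE(36 + pcm.length, 4);
  header.write("WAVE", 8);
  header.write("fmt ", 12);
  header.writeUInt32LE(16, 16);             // fmt chunk size
  header.writeUInt16LE(1, 20);              // audio format: PCM
  header.writeUInt16LE(1, 22);              // channels: mono
  header.writeUInt32LE(sampleRate, 24);     // sample rate
  header.writeUInt32LE(sampleRate * 2, 28); // byte rate (16-bit mono)
  header.writeUInt16LE(2, 32);              // block align
  header.writeUInt16LE(16, 34);             // bits per sample
  header.write("data", 36);
  header.writeUInt32LE(pcm.length, 40);
  return Buffer.concat([header, pcm]);
}
```

In the real server, the fetch that produces this SSE stream would be created with an AbortController signal so an interrupt message from ConversationRelay can cancel it mid-stream.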
The TwiML returned from /voice looks like this:
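A minimal sketch of that TwiML response — the WebSocket URL and the welcomeGreeting text are placeholders, and the reference implementation may set additional ConversationRelay attributes:

```xml
<Response>
  <Connect>
    <ConversationRelay
      url="wss://your-ngrok-domain.ngrok-free.app/ws"
      welcomeGreeting="Hi! How can I help you today?" />
  </Connect>
</Response>
```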
Test your integration
Call your Twilio number. The bot should greet you and hold a conversation.
Example implementation
Twilio ConversationRelay integration example
A complete Node.js reference implementation that bridges Twilio ConversationRelay to the Inworld Router API with LLM + TTS.
Further reading
LLM + TTS (Voice Responses)
How to request combined LLM text and Inworld TTS audio in a single streaming call.
ConversationRelay TwiML Reference
Twilio’s reference for configuring ConversationRelay, including voice hints, parameters, and language options.
ConversationRelay WebSocket Protocol
Full specification of the WebSocket messages exchanged between ConversationRelay and your server.
List TTS Voices
API reference for fetching available Inworld TTS voice IDs to use in the audio block.