Generate audio from text input while managing multiple independent audio generation streams over a single WebSocket connection.
The independent audio streams each correspond to a context, identified by contextId, that maintains its own state. To use the API:

1. Create a context with the desired voice and model configuration.
2. Send text to the context. Text accumulates in the context's buffer until it is flushed, either explicitly with flush_context or automatically when the buffering thresholds are reached (maxBufferDelayMs and bufferCharThreshold in the context configuration).
3. Receive synthesized audio chunks; every result carries its contextId so you can match the audio to the request.
4. Update the context's configuration or close the context when it is no longer needed.

Authentication: your credentials are required when connecting. For Basic authentication, populate Basic $INWORLD_RUNTIME_BASE64_CREDENTIAL.
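For illustration only, the flow above might look like the following from a Node.js client. This is a minimal sketch assuming the ws package; the endpoint URL is a placeholder (the real URL is not given on this page), the credential is read from the environment variable named above, and the message bodies simply mirror the JSON examples that follow.

import WebSocket from "ws";

// Placeholder endpoint; substitute the actual WebSocket URL for your account.
const TTS_WS_URL = "wss://tts.example.invalid/v1/stream";
const credential = process.env.INWORLD_RUNTIME_BASE64_CREDENTIAL ?? "";

const ws = new WebSocket(TTS_WS_URL, {
  headers: { Authorization: `Basic ${credential}` },
});

ws.on("open", () => {
  // 1. Create a context (each connection supports at most 5).
  ws.send(JSON.stringify({
    create: {
      voiceId: "Dennis",
      modelId: "inworld-tts-1-max",
      bufferCharThreshold: 100,
    },
    contextId: "ctx-1",
  }));

  // 2. Send text; including flush_context in the same message starts synthesis
  //    without waiting for the buffering thresholds.
  ws.send(JSON.stringify({
    send_text: {
      text: "Hello, what a wonderful day to be a text-to-speech model!",
      flush_context: {},
    },
    contextId: "ctx-1",
  }));
});

ws.on("message", (data) => {
  // Every result is tagged with the contextId it belongs to.
  const { result } = JSON.parse(data.toString());
  if (result?.contextCreated) {
    console.log(`context ${result.contextId} created`);
  } else if (result?.audioChunk) {
    console.log(`context ${result.contextId}: received an audio chunk`);
  }
});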
"create": {
"voiceId": "Dennis",
"modelId": "inworld-tts-1-max",
"bufferCharThreshold": 100
},
"contextId": "ctx-1"
}{
"send_text": {
"text": "Hello, what a wonderful day to be a text-to-speech model!",
"flush_context": {}
},
"contextId": "ctx-1"
}{
"flush_context": {},
"contextId": "ctx-1"
}{
"update_config": {
"voiceId": "Ashley"
},
"contextId": "ctx-1"
}{
"close_context": {},
"contextId": "ctx-1"
}{
"result": {
"contextId": "ctx-1",
"contextCreated": {
"voiceId": "Dennis",
"audioConfig": {
"audioEncoding": "LINEAR16",
"sampleRateHertz": 16000
},
"modelId": "inworld-tts-1-max",
"timestampType": "WORD",
"maxBufferDelayMs": 3000
},
"status": {
"code": 0,
"message": "",
"details": []
}
}
}{
"result": {
"contextId": "ctx-1",
"audioChunk": {
"audioContent": "UklGRgSYAABXQVZFZm10IBAAAAABAAEAgD4AAAB9AAACABAAZGF0YeCX=",
"usage": {
"processedCharactersCount": 79,
"modelId": "inworld-tts-1"
},
"timestampInfo": {
"wordAlignment": {
"words": [
"Hello,",
"what",
"a",
"wonderful",
"day",
"to",
"be",
"a",
"text-to-speech",
"model."
],
"wordStartTimeSeconds": [
0.031,
0.375,
0.901,
1.002,
1.386,
1.548,
1.649,
1.771,
1.852,
2.58
],
"wordEndTimeSeconds": [
0.355,
0.86,
0.921,
1.326,
1.528,
1.609,
1.71,
1.791,
2.539,
2.802
]
}
},
"status": {
"code": 0,
"message": "",
"details": []
}
}
}
}{
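The audioContent field is base64-encoded audio in the format chosen at context creation (LINEAR16 at 16000 Hz in this example; the sample payload above begins with a RIFF/WAVE header), and the three word-alignment arrays are parallel: entry i of the start and end arrays describes words[i]. A hedged sketch of consuming one chunk in Node.js follows; the AudioChunkResult interface and handleAudioChunk function are illustrative names that only model the fields shown above, not an official type.

import { writeFileSync } from "node:fs";

// Shape of the audioChunk result shown above (only the fields used here).
interface AudioChunkResult {
  contextId: string;
  audioChunk: {
    audioContent: string; // base64-encoded audio
    timestampInfo?: {
      wordAlignment?: {
        words: string[];
        wordStartTimeSeconds: number[];
        wordEndTimeSeconds: number[];
      };
    };
  };
}

function handleAudioChunk(result: AudioChunkResult): void {
  // Decode the base64 payload; in this page's example the bytes already start
  // with a WAV ("RIFF") header, so a single chunk can be written out directly.
  const audioBytes = Buffer.from(result.audioChunk.audioContent, "base64");
  writeFileSync(`${result.contextId}.wav`, audioBytes);

  // The alignment arrays are index-aligned with the words array.
  const alignment = result.audioChunk.timestampInfo?.wordAlignment;
  if (alignment) {
    alignment.words.forEach((word, i) => {
      const start = alignment.wordStartTimeSeconds[i];
      const end = alignment.wordEndTimeSeconds[i];
      console.log(`${word}: ${start.toFixed(3)}s -> ${end.toFixed(3)}s`);
    });
  }
}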
"result": {
"contextId": "ctx-1",
"contextClosed": {},
"status": {
"code": 0,
"message": "",
"details": []
}
}
}{
"result": {
"contextId": "ctx-1",
"contextUpdated": {
"voiceId": "Ashley",
"audioConfig": {
"audioEncoding": "LINEAR16",
"sampleRateHertz": 16000
},
"modelId": "inworld-tts-1-max",
"temperature": 1,
"timestampType": "WORD",
"maxBufferDelayMs": 3000
},
"status": {
"code": 0,
"message": "",
"details": []
}
}
}Your authentication credentials. For Basic authentication, please populate Basic $INWORLD_RUNTIME_BASE64_CREDENTIAL
Create a new context with specified voice and configuration. Note: for each connection, 5 contexts is the max.
Send text to be synthesized for a specific context
Flush a context and start synthesis of all accumulated text
Update the configuration of an existing context. Note: sending an update message forces the synthesis of all accumulated text in the context's buffer with the previous configurations.
Close an existing context
Event sent when a new TTS context has been successfully created
Audio data chunk containing synthesized speech
Event sent when a context has been closed
Event sent when a context configuration has been successfully updated
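Because several contexts can share one connection and every result carries its contextId, a client can route the interleaved events to per-context handlers. A small sketch of that bookkeeping; the ContextHandler interface and routeResult function are illustrative and not part of the API surface.

// Illustrative per-context handler; not part of the API.
interface ContextHandler {
  onAudio(base64Audio: string): void;
  onClosed(): void;
}

const handlers = new Map<string, ContextHandler>();

// Call this for every message received on the WebSocket connection.
function routeResult(rawMessage: string): void {
  const { result } = JSON.parse(rawMessage);
  if (!result) return;

  const handler = handlers.get(result.contextId);
  if (!handler) return;

  if (result.audioChunk) {
    handler.onAudio(result.audioChunk.audioContent);
  } else if (result.contextClosed) {
    handler.onClosed();
    handlers.delete(result.contextId); // the context is gone; release its handler
  }
  // contextCreated and contextUpdated acknowledgements could be handled similarly.
}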