Voice Profile analyzes the speaker's vocal characteristics alongside transcription. It returns structured classification data for Age, Emotion, Pitch, Vocal Style, and Accent, each label carrying a confidence score from 0.0 to 1.0.
Voice Profile is available across all STT models on the Inworld STT API. By understanding who is speaking and how they are speaking, applications can adapt responses, adjust tone, route conversations, or trigger context-sensitive behaviors in real time.
Use cases
- Voice agents and NPCs — Adapt responses based on the speaker’s detected emotion or vocal style (e.g., respond empathetically to a sad tone).
- Accessibility — Detect age category or vocal style to adjust UI, pacing, or interaction complexity.
- Content moderation — Flag unusual vocal patterns (shouting, crying) for escalation or review.
- Analytics and insights — Aggregate emotion and vocal style data across sessions for user experience analysis.
- Localization — Use accent detection to dynamically select language models or localized content.
How it works
Voice Profile analysis runs when enabled via voiceProfileConfig in your transcribeConfig request (HTTP or WebSocket). Set enableVoiceProfile to true to activate the feature. Optionally use topN to control how many top labels per category are returned.
Classification categories
Age
Estimates the speaker’s age category. Returns an array of labels sorted by descending confidence.
| Label | Description |
|---|---|
| young | Young adult / teenager |
| adult | Adult speaker |
| kid | Child speaker |
| old | Elderly speaker |
| unclear | Age could not be determined |
Emotion
Detects emotional tone in the speaker’s voice. Returns an array of labels sorted by descending confidence.
| Label | Description |
|---|---|
| tender | Soft, gentle, caring tone |
| sad | Sorrowful or melancholy tone |
| calm | Relaxed, even-tempered delivery |
| neutral | No strong emotional signal |
| happy | Cheerful, upbeat tone |
| angry | Frustrated, aggressive tone |
| fearful | Anxious or frightened tone |
| surprised | Startled or astonished tone |
| disgusted | Revulsion or strong disapproval |
| unclear | Emotion could not be determined |
Pitch
Classifies the speaker’s vocal pitch. Pitch shifts during a conversation can serve as a real-time emotional signal — a voice moving from lower to higher pitch can correlate with rising stress, excitement, or urgency, while a dropping pitch may indicate the speaker is becoming more withdrawn, tired, or deflated. Returns an array of labels sorted by descending confidence.
| Label | Description |
|---|---|
| low | Low-pitched voice |
| medium | Medium-pitched voice |
| high | High-pitched voice |
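The pitch-drift signal described above can be tracked with a small helper that maps the three labels to ordinal values and compares them over time. This is an illustrative sketch: the `PitchTracker` class and the ordinal mapping are not part of the API.

```python
# Map pitch labels to ordinal values so drift can be measured numerically.
# PitchTracker is an illustrative helper, not part of the Inworld API.
PITCH_LEVELS = {"low": 0, "medium": 1, "high": 2}

class PitchTracker:
    def __init__(self):
        self.history = []

    def observe(self, pitch_labels):
        """Record the top pitch label from one Voice Profile result."""
        if not pitch_labels:  # pitch may be absent; skip gracefully
            return
        top = max(pitch_labels, key=lambda c: c["confidence"])
        self.history.append(PITCH_LEVELS[top["label"]])

    def drift(self):
        """Positive = pitch rising (stress, excitement); negative = falling."""
        if len(self.history) < 2:
            return 0
        return self.history[-1] - self.history[0]

tracker = PitchTracker()
tracker.observe([{"label": "low", "confidence": 0.9}])
tracker.observe([{"label": "high", "confidence": 0.7}])
print(tracker.drift())  # → 2 (pitch rose across the conversation)
```

A production version might compare a rolling window rather than the first and last observations, but the idea is the same: pitch movement, not just the absolute label, carries signal.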
Vocal Style
Identifies the speaker’s manner of delivery. Returns an array of labels sorted by descending confidence.
| Label | Description |
|---|---|
| whispering | Hushed, breathy delivery |
| normal | Standard conversational speech |
| singing | Melodic or musical delivery |
| mumbling | Unclear, low-articulation speech |
| crying | Speech accompanied by crying |
| laughing | Speech accompanied by laughter |
| shouting | Loud, raised-voice delivery |
| monotone | Flat, unvaried pitch delivery |
| unclear | Vocal style could not be determined |
Accent
Detects the speaker’s accent or regional dialect using BCP-47 locale codes. Returns an array of labels sorted by descending confidence.
| Label | Region |
|---|---|
| en-US | American English |
| en-GB | British English |
| en-AU | Australian English |
| zh-CN | Mandarin Chinese |
| fr-FR | French (France) |
| es-ES | Spanish (Spain) |
| es-419 | Spanish (Latin America) |
| es-MX | Spanish (Mexico) |
| ar-EG | Arabic (Egypt) |
Additional accent locales may be returned beyond those listed above. The model supports a broad range of BCP-47 codes.
Configuration
The STT API accepts both camelCase and snake_case field names (e.g., transcribeConfig / transcribe_config, voiceProfileConfig / voice_profile_config). The examples below use camelCase.
HTTP (Synchronous)
Set voiceProfileConfig inside transcribeConfig:
```json
{
  "transcribeConfig": {
    "modelId": "<MODEL_ID>",
    "language": "en-US",
    "audioEncoding": "MP3",
    "voiceProfileConfig": {
      "enableVoiceProfile": true,
      "topN": 5
    }
  }
}
```
Use any STT model that supports Voice Profiles (for example, groq/whisper-large-v3 for synchronous HTTP, or the assemblyai/... streaming models listed in the STT overview).
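A request body with Voice Profile enabled can be assembled programmatically before sending. The `build_transcribe_request` helper below is illustrative, and the audio-carrying fields of the full request are omitted; only the transcribeConfig portion shown above is built.

```python
def build_transcribe_request(model_id, language="en-US",
                             audio_encoding="MP3", top_n=5):
    """Build the transcribeConfig portion of a sync STT request
    with Voice Profile analysis enabled."""
    return {
        "transcribeConfig": {
            "modelId": model_id,
            "language": language,
            "audioEncoding": audio_encoding,
            "voiceProfileConfig": {
                "enableVoiceProfile": True,
                "topN": top_n,
            },
        }
    }

body = build_transcribe_request("groq/whisper-large-v3")
print(body["transcribeConfig"]["voiceProfileConfig"]["enableVoiceProfile"])  # → True
```

The same dict can be serialized with `json.dumps` and sent as the HTTP request body, or merged into the first WebSocket message for streaming.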
WebSocket (Streaming)
Include voiceProfileConfig in the first WebSocket message:
```json
{
  "transcribeConfig": {
    "modelId": "<MODEL_ID>",
    "audioEncoding": "LINEAR16",
    "voiceProfileConfig": {
      "enableVoiceProfile": true,
      "topN": 5
    }
  }
}
```
Configuration parameters
| Field | Type | Default | Description |
|---|---|---|---|
| enable_voice_profile / enableVoiceProfile | bool | — | Required. Set to true to enable Voice Profile analysis for the request or stream. |
| top_n / topN | integer | 10 | Number of top labels from each classification category to return. |
Response structure
The voiceProfile object is returned alongside transcription and usage in both sync and streaming responses. Each category is an array of { label, confidence } objects, sorted by descending confidence.
The JSON below shows the normalized response shape (camelCase throughout). Raw API payloads may use snake_case for the same fields (for example vocal_style, transcribed_audio_ms, model_id). Prefer representing one layer per example in your own docs and client code — either the raw API shape or the normalized shape — not a mix of both.
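Since raw payloads may arrive in snake_case, one option is to normalize keys to the camelCase shape once, at the boundary, before client code reads them. This recursive helper is a sketch, not part of any SDK:

```python
def to_camel(key):
    """snake_case -> camelCase; keys without underscores pass through."""
    head, *rest = key.split("_")
    return head + "".join(part.capitalize() for part in rest)

def normalize(obj):
    """Recursively camelCase all dict keys in a decoded JSON payload.
    Values (including BCP-47 labels like "en-US") are left untouched."""
    if isinstance(obj, dict):
        return {to_camel(k): normalize(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [normalize(v) for v in obj]
    return obj

raw = {"voice_profile": {"vocal_style": [{"label": "normal", "confidence": 0.9}]}}
print(normalize(raw))
# → {'voiceProfile': {'vocalStyle': [{'label': 'normal', 'confidence': 0.9}]}}
```

Normalizing at the boundary keeps the rest of the codebase working against a single shape, which is exactly the "one layer per example" discipline recommended above.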
Example response (sync, normalized shape)
```json
{
  "transcription": {
    "transcript": "Hey, I just wanted to check in on the delivery status.",
    "isFinal": true
  },
  "voiceProfile": {
    "age": [
      { "label": "young", "confidence": 0.78 }
    ],
    "emotion": [
      { "label": "tender", "confidence": 0.97 },
      { "label": "sad", "confidence": 0.03 }
    ],
    "pitch": [
      { "label": "medium", "confidence": 0.85 }
    ],
    "vocalStyle": [
      { "label": "whispering", "confidence": 0.97 },
      { "label": "normal", "confidence": 0.03 }
    ],
    "accent": [
      { "label": "en-US", "confidence": 0.48 }
    ]
  },
  "usage": {
    "transcribedAudioMs": 3200,
    "modelId": "inworld/inworld-stt-1"
  }
}
```
Response fields
| Field | Type | Description |
|---|---|---|
| voiceProfile.age | ClassLabel[] | Array of detected age categories, sorted by descending confidence. |
| voiceProfile.emotion | ClassLabel[] | Array of detected emotions, sorted by descending confidence. |
| voiceProfile.pitch | ClassLabel[] | Array of detected pitch levels, sorted by descending confidence. Pitch drift across a conversation can signal changes in emotional state. |
| voiceProfile.vocalStyle | ClassLabel[] | Array of detected vocal styles, sorted by descending confidence. |
| voiceProfile.accent | ClassLabel[] | Array of detected accents as BCP-47 locale codes, sorted by descending confidence. |
Each ClassLabel object contains:
- label (string) — The predicted class name
- confidence (float) — Score from 0.0 to 1.0
Best practices
- Start with the default topN (10) — This returns up to 10 labels per category, sorted by descending confidence. Lower topN if you only need the most confident predictions; raise it if you need broader signal.
- Use emotion and vocal style together — Combining both categories gives a richer picture. A “tender” emotion with “whispering” vocal style tells a different story than “tender” with “normal” style.
- Handle missing fields gracefully — Voice Profile fields may be absent if the model cannot make a confident classification or if the audio quality is insufficient. Always check for presence before accessing.
- Accent is probabilistic — Accent detection returns the most likely locale, not a definitive answer. Use it as a signal rather than a hard routing decision.
- Test with representative audio — Classification accuracy depends on audio quality, background noise, and speech duration. Test with samples that reflect your production environment.
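The "use emotion and vocal style together" and "handle missing fields gracefully" practices can be combined into a single routing sketch. The strategy names, label combinations, and fallback behavior here are illustrative application choices, not API behavior:

```python
def route_response(voice_profile):
    """Pick a response strategy from emotion and vocal style, falling
    back to 'default' when both signals are missing or unclear."""
    def top(category):
        # Treat a missing profile, missing category, or empty array
        # the same way: as an "unclear" signal.
        classes = (voice_profile or {}).get(category) or []
        return classes[0]["label"] if classes else "unclear"

    emotion, style = top("emotion"), top("vocalStyle")
    if emotion in ("sad", "fearful") or style == "crying":
        return "empathetic"
    if style == "shouting" or emotion == "angry":
        return "de-escalate"
    if emotion == "unclear" and style == "unclear":
        return "default"
    return "neutral"

print(route_response({"emotion": [{"label": "sad", "confidence": 0.9}]}))  # → 'empathetic'
print(route_response(None))  # → 'default'
```

Because the helper never indexes into a field without checking it first, the same code path handles full profiles, partial profiles, and responses where voiceProfile is absent entirely.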