Voice Profile analyzes vocal characteristics of the speaker alongside transcription. It returns structured classification data for Age, Emotion, Vocal Style, and Accent, each with confidence scores ranging from 0.0 to 1.0. Voice Profile is available across all STT models on the Inworld STT API. By understanding who is speaking and how they are speaking, applications can adapt responses, adjust tone, route conversations, or trigger context-sensitive behaviors in real time.

Use cases

  • Voice agents and NPCs — Adapt responses based on the speaker’s detected emotion or vocal style (e.g., respond empathetically to a sad tone).
  • Accessibility — Detect age category or vocal style to adjust UI, pacing, or interaction complexity.
  • Content moderation — Flag unusual vocal patterns (shouting, crying) for escalation or review.
  • Analytics and insights — Aggregate emotion and vocal style data across sessions for user experience analysis.
  • Localization — Use accent detection to dynamically select language models or localized content.

How it works

Voice Profile analysis runs automatically when inworldConfig.voiceProfileThreshold is set in your request (HTTP or WebSocket). The threshold controls which labels are returned: only labels with a confidence score at or above it are included in the response. Default: 0.5. Range: 0.0–1.0.
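The filtering behavior can be sketched in a few lines of Python. This is illustrative only; filter_labels is a hypothetical helper that mimics the server-side thresholding, not part of the API:

```python
def filter_labels(candidates, threshold=0.5):
    """Keep only labels whose confidence meets the threshold,
    ranked highest-confidence first (mirrors the API's filtering)."""
    kept = [c for c in candidates if c["confidence"] >= threshold]
    return sorted(kept, key=lambda c: c["confidence"], reverse=True)

emotions = [
    {"label": "tender", "confidence": 0.97},
    {"label": "sad", "confidence": 0.03},
]
print(filter_labels(emotions, threshold=0.5))
```

With the default threshold of 0.5, only the "tender" candidate survives; lowering the threshold to 0.01 would return both labels.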

Classification categories

Age

Estimates the speaker’s age category. Returns a single label with the highest confidence.
| Label | Description |
| --- | --- |
| young | Young adult / teenager |
| adult | Adult speaker |
| kid | Child speaker |
| old | Elderly speaker |
| unclear | Age could not be determined |

Emotion

Detects emotional tone in the speaker’s voice. Returns multiple labels ranked by confidence.
| Label | Description |
| --- | --- |
| tender | Soft, gentle, caring tone |
| sad | Sorrowful or melancholy tone |
| calm | Relaxed, even-tempered delivery |
| neutral | No strong emotional signal |
| happy | Cheerful, upbeat tone |
| angry | Frustrated, aggressive tone |
| fearful | Anxious or frightened tone |
| surprised | Startled or astonished tone |
| disgusted | Revulsion or strong disapproval |
| unclear | Emotion could not be determined |

Vocal Style

Identifies the speaker’s manner of delivery. Returns multiple labels ranked by confidence.
| Label | Description |
| --- | --- |
| whispering | Hushed, breathy delivery |
| normal | Standard conversational speech |
| singing | Melodic or musical delivery |
| mumbling | Unclear, low-articulation speech |
| crying | Speech accompanied by crying |
| laughing | Speech accompanied by laughter |
| shouting | Loud, raised-voice delivery |
| monotone | Flat, unvaried pitch delivery |
| unclear | Vocal style could not be determined |

Accent

Detects the speaker’s accent or regional dialect using BCP-47 locale codes. Returns a single label with the highest confidence, plus additional candidates ranked below.
| Label | Region |
| --- | --- |
| en-US | American English |
| en-GB | British English |
| en-AU | Australian English |
| zh-CN | Mandarin Chinese |
| fr-FR | French (France) |
| es-ES | Spanish (Spain) |
| es-419 | Spanish (Latin America) |
| es-MX | Spanish (Mexico) |
| ar-EG | Arabic (Egypt) |
Additional accent locales may be returned beyond those listed above. The model supports a broad range of BCP-47 codes.

Configuration

The STT API accepts both camelCase and snake_case field names (e.g., transcribeConfig / transcribe_config, voiceProfileThreshold / voice_profile_threshold). The examples below use camelCase. Set voiceProfileThreshold inside inworldConfig:
```json
{
  "transcribeConfig": {
    "modelId": "<MODEL_ID>",
    "language": "en-US",
    "audioEncoding": "MP3",
    "inworldConfig": {
      "voiceProfileThreshold": 0.5
    }
  }
}
```
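A minimal Python sketch of assembling this request body. The helper name is ours, and the POST details (endpoint URL, auth header, audio payload field) are intentionally left out; consult the Inworld STT API reference for those:

```python
import json

def build_transcribe_config(model_id, threshold=0.5, language="en-US",
                            audio_encoding="MP3"):
    """Build the transcribeConfig payload with Voice Profile enabled."""
    return {
        "transcribeConfig": {
            "modelId": model_id,
            "language": language,
            "audioEncoding": audio_encoding,
            "inworldConfig": {"voiceProfileThreshold": threshold},
        }
    }

body = build_transcribe_config("groq/whisper-large-v3", threshold=0.6)
print(json.dumps(body, indent=2))
# POST this body (plus your audio payload and auth header) to the
# synchronous STT endpoint documented in the API reference.
```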
Use any STT model that supports Voice Profiles (for example, groq/whisper-large-v3 for synchronous HTTP, or the assemblyai/... streaming models listed in the STT overview).

WebSocket (Streaming)

Include inworldConfig in the first WebSocket message:
```json
{
  "transcribeConfig": {
    "modelId": "<MODEL_ID>",
    "audioEncoding": "LINEAR16",
    "inworldConfig": {
      "voiceProfileThreshold": 0.5
    }
  }
}
```
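The streaming handshake can be sketched as a frame generator: the JSON config goes out first, followed by audio chunks. This is a hedged sketch, not the official client; whether audio travels as binary frames or base64-encoded JSON is an assumption to verify against the streaming docs:

```python
import json

def stream_frames(model_id, audio_chunks, threshold=0.5):
    """Yield WebSocket frames: the config message first, then audio."""
    yield json.dumps({
        "transcribeConfig": {
            "modelId": model_id,
            "audioEncoding": "LINEAR16",
            "inworldConfig": {"voiceProfileThreshold": threshold},
        }
    })
    # Assumption: raw LINEAR16 bytes in subsequent frames, e.g. 100 ms apiece.
    for chunk in audio_chunks:
        yield chunk

frames = list(stream_frames("<MODEL_ID>", [b"\x00\x01", b"\x02\x03"]))
```

Each frame would then be sent over the open socket in order, config first.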

Configuration parameters

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| voice_profile_threshold / voiceProfileThreshold | float | 0.5 | Minimum confidence score (0.0–1.0) for a label to be included in the response. Higher values return fewer, more confident labels. |

Response structure

The voiceProfile object is returned alongside transcription and usage in both sync and streaming responses. Each category contains a label and a confidence score.

Example response (sync)

```json
{
  "transcription": {
    "transcript": "Hey, I just wanted to check in on the delivery status.",
    "isFinal": true
  },
  "voiceProfile": {
    "age": { "label": "young", "confidence": 0.78 },
    "emotion": [
      { "label": "tender", "confidence": 0.97 },
      { "label": "sad", "confidence": 0.03 }
    ],
    "vocal_style": [
      { "label": "whispering", "confidence": 0.97 },
      { "label": "normal", "confidence": 0.03 }
    ],
    "accent": { "label": "en-US", "confidence": 0.48 }
  },
  "usage": {
    "transcribed_audio_ms": 3200,
    "model_id": "inworld/inworld-stt-1"
  }
}
```

Response fields

| Field | Type | Description |
| --- | --- | --- |
| voiceProfile.age | ClassLabel | Single label: estimated age category of the speaker. |
| voiceProfile.emotion | ClassLabel[] | Detected emotions, ranked by confidence. Multiple emotions may be present. |
| voiceProfile.vocal_style | ClassLabel[] | Detected vocal styles, ranked by confidence. Multiple styles may be present. |
| voiceProfile.accent | ClassLabel | Single label: detected accent as a BCP-47 locale code. |
Each ClassLabel contains:
  • label (string) — The predicted class name
  • confidence (float) — Score from 0.0 to 1.0
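Since any category may be absent from the response, client code should read the profile defensively. A sketch of one way to do that (top_label is an illustrative helper, not part of the API):

```python
def top_label(profile, category, default="unclear"):
    """Return the highest-confidence label for a category, tolerating
    missing fields, single-object categories (age, accent), and
    array categories (emotion, vocal_style)."""
    value = (profile or {}).get(category)
    if not value:
        return default
    if isinstance(value, list):  # emotion / vocal_style are arrays
        value = max(value, key=lambda c: c["confidence"])
    return value["label"]

profile = {
    "age": {"label": "young", "confidence": 0.78},
    "emotion": [
        {"label": "tender", "confidence": 0.97},
        {"label": "sad", "confidence": 0.03},
    ],
}
print(top_label(profile, "emotion"))      # prints: tender
print(top_label(profile, "vocal_style"))  # prints: unclear (category absent)
```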

Best practices

  • Start with the default threshold (0.5) — This filters out low-confidence noise while keeping useful labels. Lower the threshold if you need broader signal; raise it for precision-critical use cases.
  • Use emotion and vocal style together — Combining both categories gives a richer picture. A “tender” emotion with “whispering” vocal style tells a different story than “tender” with “normal” style.
  • Handle missing fields gracefully — Voice Profile fields may be absent if the model cannot make a confident classification or if the audio quality is insufficient. Always check for presence before accessing.
  • Accent is probabilistic — Accent detection returns the most likely locale, not a definitive answer. Use it as a signal rather than a hard routing decision.
  • Test with representative audio — Classification accuracy depends on audio quality, background noise, and speech duration. Test with samples that reflect your production environment.
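The "use emotion and vocal style together" practice can be made concrete with a small routing policy. The mapping below is entirely an illustrative application-side choice, not an API feature:

```python
def response_tone(emotions, styles, threshold=0.5):
    """Pick a reply tone from the top emotion / vocal-style pair.
    The label-to-tone mapping is an example policy only."""
    def top(labels):
        passing = [l for l in labels if l["confidence"] >= threshold]
        if not passing:
            return "unclear"
        return max(passing, key=lambda l: l["confidence"])["label"]

    emotion, style = top(emotions), top(styles)
    if emotion == "sad" or style == "crying":
        return "empathetic"
    if emotion == "angry" or style == "shouting":
        return "de-escalating"
    if emotion == "tender" and style == "whispering":
        return "gentle"
    return "neutral"

tone = response_tone(
    [{"label": "tender", "confidence": 0.97}],
    [{"label": "whispering", "confidence": 0.97}],
)
print(tone)  # prints: gentle
```

Note how "tender" plus "whispering" routes differently than "tender" alone would: the combined signal is what identifies the confiding, quiet register.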