Voice Profile analyzes vocal characteristics of the speaker alongside transcription. It returns structured classification data for Age, Emotion, Pitch, Vocal Style, and Accent, each with confidence scores ranging from 0.0 to 1.0. Voice Profile is available across all STT models on the Inworld STT API. By understanding who is speaking and how they are speaking, applications can adapt responses, adjust tone, route conversations, or trigger context-sensitive behaviors in real time.

Use cases

  • Voice agents and NPCs — Adapt responses based on the speaker’s detected emotion or vocal style (e.g., respond empathetically to a sad tone).
  • Accessibility — Detect age category or vocal style to adjust UI, pacing, or interaction complexity.
  • Content moderation — Flag unusual vocal patterns (shouting, crying) for escalation or review.
  • Analytics and insights — Aggregate emotion and vocal style data across sessions for user experience analysis.
  • Localization — Use accent detection to dynamically select language models or localized content.

How it works

Voice Profile analysis runs when enabled via voiceProfileConfig in your transcribeConfig request (HTTP or WebSocket). Set enableVoiceProfile to true to activate the feature. Optionally use topN to control how many top labels per category are returned.

Classification categories

Age

Estimates the speaker’s age category. Returns an array of labels sorted by descending confidence.
Label    Description
young    Young adult / teenager
adult    Adult speaker
kid      Child speaker
old      Elderly speaker
unclear  Age could not be determined

Emotion

Detects emotional tone in the speaker’s voice. Returns an array of labels sorted by descending confidence.
Label      Description
tender     Soft, gentle, caring tone
sad        Sorrowful or melancholy tone
calm       Relaxed, even-tempered delivery
neutral    No strong emotional signal
happy      Cheerful, upbeat tone
angry      Frustrated, aggressive tone
fearful    Anxious or frightened tone
surprised  Startled or astonished tone
disgusted  Revulsion or strong disapproval
unclear    Emotion could not be determined

Pitch

Classifies the speaker’s vocal pitch. Pitch shifts during a conversation can serve as a real-time emotional signal — a voice moving from lower to higher pitch can correlate with rising stress, excitement, or urgency, while a dropping pitch may indicate the speaker is becoming more withdrawn, tired, or deflated. Returns an array of labels sorted by descending confidence.
Label   Description
low     Low-pitched voice
medium  Medium-pitched voice
high    High-pitched voice
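
To illustrate the pitch-drift idea above, here is a minimal sketch that tracks pitch labels across successive results and flags a trend. The ordinal mapping and trend logic are our own illustration, not part of the API:

```python
# Illustrative only: map the API's pitch labels to ordinals and compare
# the first and last observations in a conversation.
PITCH_ORDER = {"low": 0, "medium": 1, "high": 2}

def pitch_trend(labels):
    """Given top pitch labels from successive results, return 'rising',
    'falling', or 'stable'."""
    ranks = [PITCH_ORDER[label] for label in labels if label in PITCH_ORDER]
    if len(ranks) < 2 or ranks[-1] == ranks[0]:
        return "stable"
    return "rising" if ranks[-1] > ranks[0] else "falling"
```

A rising trend (e.g. `["low", "medium", "high"]`) could then be surfaced as a stress or urgency signal alongside the emotion labels.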

Vocal Style

Identifies the speaker’s manner of delivery. Returns an array of labels sorted by descending confidence.
Label       Description
whispering  Hushed, breathy delivery
normal      Standard conversational speech
singing     Melodic or musical delivery
mumbling    Unclear, low-articulation speech
crying      Speech accompanied by crying
laughing    Speech accompanied by laughter
shouting    Loud, raised-voice delivery
monotone    Flat, unvaried pitch delivery
unclear     Vocal style could not be determined

Accent

Detects the speaker’s accent or regional dialect using BCP-47 locale codes. Returns an array of labels sorted by descending confidence.
Label   Region
en-US   American English
en-GB   British English
en-AU   Australian English
zh-CN   Mandarin Chinese
fr-FR   French (France)
es-ES   Spanish (Spain)
es-419  Spanish (Latin America)
es-MX   Spanish (Mexico)
ar-EG   Arabic (Egypt)
Additional accent locales may be returned beyond those listed above. The model supports a broad range of BCP-47 codes.

Configuration

The STT API accepts both camelCase and snake_case field names (e.g., transcribeConfig / transcribe_config, voiceProfileConfig / voice_profile_config). The examples below use camelCase. Set voiceProfileConfig inside transcribeConfig:
{
  "transcribeConfig": {
    "modelId": "<MODEL_ID>",
    "language": "en-US",
    "audioEncoding": "MP3",
    "voiceProfileConfig": {
      "enableVoiceProfile": true,
      "topN": 5
    }
  }
}
Use any STT model that supports Voice Profiles (for example, groq/whisper-large-v3 for synchronous HTTP, or the assemblyai/... streaming models listed in the STT overview).
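
A minimal Python sketch of a synchronous request follows. The request body matches the JSON above; the endpoint URL, auth header, and audio field name are placeholders and assumptions, so check the STT API reference for the real values:

```python
import json
import urllib.request

def build_transcribe_request(model_id, top_n=5):
    """Build the transcribeConfig body shown above (camelCase fields)."""
    return {
        "transcribeConfig": {
            "modelId": model_id,
            "language": "en-US",
            "audioEncoding": "MP3",
            "voiceProfileConfig": {
                "enableVoiceProfile": True,
                "topN": top_n,
            },
        }
    }

def transcribe(audio_b64, api_key, model_id="groq/whisper-large-v3"):
    """Send a sync transcription request. URL, auth scheme, and the
    'audio' field name are placeholders, not confirmed API details."""
    body = build_transcribe_request(model_id)
    body["audio"] = audio_b64  # assumed field name
    req = urllib.request.Request(
        "https://api.inworld.ai/stt/v1/transcribe",  # placeholder URL
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Basic {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```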

WebSocket (Streaming)

Include voiceProfileConfig in the first WebSocket message:
{
  "transcribeConfig": {
    "modelId": "<MODEL_ID>",
    "audioEncoding": "LINEAR16",
    "voiceProfileConfig": {
      "enableVoiceProfile": true,
      "topN": 5
    }
  }
}
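
The first streaming message can be serialized the same way; a small sketch (function name ours, subsequent messages carry the audio chunks):

```python
import json

def first_ws_message(model_id, top_n=5):
    """Serialize the first WebSocket message, matching the JSON above."""
    return json.dumps({
        "transcribeConfig": {
            "modelId": model_id,
            "audioEncoding": "LINEAR16",
            "voiceProfileConfig": {
                "enableVoiceProfile": True,
                "topN": top_n,
            },
        }
    })
```

Send this string as the first frame after the socket opens, then stream LINEAR16 audio in follow-up messages.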

Configuration parameters

Field                                      Type     Default  Description
enable_voice_profile / enableVoiceProfile  bool     —        Required. Set to true to enable Voice Profile analysis for the request or stream.
top_n / topN                               integer  10       Number of top labels from each classification category to return.

Response structure

The voiceProfile object is returned alongside transcription and usage in both sync and streaming responses. Each category is an array of { label, confidence } objects, sorted by descending confidence. The JSON below shows the normalized response shape (camelCase throughout); raw API payloads may use snake_case for the same fields (for example vocal_style, transcribed_audio_ms, model_id). In your own docs and client code, stick to one layer per example, either the raw API shape or the normalized shape, rather than mixing the two.
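
If you consume raw snake_case payloads, a small normalizer can map keys to the camelCase shape. This is our own sketch, not an SDK utility:

```python
import re

def to_camel(key):
    """Rename a snake_case key: vocal_style -> vocalStyle,
    transcribed_audio_ms -> transcribedAudioMs."""
    return re.sub(r"_([a-z0-9])", lambda m: m.group(1).upper(), key)

def normalize(obj):
    """Recursively rename snake_case dict keys in a raw payload to
    camelCase. Values (including locale labels like 'es-419') are
    left untouched."""
    if isinstance(obj, dict):
        return {to_camel(k): normalize(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [normalize(v) for v in obj]
    return obj
```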

Example response (sync, normalized shape)

{
  "transcription": {
    "transcript": "Hey, I just wanted to check in on the delivery status.",
    "isFinal": true
  },
  "voiceProfile": {
    "age": [
      { "label": "young", "confidence": 0.78 }
    ],
    "emotion": [
      { "label": "tender", "confidence": 0.97 },
      { "label": "sad", "confidence": 0.03 }
    ],
    "pitch": [
      { "label": "medium", "confidence": 0.85 }
    ],
    "vocalStyle": [
      { "label": "whispering", "confidence": 0.97 },
      { "label": "normal", "confidence": 0.03 }
    ],
    "accent": [
      { "label": "en-US", "confidence": 0.48 }
    ]
  },
  "usage": {
    "transcribedAudioMs": 3200,
    "modelId": "inworld/inworld-stt-1"
  }
}

Response fields

Field                    Type          Description
voiceProfile.age         ClassLabel[]  Array of detected age categories, sorted by descending confidence.
voiceProfile.emotion     ClassLabel[]  Array of detected emotions, sorted by descending confidence.
voiceProfile.pitch       ClassLabel[]  Array of detected pitch levels, sorted by descending confidence. Pitch drift across a conversation can signal changes in emotional state.
voiceProfile.vocalStyle  ClassLabel[]  Array of detected vocal styles, sorted by descending confidence.
voiceProfile.accent      ClassLabel[]  Array of detected accents as BCP-47 locale codes, sorted by descending confidence.
Each ClassLabel object contains:
  • label (string) — The predicted class name
  • confidence (float) — Score from 0.0 to 1.0
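
Because each category array is pre-sorted by confidence, reading the top prediction is a one-line lookup. A hedged helper (function name and fallback value ours):

```python
def top_label(voice_profile, category, default="unclear"):
    """Return the highest-confidence label for a category.
    Arrays are sorted by descending confidence, so take the first
    entry; fall back to a default when the category is absent."""
    entries = voice_profile.get(category) or []
    return entries[0]["label"] if entries else default
```

For example, against the sample response above, `top_label(voice_profile, "emotion")` yields `"tender"`.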

Best practices

  • Start with the default topN (10) — This returns up to 10 labels per category, sorted by descending confidence. Lower topN if you only need the most confident predictions; raise it if you need broader signal.
  • Use emotion and vocal style together — Combining both categories gives a richer picture. A “tender” emotion with “whispering” vocal style tells a different story than “tender” with “normal” style.
  • Handle missing fields gracefully — Voice Profile fields may be absent if the model cannot make a confident classification or if the audio quality is insufficient. Always check for presence before accessing.
  • Accent is probabilistic — Accent detection returns the most likely locale, not a definitive answer. Use it as a signal rather than a hard routing decision.
  • Test with representative audio — Classification accuracy depends on audio quality, background noise, and speech duration. Test with samples that reflect your production environment.
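
The "handle missing fields" and "accent is probabilistic" guidance above can be sketched as a single defensive accessor. The threshold value is illustrative and should be tuned for your application:

```python
CONFIDENCE_FLOOR = 0.6  # illustrative threshold, tune for your app

def confident_label(voice_profile, category, floor=CONFIDENCE_FLOOR):
    """Return the top label only when the category is present and its
    confidence clears the floor; otherwise None. Callers should treat
    None as 'no usable signal' rather than an error."""
    entries = (voice_profile or {}).get(category) or []
    if entries and entries[0]["confidence"] >= floor:
        return entries[0]["label"]
    return None
```

With the sample response above, the emotion passes (0.97) while the accent does not (0.48), so accent stays a soft signal instead of driving a hard routing decision.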