Voice Profile analyzes vocal characteristics of the speaker alongside transcription. It returns structured classification data for Age, Emotion, Vocal Style, and Accent, each with confidence scores ranging from 0.0 to 1.0. Voice Profile is available across all STT models on the Inworld STT API. By understanding who is speaking and how they are speaking, applications can adapt responses, adjust tone, route conversations, or trigger context-sensitive behaviors in real time.

Use cases

  • Voice agents and NPCs — Adapt responses based on the speaker’s detected emotion or vocal style (e.g., respond empathetically to a sad tone).
  • Accessibility — Detect age category or vocal style to adjust UI, pacing, or interaction complexity.
  • Content moderation — Flag unusual vocal patterns (shouting, crying) for escalation or review.
  • Analytics and insights — Aggregate emotion and vocal style data across sessions for user experience analysis.
  • Localization — Use accent detection to dynamically select language models or localized content.

How it works

Voice Profile analysis runs automatically when inworldConfig.voiceProfileThreshold is set in your request (HTTP or WebSocket). The threshold controls which labels are returned: only labels with a confidence score at or above it are included in the response. Default: 0.5. Range: 0.0–1.0.
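The filtering behavior can be sketched in a few lines of Python. This is illustrative only; filter_labels is a hypothetical helper that mimics the server-side thresholding, not part of the API:

```python
def filter_labels(candidates, threshold=0.5):
    """Keep only labels whose confidence meets the threshold,
    ranked highest-confidence first (mirrors the API's filtering)."""
    kept = [c for c in candidates if c["confidence"] >= threshold]
    return sorted(kept, key=lambda c: c["confidence"], reverse=True)

emotions = [
    {"label": "tender", "confidence": 0.97},
    {"label": "sad", "confidence": 0.03},
]
print(filter_labels(emotions, threshold=0.5))
```

With the default threshold of 0.5, only the "tender" candidate survives; lowering the threshold to 0.01 would return both labels.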

Classification categories

Age

Estimates the speaker’s age category. Returns a single label with the highest confidence.
| Label | Description |
| --- | --- |
| young | Young adult / teenager |
| adult | Adult speaker |
| kid | Child speaker |
| old | Elderly speaker |
| unclear | Age could not be determined |

Emotion

Detects emotional tone in the speaker’s voice. Returns multiple labels ranked by confidence.
| Label | Description |
| --- | --- |
| tender | Soft, gentle, caring tone |
| sad | Sorrowful or melancholy tone |
| calm | Relaxed, even-tempered delivery |
| neutral | No strong emotional signal |
| happy | Cheerful, upbeat tone |
| angry | Frustrated, aggressive tone |
| fearful | Anxious or frightened tone |
| surprised | Startled or astonished tone |
| disgusted | Revulsion or strong disapproval |
| unclear | Emotion could not be determined |

Vocal Style

Identifies the speaker’s manner of delivery. Returns multiple labels ranked by confidence.
| Label | Description |
| --- | --- |
| whispering | Hushed, breathy delivery |
| normal | Standard conversational speech |
| singing | Melodic or musical delivery |
| mumbling | Unclear, low-articulation speech |
| crying | Speech accompanied by crying |
| laughing | Speech accompanied by laughter |
| shouting | Loud, raised-voice delivery |
| monotone | Flat, unvaried pitch delivery |
| unclear | Vocal style could not be determined |

Accent

Detects the speaker’s accent or regional dialect using BCP-47 locale codes. Returns a single label with the highest confidence, plus additional candidates ranked below.
| Label | Region |
| --- | --- |
| en-US | American English |
| en-GB | British English |
| en-AU | Australian English |
| zh-CN | Mandarin Chinese |
| fr-FR | French (France) |
| es-ES | Spanish (Spain) |
| es-419 | Spanish (Latin America) |
| es-MX | Spanish (Mexico) |
| ar-EG | Arabic (Egypt) |
Additional accent locales may be returned beyond those listed above. The model supports a broad range of BCP-47 codes.

Configuration

The STT API accepts both camelCase and snake_case field names (e.g., transcribeConfig / transcribe_config, voiceProfileThreshold / voice_profile_threshold). The examples below use camelCase. Set voiceProfileThreshold inside inworldConfig:
```json
{
  "transcribeConfig": {
    "modelId": "<MODEL_ID>",
    "language": "en-US",
    "audioEncoding": "MP3",
    "inworldConfig": {
      "voiceProfileThreshold": 0.5
    }
  }
}
```
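A minimal Python sketch of assembling this request body. The helper name is ours, and the POST details (endpoint URL, auth header, audio payload field) are intentionally left out; consult the Inworld STT API reference for those:

```python
import json

def build_transcribe_config(model_id, threshold=0.5, language="en-US",
                            audio_encoding="MP3"):
    """Build the transcribeConfig payload with Voice Profile enabled."""
    return {
        "transcribeConfig": {
            "modelId": model_id,
            "language": language,
            "audioEncoding": audio_encoding,
            "inworldConfig": {"voiceProfileThreshold": threshold},
        }
    }

body = build_transcribe_config("groq/whisper-large-v3", threshold=0.6)
print(json.dumps(body, indent=2))
# POST this body (plus your audio payload and auth header) to the
# synchronous STT endpoint documented in the API reference.
```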
Use any STT model that supports Voice Profiles (for example, groq/whisper-large-v3 for synchronous HTTP, or the assemblyai/... streaming models listed in the STT overview).

WebSocket (Streaming)

Include inworldConfig in the first WebSocket message:
```json
{
  "transcribeConfig": {
    "modelId": "<MODEL_ID>",
    "audioEncoding": "LINEAR16",
    "inworldConfig": {
      "voiceProfileThreshold": 0.5
    }
  }
}
```
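The streaming handshake can be sketched as a frame generator: the JSON config goes out first, followed by audio chunks. This is a hedged sketch, not the official client; whether audio travels as binary frames or base64-encoded JSON is an assumption to verify against the streaming docs:

```python
import json

def stream_frames(model_id, audio_chunks, threshold=0.5):
    """Yield WebSocket frames: the config message first, then audio."""
    yield json.dumps({
        "transcribeConfig": {
            "modelId": model_id,
            "audioEncoding": "LINEAR16",
            "inworldConfig": {"voiceProfileThreshold": threshold},
        }
    })
    # Assumption: raw LINEAR16 bytes in subsequent frames, e.g. 100 ms apiece.
    for chunk in audio_chunks:
        yield chunk

frames = list(stream_frames("<MODEL_ID>", [b"\x00\x01", b"\x02\x03"]))
```

Each frame would then be sent over the open socket in order, config first.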

Configuration parameters

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| voice_profile_threshold / voiceProfileThreshold | float | 0.5 | Minimum confidence score (0.0–1.0) for a label to be included in the response. Higher values return fewer, more confident labels. |

Response structure

The voiceProfile object is returned alongside transcription and usage in both sync and streaming responses. Each category contains a label and a confidence score.

Example response (sync)

```json
{
  "transcription": {
    "transcript": "Hey, I just wanted to check in on the delivery status.",
    "isFinal": true
  },
  "voiceProfile": {
    "age": { "label": "young", "confidence": 0.78 },
    "emotion": [
      { "label": "tender", "confidence": 0.97 },
      { "label": "sad", "confidence": 0.03 }
    ],
    "vocal_style": [
      { "label": "whispering", "confidence": 0.97 },
      { "label": "normal", "confidence": 0.03 }
    ],
    "accent": { "label": "en-US", "confidence": 0.48 }
  },
  "usage": {
    "transcribed_audio_ms": 3200,
    "model_id": "inworld/inworld-stt-1"
  }
}
```

Response fields

| Field | Type | Description |
| --- | --- | --- |
| voiceProfile.age | ClassLabel | Single label: estimated age category of the speaker. |
| voiceProfile.emotion | ClassLabel[] | Detected emotions, ranked by confidence. Multiple emotions may be present. |
| voiceProfile.vocal_style | ClassLabel[] | Detected vocal styles, ranked by confidence. Multiple styles may be present. |
| voiceProfile.accent | ClassLabel | Single label: detected accent as a BCP-47 locale code. |
Each ClassLabel contains:
  • label (string) — The predicted class name
  • confidence (float) — Score from 0.0 to 1.0
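Since any category may be absent from the response, client code should read the profile defensively. A sketch of one way to do that (top_label is an illustrative helper, not part of the API):

```python
def top_label(profile, category, default="unclear"):
    """Return the highest-confidence label for a category, tolerating
    missing fields, single-object categories (age, accent), and
    array categories (emotion, vocal_style)."""
    value = (profile or {}).get(category)
    if not value:
        return default
    if isinstance(value, list):  # emotion / vocal_style are arrays
        value = max(value, key=lambda c: c["confidence"])
    return value["label"]

profile = {
    "age": {"label": "young", "confidence": 0.78},
    "emotion": [
        {"label": "tender", "confidence": 0.97},
        {"label": "sad", "confidence": 0.03},
    ],
}
print(top_label(profile, "emotion"))      # prints: tender
print(top_label(profile, "vocal_style"))  # prints: unclear (category absent)
```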

Best practices

  • Start with the default threshold (0.5) — This filters out low-confidence noise while keeping useful labels. Lower the threshold if you need broader signal; raise it for precision-critical use cases.
  • Use emotion and vocal style together — Combining both categories gives a richer picture. A “tender” emotion with “whispering” vocal style tells a different story than “tender” with “normal” style.
  • Handle missing fields gracefully — Voice Profile fields may be absent if the model cannot make a confident classification or if the audio quality is insufficient. Always check for presence before accessing.
  • Accent is probabilistic — Accent detection returns the most likely locale, not a definitive answer. Use it as a signal rather than a hard routing decision.
  • Test with representative audio — Classification accuracy depends on audio quality, background noise, and speech duration. Test with samples that reflect your production environment.
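The "use emotion and vocal style together" practice can be made concrete with a small routing policy. The mapping below is entirely an illustrative application-side choice, not an API feature:

```python
def response_tone(emotions, styles, threshold=0.5):
    """Pick a reply tone from the top emotion / vocal-style pair.
    The label-to-tone mapping is an example policy only."""
    def top(labels):
        passing = [l for l in labels if l["confidence"] >= threshold]
        if not passing:
            return "unclear"
        return max(passing, key=lambda l: l["confidence"])["label"]

    emotion, style = top(emotions), top(styles)
    if emotion == "sad" or style == "crying":
        return "empathetic"
    if emotion == "angry" or style == "shouting":
        return "de-escalating"
    if emotion == "tender" and style == "whispering":
        return "gentle"
    return "neutral"

tone = response_tone(
    [{"label": "tender", "confidence": 0.97}],
    [{"label": "whispering", "confidence": 0.97}],
)
print(tone)  # prints: gentle
```

Note how "tender" plus "whispering" routes differently than "tender" alone would: the combined signal is what identifies the confiding, quiet register.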