# Voice Profiles

> Analyze speaker vocal characteristics alongside transcription using Voice Profile on the Inworld STT API.

Voice Profile analyzes vocal characteristics of the speaker alongside transcription. It returns structured classification data for **Age**, **Emotion**, **Pitch**, **Vocal Style**, and **Accent**, each with confidence scores ranging from 0.0 to 1.0.

Voice Profile is available across all STT models on the Inworld STT API. By understanding *who* is speaking and *how* they are speaking, applications can adapt responses, adjust tone, route conversations, or trigger context-sensitive behaviors in real time.

## Use cases

* **Voice agents and NPCs** — Adapt responses based on the speaker's detected emotion or vocal style (e.g., respond empathetically to a sad tone).
* **Accessibility** — Detect age category or vocal style to adjust UI, pacing, or interaction complexity.
* **Content moderation** — Flag unusual vocal patterns (shouting, crying) for escalation or review.
* **Analytics and insights** — Aggregate emotion and vocal style data across sessions for user experience analysis.
* **Localization** — Use accent detection to dynamically select language models or localized content.

## How it works

Voice Profile analysis runs only when you enable it. Include `voiceProfileConfig` in the `transcribeConfig` of your request (HTTP or WebSocket) and set `enableVoiceProfile` to `true`. Optionally set `topN` to control how many of the top labels are returned per category (default `10`).

## Classification categories

### Age

Estimates the speaker's age category. Returns an array of labels sorted by descending confidence.

| **Label** | **Description**             |
| :-------- | :-------------------------- |
| `young`   | Young adult / teenager      |
| `adult`   | Adult speaker               |
| `kid`     | Child speaker               |
| `old`     | Elderly speaker             |
| `unclear` | Age could not be determined |

### Emotion

Detects emotional tone in the speaker's voice. Returns an array of labels sorted by descending confidence.

| **Label**   | **Description**                 |
| :---------- | :------------------------------ |
| `tender`    | Soft, gentle, caring tone       |
| `sad`       | Sorrowful or melancholy tone    |
| `calm`      | Relaxed, even-tempered delivery |
| `neutral`   | No strong emotional signal      |
| `happy`     | Cheerful, upbeat tone           |
| `angry`     | Frustrated, aggressive tone     |
| `fearful`   | Anxious or frightened tone      |
| `surprised` | Startled or astonished tone     |
| `disgusted` | Revulsion or strong disapproval |
| `unclear`   | Emotion could not be determined |

### Pitch

Classifies the speaker's vocal pitch. Returns an array of labels sorted by descending confidence.

Pitch shifts during a conversation can serve as a real-time emotional signal. A voice moving from lower to higher pitch can correlate with rising stress, excitement, or urgency, while a dropping pitch may indicate that the speaker is becoming more withdrawn, tired, or deflated.

| **Label** | **Description**      |
| :-------- | :------------------- |
| `low`     | Low-pitched voice    |
| `medium`  | Medium-pitched voice |
| `high`    | High-pitched voice   |
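
One way to act on pitch drift is to compare the top pitch label from earlier and later results in the same conversation. The sketch below is a minimal illustration, assuming you collect the `pitch` arrays from successive responses yourself. `PITCH_RANK` and `pitch_trend` are hypothetical names, and treating the three labels as an ordered scale is an assumption of this example, not documented API behavior.

```python
# Ordinal ranks for the three documented pitch labels (assumed ordering).
PITCH_RANK = {"low": 0, "medium": 1, "high": 2}


def pitch_trend(pitch_results):
    """Compare the top pitch label of the first and last usable results.

    `pitch_results` is a list of `pitch` arrays, one per response, each a
    list of {"label", "confidence"} dicts.
    Returns "rising", "falling", or "steady".
    """
    ranks = []
    for labels in pitch_results:
        if not labels:
            continue  # category missing or empty for this result
        top = max(labels, key=lambda item: item["confidence"])
        rank = PITCH_RANK.get(top["label"])
        if rank is not None:
            ranks.append(rank)

    if len(ranks) < 2 or ranks[-1] == ranks[0]:
        return "steady"
    return "rising" if ranks[-1] > ranks[0] else "falling"
```

A trend computed this way is coarse. It is probably best combined with the emotion and vocal style categories rather than used alone.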

### Vocal Style

Identifies the speaker's manner of delivery. Returns an array of labels sorted by descending confidence.

| **Label**    | **Description**                     |
| :----------- | :---------------------------------- |
| `whispering` | Hushed, breathy delivery            |
| `normal`     | Standard conversational speech      |
| `singing`    | Melodic or musical delivery         |
| `mumbling`   | Unclear, low-articulation speech    |
| `crying`     | Speech accompanied by crying        |
| `laughing`   | Speech accompanied by laughter      |
| `shouting`   | Loud, raised-voice delivery         |
| `monotone`   | Flat, unvaried pitch delivery       |
| `unclear`    | Vocal style could not be determined |

### Accent

Detects the speaker's accent or regional dialect using BCP-47 locale codes. Returns an array of labels sorted by descending confidence.

| **Label** | **Region**              |
| :-------- | :---------------------- |
| `en-US`   | American English        |
| `en-GB`   | British English         |
| `en-AU`   | Australian English      |
| `zh-CN`   | Mandarin Chinese        |
| `fr-FR`   | French (France)         |
| `es-ES`   | Spanish (Spain)         |
| `es-419`  | Spanish (Latin America) |
| `es-MX`   | Spanish (Mexico)        |
| `ar-EG`   | Arabic (Egypt)          |

<Note>
  Additional accent locales may be returned beyond those listed above. The model supports a broad range of BCP-47 codes.
</Note>
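
Because unlisted locales may appear, client code may want to fall back from a full tag such as `es-MX` to its base language. The following sketch assumes only the two-part `language-region` form shown in the table. `split_accent` and `pick_localized` are hypothetical helpers, and longer BCP-47 tags (scripts, variants) would need a proper parser.

```python
def split_accent(tag):
    """Split a tag such as 'es-419' into (language, region).

    Only the two-part form is handled; the region is None when absent.
    """
    language, _, region = tag.partition("-")
    return language.lower(), (region.upper() or None)


def pick_localized(tag, catalog, default):
    """Look up content by full tag first, then by base language."""
    language, _ = split_accent(tag)
    if tag in catalog:
        return catalog[tag]
    return catalog.get(language, default)


# Illustrative catalog; the keys and values are made up for this example.
greetings = {"en-GB": "Hello there", "es": "Hola"}
print(pick_localized("es-MX", greetings, "Hello"))  # Hola
```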

## Configuration

The STT API accepts both camelCase and snake\_case field names (e.g., `transcribeConfig` / `transcribe_config`, `voiceProfileConfig` / `voice_profile_config`). The examples below use camelCase.

Set `voiceProfileConfig` inside `transcribeConfig`:

```json theme={"system"}
{
  "transcribeConfig": {
    "modelId": "<MODEL_ID>",
    "language": "en-US",
    "audioEncoding": "MP3",
    "voiceProfileConfig": {
      "enableVoiceProfile": true,
      "topN": 5
    }
  }
}
```

Replace `<MODEL_ID>` with the ID of an STT model, for example `groq/whisper-large-v3` for synchronous HTTP, or one of the `assemblyai/...` streaming models listed in the STT overview.
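
As a rough sketch of how a client might assemble the request body above, the hypothetical helper below builds the same camelCase structure in Python. The function name and its parameters are illustrative and not part of any Inworld SDK; only the field names come from the JSON example. `topN` is left out when not set, on the assumption that the service then applies its default.

```python
import json


def build_transcribe_request(model_id, audio_encoding, language=None, top_n=None):
    """Build a request body with Voice Profile enabled (hypothetical helper)."""
    voice_profile_config = {"enableVoiceProfile": True}
    if top_n is not None:
        voice_profile_config["topN"] = top_n

    transcribe_config = {
        "modelId": model_id,
        "audioEncoding": audio_encoding,
        "voiceProfileConfig": voice_profile_config,
    }
    if language is not None:
        transcribe_config["language"] = language

    return {"transcribeConfig": transcribe_config}


# "<MODEL_ID>" is a placeholder, exactly as in the JSON example.
body = build_transcribe_request("<MODEL_ID>", "MP3", language="en-US", top_n=5)
print(json.dumps(body, indent=2))
```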

### WebSocket (Streaming)

Include `voiceProfileConfig` in the first WebSocket message:

```json theme={"system"}
{
  "transcribeConfig": {
    "modelId": "<MODEL_ID>",
    "audioEncoding": "LINEAR16",
    "voiceProfileConfig": {
      "enableVoiceProfile": true,
      "topN": 5
    }
  }
}
```

### Configuration parameters

| **Field**                                     | **Type** | **Default** | **Description**                                                                         |
| :-------------------------------------------- | :------- | :---------- | :-------------------------------------------------------------------------------------- |
| `enable_voice_profile` / `enableVoiceProfile` | bool     | —           | **Required.** Set to `true` to enable Voice Profile analysis for the request or stream. |
| `top_n` / `topN`                              | integer  | `10`        | Number of top labels from each classification category to return.                       |
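
A client-side check along the lines of the table above might look like the sketch below. It accepts either naming style and fills in the documented default for `topN`. The helper name `read_voice_profile_config` is hypothetical, and the checks reflect only what the table states; how the service itself validates these fields is not described here.

```python
DEFAULT_TOP_N = 10  # default listed in the parameter table


def read_voice_profile_config(config):
    """Return (enabled, top_n) from a dict that uses either naming style."""

    def pick(camel, snake, default=None):
        if camel in config:
            return config[camel]
        return config.get(snake, default)

    enabled = pick("enableVoiceProfile", "enable_voice_profile")
    if not isinstance(enabled, bool):
        raise ValueError("enableVoiceProfile is required and must be a boolean")

    top_n = pick("topN", "top_n", DEFAULT_TOP_N)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(top_n, bool) or not isinstance(top_n, int):
        raise ValueError("topN must be an integer")

    return enabled, top_n
```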

## Response structure

The `voiceProfile` object is returned alongside `transcription` and `usage` in both sync and streaming responses. Each category is an array of `{ label, confidence }` objects, sorted by descending confidence.

The JSON below shows the **normalized** response shape (camelCase throughout). Raw API payloads may use `snake_case` for the same fields (for example `vocal_style`, `transcribed_audio_ms`, `model_id`). In your own documentation and client code, use one shape consistently within any given example, either the raw API shape or the normalized shape, rather than mixing the two.

### Example response (sync, normalized shape)

```json theme={"system"}
{
  "transcription": {
    "transcript": "Hey, I just wanted to check in on the delivery status.",
    "isFinal": true
  },
  "voiceProfile": {
    "age": [
      { "label": "young", "confidence": 0.78 }
    ],
    "emotion": [
      { "label": "tender", "confidence": 0.97 },
      { "label": "sad", "confidence": 0.03 }
    ],
    "pitch": [
      { "label": "medium", "confidence": 0.85 }
    ],
    "vocalStyle": [
      { "label": "whispering", "confidence": 0.97 },
      { "label": "normal", "confidence": 0.03 }
    ],
    "accent": [
      { "label": "en-US", "confidence": 0.48 }
    ]
  },
  "usage": {
    "transcribedAudioMs": 3200,
    "modelId": "inworld/inworld-stt-1"
  }
}
```
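
If your client receives `snake_case` keys, one possible way to obtain the normalized shape shown above is to rename keys recursively. This is a minimal sketch: `snake_to_camel` and `normalize_keys` are hypothetical helpers, and the sample payload is invented for illustration using the field names mentioned earlier.

```python
def snake_to_camel(name):
    """Convert a snake_case key such as 'vocal_style' to 'vocalStyle'."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def normalize_keys(value):
    """Recursively rename dict keys to camelCase; values are left unchanged."""
    if isinstance(value, dict):
        return {snake_to_camel(k): normalize_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [normalize_keys(item) for item in value]
    return value


# Invented sample; only the key names follow this page.
raw = {
    "vocal_style": [{"label": "normal", "confidence": 0.9}],
    "usage": {"transcribed_audio_ms": 3200, "model_id": "<MODEL_ID>"},
}
print(normalize_keys(raw))
```

Keys that are already camelCase pass through unchanged, so the same function can be applied regardless of which shape arrives.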

### Response fields

| **Field**                 | **Type**      | **Description**                                                                                                                           |
| :------------------------ | :------------ | :---------------------------------------------------------------------------------------------------------------------------------------- |
| `voiceProfile.age`        | ClassLabel\[] | Array of detected age categories, sorted by descending confidence.                                                                        |
| `voiceProfile.emotion`    | ClassLabel\[] | Array of detected emotions, sorted by descending confidence.                                                                              |
| `voiceProfile.pitch`      | ClassLabel\[] | Array of detected pitch levels, sorted by descending confidence. Pitch drift across a conversation can signal changes in emotional state. |
| `voiceProfile.vocalStyle` | ClassLabel\[] | Array of detected vocal styles, sorted by descending confidence.                                                                          |
| `voiceProfile.accent`     | ClassLabel\[] | Array of detected accents as BCP-47 locale codes, sorted by descending confidence.                                                        |

Each `ClassLabel` object contains:

* **label** (string) — The predicted class name
* **confidence** (float) — Score from 0.0 to 1.0
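
In typed client code, the `ClassLabel` shape could be mirrored with a small data class. The sketch below is one possible representation, not an official SDK type. `parse_labels` is a hypothetical helper that also re-sorts defensively, although the API already returns labels in descending order of confidence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassLabel:
    label: str
    confidence: float


def parse_labels(items):
    """Convert a category array into ClassLabel objects, highest confidence first."""
    labels = [
        ClassLabel(str(item["label"]), float(item["confidence"]))
        for item in (items or [])
    ]
    return sorted(labels, key=lambda c: c.confidence, reverse=True)


emotion = parse_labels([
    {"label": "sad", "confidence": 0.03},
    {"label": "tender", "confidence": 0.97},
])
print(emotion[0].label)  # tender
```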

## Best practices

* **Start with the default `topN` (10)** — This returns up to 10 labels per category, sorted by descending confidence. Lower `topN` if you only need the most confident predictions; raise it if you need broader signal.
* **Use emotion and vocal style together** — Combining both categories gives a richer picture. A "tender" emotion with "whispering" vocal style tells a different story than "tender" with "normal" style.
* **Handle missing fields gracefully** — Voice Profile fields may be absent if the model cannot make a confident classification or if the audio quality is insufficient. Always check for presence before accessing.
* **Accent is probabilistic** — Accent detection returns the most likely locale, not a definitive answer. Use it as a signal rather than a hard routing decision.
* **Test with representative audio** — Classification accuracy depends on audio quality, background noise, and speech duration. Test with samples that reflect your production environment.
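
Several of these practices can be sketched together: checking for presence before access, combining emotion with vocal style, and treating accent as a thresholded signal. The helper names below are hypothetical, and the `0.6` threshold is an arbitrary value chosen for illustration that you would tune against your own audio.

```python
def top_label(voice_profile, category, min_confidence=0.0):
    """Return the most confident label for a category, or None.

    None is returned when the profile or category is absent or empty, or
    when the best score falls below `min_confidence`.
    """
    labels = (voice_profile or {}).get(category) or []
    if not labels:
        return None
    best = max(labels, key=lambda item: item.get("confidence", 0.0))
    if best.get("confidence", 0.0) < min_confidence:
        return None
    return best.get("label")


def describe_delivery(voice_profile):
    """Combine emotion and vocal style into one coarse description."""
    emotion = top_label(voice_profile, "emotion")
    style = top_label(voice_profile, "vocalStyle")
    if emotion is None and style is None:
        return None
    return f"{emotion or 'unclear'}/{style or 'unclear'}"


# Values taken from the example response on this page.
profile = {
    "emotion": [{"label": "tender", "confidence": 0.97}],
    "vocalStyle": [{"label": "whispering", "confidence": 0.97}],
    "accent": [{"label": "en-US", "confidence": 0.48}],
}
print(describe_delivery(profile))        # tender/whispering
print(top_label(profile, "accent", 0.6))  # None
```

Returning `None` rather than raising keeps call sites simple when a category is missing, at the cost of requiring callers to handle the empty case explicitly.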
