Inworld Router provides moderation endpoints that classify text against safety categories. Use them to screen user input before sending it to an LLM, filter model output before displaying it to users, or moderate content in batch pipelines.
| Endpoint | Input | OAI SDK compatible | Use case |
|---|---|---|---|
| /v1/moderations | String or array of strings | Schema-compatible | Moderate standalone text |
| /v1/chat/moderations | Chat messages | No | Moderate a conversation with configurable scope |
Both endpoints return the same classification structure: OpenAI-compatible categories plus AILuminate safety signals.
category_scores values are returned as integers (e.g., 0) rather than floats (e.g., 0.0). If your code expects floats, cast accordingly.
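If your pipeline compares scores as floats, a minimal normalization helper might look like the sketch below (normalize_scores is a hypothetical convenience, not part of the API):

```python
def normalize_scores(category_scores: dict) -> dict:
    """Cast every category score to float, since whole-number scores
    may arrive as integers (e.g., 0 instead of 0.0)."""
    return {category: float(score) for category, score in category_scores.items()}

# Example: 0 becomes 0.0, 0.42 stays 0.42
print(normalize_scores({"violence": 0, "harassment": 0.42}))
```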
## Quickstart
The /v1/moderations endpoint works directly with the OpenAI SDK — just change the base URL:
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inworld.ai/v1",
    api_key="YOUR_INWORLD_API_KEY",
)

response = client.moderations.create(input="Hello world!")
print(response.results[0].flagged)
```
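To go beyond the boolean flag, you can inspect per-category results on the same response. This sketch assumes the SDK's standard moderation models, whose Pydantic model_dump(by_alias=True) restores slash-separated category names such as sexual/minors:

```python
result = response.results[0]

# Individual category flags and scores are available as attributes:
print(result.categories.harassment)
print(result.category_scores.harassment)

# Collect every flagged category and its score in one pass:
flags = result.categories.model_dump(by_alias=True)
scores = result.category_scores.model_dump(by_alias=True)
flagged = {name: scores[name] for name, hit in flags.items() if hit}
print(flagged)
```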
## Conversation moderation
To moderate messages in a chat conversation, use /v1/chat/moderations. The scope parameter controls which messages are evaluated:
| scope value | Behavior |
|---|---|
| "last" (default) | Classify only the last message |
| "all" | Classify every message in the conversation |
| N (positive integer) | Classify the last N messages |
```bash
curl -X POST https://api.inworld.ai/v1/chat/moderations \
  -H "Authorization: Bearer $INWORLD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello!"},
      {"role": "assistant", "content": "Hi there!"},
      {"role": "user", "content": "Tell me something"}
    ],
    "scope": 2
  }'
```
Setting scope to "all" or a large number increases response latency because more content needs to be processed. For real-time applications, prefer the default ("last") and only broaden scope when you need full-conversation safety checks.
/v1/chat/moderations is not compatible with the OpenAI SDK — it accepts messages instead of strings and returns a single result object instead of a results array. Use /v1/moderations for SDK compatibility.
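For Python callers, here is a hedged equivalent of the curl request above using the requests library; the flagged field on the returned object is an assumption that mirrors the /v1/moderations result shape described on this page:

```python
import os

import requests

# /v1/chat/moderations is not OpenAI-SDK compatible, so call it directly.
resp = requests.post(
    "https://api.inworld.ai/v1/chat/moderations",
    headers={"Authorization": f"Bearer {os.environ['INWORLD_API_KEY']}"},
    json={
        "messages": [
            {"role": "user", "content": "Hello!"},
            {"role": "assistant", "content": "Hi there!"},
            {"role": "user", "content": "Tell me something"},
        ],
        "scope": 2,  # classify only the last two messages
    },
    timeout=10,
)
resp.raise_for_status()
result = resp.json()      # a single result object, not a results array
print(result["flagged"])  # assumed field, mirroring /v1/moderations results
```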
## Content categories
Each result includes boolean flags and numeric confidence scores for 13 categories:
| Category | Description |
|---|---|
| sexual | Sexual content |
| sexual/minors | Sexual content involving minors |
| harassment | Harassing language toward any target |
| harassment/threatening | Harassment that includes violence or serious harm |
| hate | Hate speech based on protected characteristics |
| hate/threatening | Hate speech that includes violence or serious harm |
| illicit | Content advising or describing illicit acts |
| illicit/violent | Illicit content involving violence or weapons |
| self-harm | Content promoting or depicting self-harm |
| self-harm/intent | Expressed intent to engage in self-harm |
| self-harm/instructions | Instructions for committing self-harm |
| violence | Content depicting violence toward a person |
| violence/graphic | Graphic depictions of death, violence, or injury |
The flagged field is true when any category exceeds the default threshold. Use category_scores (0–1 confidence values) to set custom thresholds for your application.
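For example, here is a sketch of custom thresholding over category_scores; the threshold values and the fallback are illustrative choices for this example, not API defaults:

```python
# Illustrative per-category thresholds; tune these for your application.
THRESHOLDS = {
    "sexual/minors": 0.01,  # near-zero tolerance
    "harassment": 0.7,      # more permissive than the default flag
}
FALLBACK_THRESHOLD = 0.5    # assumed fallback for unlisted categories

def custom_flag(category_scores: dict) -> bool:
    """Flag content when any category score meets its custom threshold."""
    return any(
        float(score) >= THRESHOLDS.get(category, FALLBACK_THRESHOLD)
        for category, score in category_scores.items()
    )
```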
## AILuminate
Both endpoints include an ailuminate object with safety classifications based on the AILuminate benchmark by MLCommons, providing more granular signals beyond the standard OpenAI categories.
| Field | Type | Description |
|---|---|---|
| safety | string | Overall assessment: "safe", "unsafe", or "controversial" |
| categories | object | 12 fine-grained safety categories |
| extensions | object | Additional signals: politically_sensitive, unethical_acts, jailbreak |
| refusal | boolean | Whether the content represents a refusal to comply |
The safety field classifies content into three levels. "safe" content is benign. "unsafe" content is clearly harmful and always sets flagged: true. "controversial" content falls in between — it may touch sensitive topics without being explicitly harmful. By default, "controversial" content is treated as safe and does not set flagged: true. For stricter moderation, treat "controversial" the same as "unsafe".
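A minimal sketch of a stricter policy, assuming the result has been parsed into a dict with the flagged and ailuminate fields described above:

```python
def is_allowed(result: dict, strict: bool = False) -> bool:
    """Reject unsafe content; in strict mode, also reject "controversial"."""
    if result["flagged"]:
        return False
    if strict and result["ailuminate"]["safety"] == "controversial":
        return False
    return True
```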
AILuminate categories: violent_crimes, sex_related_crimes, child_sexual_exploitation, suicide_self_harm, indiscriminate_weapons, intellectual_property, defamation, non_violent_crimes, hate, specialized_advice, privacy, sexual_content
The jailbreak extension is particularly useful for detecting prompt injection attempts before they reach your LLM.
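For instance, a hedged pre-LLM gate that checks the jailbreak extension alongside the overall flag (the extension is assumed here to be a boolean signal):

```python
def should_block_prompt(result: dict) -> bool:
    """Block a user prompt before it reaches the LLM."""
    extensions = result["ailuminate"]["extensions"]
    # jailbreak is assumed to be boolean; adapt if the API returns a score.
    return result["flagged"] or bool(extensions.get("jailbreak"))
```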
## Best practices
- Screen both inputs and outputs. Run moderation on user prompts before sending them to the model and on model responses before displaying them to users.
- Use category_scores for custom thresholds. The flagged boolean uses default thresholds. For your application, you may want stricter thresholds for certain categories (e.g., sexual/minors) and more permissive ones for others.
- Use scope: "last" for real-time chat. Only broaden to "all" or N when you need full-conversation safety audits and can tolerate higher latency.
- Batch text inputs. When moderating multiple pieces of content, pass an array to /v1/moderations instead of making separate requests; see the sketch after this list.
- Combine with other safety layers. Moderation should be one part of your safety strategy alongside system prompts, output filtering, and human review.
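A batching sketch that reuses the client from the quickstart; the OpenAI SDK accepts a list of strings as input:

```python
# One request for several texts instead of one request per text.
texts = ["First comment", "Second comment", "Third comment"]
response = client.moderations.create(input=texts)

# Results come back in the same order as the inputs.
for text, result in zip(texts, response.results):
    print(f"{text!r}: flagged={result.flagged}")
```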