Inworld Router provides moderation endpoints that classify text against safety categories. Use them to screen user input before sending it to an LLM, filter model output before displaying it to users, or moderate content in batch pipelines.
Endpoint | Input | OpenAI SDK compatible | Use case
/v1/moderations | String or array of strings | Schema-compatible | Moderate standalone text
/v1/chat/moderations | Chat messages | No | Moderate a conversation with configurable scope
Both endpoints return the same classification structure: OpenAI-compatible categories plus AILuminate safety signals.
category_scores values are returned as integers (e.g., 0) rather than floats (e.g., 0.0). If your code expects floats, cast accordingly.
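A minimal sketch of that cast, assuming result is one parsed result dict with a category_scores mapping:
# Normalize every score to a float before comparing against thresholds
scores = {category: float(value) for category, value in result["category_scores"].items()}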

Quickstart

The /v1/moderations endpoint works directly with the OpenAI SDK — just change the base URL:
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inworld.ai/v1",
    api_key="YOUR_INWORLD_API_KEY"
)

response = client.moderations.create(input="Hello world!")
print(response.results[0].flagged)
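The response mirrors the OpenAI SDK's moderation objects; a short sketch of inspecting one result (same client and response as above):
result = response.results[0]
if result.flagged:
    # Per-category boolean flags and 0-1 confidence scores
    print(result.categories)
    print(result.category_scores)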

Conversation moderation

To moderate messages in a chat conversation, use /v1/chat/moderations. The scope parameter controls which messages are evaluated:
scope value | Behavior
"last" (default) | Classify only the last message
"all" | Classify every message in the conversation
N (positive integer) | Classify the last N messages
curl -X POST https://api.inworld.ai/v1/chat/moderations \
  -H "Authorization: Bearer $INWORLD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello!"},
      {"role": "assistant", "content": "Hi there!"},
      {"role": "user", "content": "Tell me something"}
    ],
    "scope": 2
  }'
Setting scope to "all" or a large number increases response latency because more content needs to be processed. For real-time applications, prefer the default ("last") and only broaden scope when you need full-conversation safety checks.
/v1/chat/moderations is not compatible with the OpenAI SDK — it accepts messages instead of strings and returns a single result object instead of a results array. Use /v1/moderations for SDK compatibility.
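A minimal sketch of calling it over plain HTTP with the requests library, assuming the single result object exposes the same flagged field as /v1/moderations results:
import os
import requests

payload = {
    "messages": [
        {"role": "user", "content": "Hello!"},
        {"role": "assistant", "content": "Hi there!"},
        {"role": "user", "content": "Tell me something"},
    ],
    "scope": 2,  # classify only the last two messages
}

response = requests.post(
    "https://api.inworld.ai/v1/chat/moderations",
    headers={"Authorization": f"Bearer {os.environ['INWORLD_API_KEY']}"},
    json=payload,
    timeout=10,
)
response.raise_for_status()
result = response.json()
print(result["flagged"])  # assumed field name, matching /v1/moderations results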

Content categories

Each result includes boolean flags and numeric confidence scores for 13 categories:
Category | Description
sexual | Sexual content
sexual/minors | Sexual content involving minors
harassment | Harassing language toward any target
harassment/threatening | Harassment that includes violence or serious harm
hate | Hate speech based on protected characteristics
hate/threatening | Hate speech that includes violence or serious harm
illicit | Content advising or describing illicit acts
illicit/violent | Illicit content involving violence or weapons
self-harm | Content promoting or depicting self-harm
self-harm/intent | Expressed intent to engage in self-harm
self-harm/instructions | Instructions for committing self-harm
violence | Content depicting violence toward a person
violence/graphic | Graphic depictions of death, violence, or injury
The flagged field is true when any category exceeds the default threshold. Use category_scores (0–1 confidence values) to set custom thresholds for your application.
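A minimal sketch of custom thresholds, assuming category_scores has been converted to a plain dict of floats; the threshold values here are illustrative, not recommendations:
DEFAULT_THRESHOLD = 0.5
STRICT_THRESHOLDS = {"sexual/minors": 0.1, "self-harm/intent": 0.2}  # illustrative values

def custom_flagged(category_scores: dict[str, float]) -> bool:
    # Flag when any category meets its (possibly stricter) threshold
    return any(
        score >= STRICT_THRESHOLDS.get(category, DEFAULT_THRESHOLD)
        for category, score in category_scores.items()
    )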

AILuminate

Both endpoints include an ailuminate object with safety classifications based on the AILuminate benchmark by MLCommons, providing more granular signals beyond the standard OpenAI categories.
Field | Type | Description
safety | string | Overall assessment: "safe", "unsafe", or "controversial"
categories | object | 12 fine-grained safety categories
extensions | object | Additional signals: politically_sensitive, unethical_acts, jailbreak
refusal | boolean | Whether the content represents a refusal to comply
The safety field classifies content into three levels. "safe" content is benign. "unsafe" content is clearly harmful and always sets flagged: true. "controversial" content falls in between: it may touch sensitive topics without being explicitly harmful. By default, "controversial" content is treated as safe and does not set flagged: true. For stricter moderation, treat "controversial" the same as "unsafe".
AILuminate categories: violent_crimes, sex_related_crimes, child_sexual_exploitation, suicide_self_harm, indiscriminate_weapons, intellectual_property, defamation, non_violent_crimes, hate, specialized_advice, privacy, sexual_content
The jailbreak extension is particularly useful for detecting prompt injection attempts before they reach your LLM.
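A sketch of that stricter policy combined with the jailbreak signal, assuming a parsed result dict; the exact field paths (result["ailuminate"]["safety"] and ["extensions"]["jailbreak"]) are an assumption about the response shape:
def strict_block(result: dict) -> bool:
    # Block on the standard flag, on anything AILuminate does not call "safe",
    # or when a jailbreak / prompt injection attempt is detected
    ailuminate = result.get("ailuminate", {})
    return (
        result.get("flagged", False)
        or ailuminate.get("safety") in ("unsafe", "controversial")
        or bool(ailuminate.get("extensions", {}).get("jailbreak"))
    )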

Best practices

  • Screen both inputs and outputs. Run moderation on user prompts before sending them to the model and on model responses before displaying to users.
  • Use category_scores for custom thresholds. The flagged boolean uses default thresholds. For your application, you may want stricter thresholds for certain categories (e.g., sexual/minors) and more permissive ones for others.
  • Use scope: "last" for real-time chat. Only broaden to "all" or N when you need full-conversation safety audits and can tolerate higher latency.
  • Batch text inputs. When moderating multiple pieces of content, pass an array to /v1/moderations instead of making separate requests (see the sketch after this list).
  • Combine with other safety layers. Moderation should be one part of your safety strategy alongside system prompts, output filtering, and human review.
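Per the batching point above, a minimal sketch using the quickstart client; results matching input order is an assumption:
texts = ["first comment", "second comment", "third comment"]
response = client.moderations.create(input=texts)
for text, result in zip(texts, response.results):
    # One result per input string
    print(text, "->", result.flagged)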