Inworld Router provides moderation endpoints that classify text against safety categories. Use them to screen user input before sending it to an LLM, filter model output before displaying it to users, or moderate content in batch pipelines.
Endpoint | Input | OpenAI SDK compatible | Use case
/v1/moderations | String or array of strings | Schema-compatible | Moderate standalone text
/v1/chat/moderations | Chat messages | No | Moderate a conversation with configurable scope
Both endpoints return the same classification structure: OpenAI-compatible categories plus AILuminate safety signals.
category_scores values are returned as integers (e.g., 0) rather than floats (e.g., 0.0). If your code expects floats, cast accordingly.
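A minimal sketch of that cast, assuming result is one parsed result dict with a category_scores mapping:
# Normalize every score to a float before comparing against thresholds
scores = {category: float(value) for category, value in result["category_scores"].items()}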

Quickstart

The /v1/moderations endpoint works directly with the OpenAI SDK — just change the base URL:
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inworld.ai/v1",
    api_key="YOUR_INWORLD_API_KEY"
)

response = client.moderations.create(input="Hello world!")
print(response.results[0].flagged)
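The response mirrors the OpenAI SDK's moderation objects; a short sketch of inspecting one result (same client and response as above):
result = response.results[0]
if result.flagged:
    # Per-category boolean flags and 0-1 confidence scores
    print(result.categories)
    print(result.category_scores)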

Conversation moderation

To moderate messages in a chat conversation, use /v1/chat/moderations. The scope parameter controls which messages are evaluated:
scope value | Behavior
"last" (default) | Classify only the last message
"all" | Classify every message in the conversation
N (positive integer) | Classify the last N messages
curl -X POST https://api.inworld.ai/v1/chat/moderations \
  -H "Authorization: Bearer $INWORLD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello!"},
      {"role": "assistant", "content": "Hi there!"},
      {"role": "user", "content": "Tell me something"}
    ],
    "scope": 2
  }'
Setting scope to "all" or a large number increases response latency because more content needs to be processed. For real-time applications, prefer the default ("last") and only broaden scope when you need full-conversation safety checks.
/v1/chat/moderations is not compatible with the OpenAI SDK — it accepts messages instead of strings and returns a single result object instead of a results array. Use /v1/moderations for SDK compatibility.
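A minimal sketch of calling it over plain HTTP with the requests library, assuming the single result object exposes the same flagged field as /v1/moderations results:
import os
import requests

payload = {
    "messages": [
        {"role": "user", "content": "Hello!"},
        {"role": "assistant", "content": "Hi there!"},
        {"role": "user", "content": "Tell me something"},
    ],
    "scope": 2,  # classify only the last two messages
}

response = requests.post(
    "https://api.inworld.ai/v1/chat/moderations",
    headers={"Authorization": f"Bearer {os.environ['INWORLD_API_KEY']}"},
    json=payload,
    timeout=10,
)
response.raise_for_status()
result = response.json()
print(result["flagged"])  # assumed field name, matching /v1/moderations results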

Content categories

Each result includes boolean flags and numeric confidence scores for 13 categories:
Category | Description
sexual | Sexual content
sexual/minors | Sexual content involving minors
harassment | Harassing language toward any target
harassment/threatening | Harassment that includes violence or serious harm
hate | Hate speech based on protected characteristics
hate/threatening | Hate speech that includes violence or serious harm
illicit | Content advising or describing illicit acts
illicit/violent | Illicit content involving violence or weapons
self-harm | Content promoting or depicting self-harm
self-harm/intent | Expressed intent to engage in self-harm
self-harm/instructions | Instructions for committing self-harm
violence | Content depicting violence toward a person
violence/graphic | Graphic depictions of death, violence, or injury
The flagged field is true when any category exceeds the default threshold. Use category_scores (0–1 confidence values) to set custom thresholds for your application.
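A minimal sketch of custom thresholds, assuming category_scores has been converted to a plain dict of floats; the threshold values here are illustrative, not recommendations:
DEFAULT_THRESHOLD = 0.5
STRICT_THRESHOLDS = {"sexual/minors": 0.1, "self-harm/intent": 0.2}  # illustrative values

def custom_flagged(category_scores: dict[str, float]) -> bool:
    # Flag when any category meets its (possibly stricter) threshold
    return any(
        score >= STRICT_THRESHOLDS.get(category, DEFAULT_THRESHOLD)
        for category, score in category_scores.items()
    )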

AILuminate

Both endpoints include an ailuminate object with safety classifications based on the AILuminate benchmark by MLCommons, providing more granular signals beyond the standard OpenAI categories.
Field | Type | Description
safety | string | Overall assessment: "safe", "unsafe", or "controversial"
categories | object | 12 fine-grained safety categories
extensions | object | Additional signals: politically_sensitive, unethical_acts, jailbreak
refusal | boolean | Whether the content represents a refusal to comply
The safety field classifies content into three levels. "safe" content is benign. "unsafe" content is clearly harmful and always sets flagged: true. "controversial" content falls in between: it may touch sensitive topics without being explicitly harmful. By default, "controversial" content is treated as safe and does not set flagged: true. For stricter moderation, treat "controversial" the same as "unsafe".
AILuminate categories: violent_crimes, sex_related_crimes, child_sexual_exploitation, suicide_self_harm, indiscriminate_weapons, intellectual_property, defamation, non_violent_crimes, hate, specialized_advice, privacy, sexual_content
The jailbreak extension is particularly useful for detecting prompt injection attempts before they reach your LLM.
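A sketch of that stricter policy combined with the jailbreak signal, assuming a parsed result dict; the exact field paths (result["ailuminate"]["safety"] and ["extensions"]["jailbreak"]) are an assumption about the response shape:
def strict_block(result: dict) -> bool:
    # Block on the standard flag, on anything AILuminate does not call "safe",
    # or when a jailbreak / prompt injection attempt is detected
    ailuminate = result.get("ailuminate", {})
    return (
        result.get("flagged", False)
        or ailuminate.get("safety") in ("unsafe", "controversial")
        or bool(ailuminate.get("extensions", {}).get("jailbreak"))
    )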

Best practices

  • Screen both inputs and outputs. Run moderation on user prompts before sending them to the model and on model responses before displaying to users.
  • Use category_scores for custom thresholds. The flagged boolean uses default thresholds. For your application, you may want stricter thresholds for certain categories (e.g., sexual/minors) and more permissive ones for others.
  • Use scope: "last" for real-time chat. Only broaden to "all" or N when you need full-conversation safety audits and can tolerate higher latency.
  • Batch text inputs. When moderating multiple pieces of content, pass an array to /v1/moderations instead of making separate requests (see the sketch after this list).
  • Combine with other safety layers. Moderation should be one part of your safety strategy alongside system prompts, output filtering, and human review.
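Per the batching point above, a minimal sketch using the quickstart client; results matching input order is an assumption:
texts = ["first comment", "second comment", "third comment"]
response = client.moderations.create(input=texts)
for text, result in zip(texts, response.results):
    # One result per input string
    print(text, "->", result.flagged)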