
# Prompt compression

> Reduce token usage by compressing long messages before sending them to the LLM

Prompt compression automatically shortens long messages before they reach the LLM. This reduces the number of input tokens, and with it the cost of each request, without meaningfully changing response quality.

Compression works best on long system prompts, detailed instructions, and context-heavy messages. Messages shorter than 250 tokens are skipped automatically.

<Warning>
  Compression may produce unexpected results on structured content such as JSON, XML, or code. Avoid compressing messages that contain structured data the model needs to parse exactly.
</Warning>
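
One way to follow this advice on the client side is to attach `compression` only to messages that look like plain prose. The sketch below is a rough heuristic, not part of the API: the helper names `looks_structured` and `with_compression` are hypothetical, and the check only catches messages that are entirely JSON, contain a Markdown code fence, or contain a matched XML-style tag pair. JSON embedded in the middle of prose is not detected.

```python
import json
import re

FENCE = "`" * 3  # Markdown code fence marker
# Matches an opening tag followed later by its matching closing tag.
TAG_PAIR = re.compile(r"<([A-Za-z][\w.-]*)[^>]*>.*?</\1>", re.DOTALL)


def looks_structured(content: str) -> bool:
    """Heuristic: True if the content appears to be JSON, XML, or fenced code."""
    text = content.strip()
    if FENCE in text or TAG_PAIR.search(text):
        return True
    try:
        # Only whole-message JSON objects or arrays count as structured.
        return isinstance(json.loads(text), (dict, list))
    except ValueError:
        return False


def with_compression(message: dict, aggressiveness: float = 0.5) -> dict:
    """Return a copy with a `compression` field, unless the content looks structured."""
    if looks_structured(message["content"]):
        return message
    return {**message, "compression": {"aggressiveness": aggressiveness}}
```

A message whose content is `{"a": 1}` is returned unchanged, while a prose system prompt gains `"compression": {"aggressiveness": 0.5}`.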

## Router-level compression

Set `compression` on a router variant to compress all message templates in that variant. This is the most common setup: configure it once, and every request routed through the variant benefits from compression.

```json Router with compression theme={"system"}
{
  "name": "routers/my-router",
  "display_name": "Compressed prompts",
  "default_route": {
    "route_id": "default",
    "variants": [{
      "weight": 100,
      "variant": {
        "variant_id": "v1",
        "model_id": "openai/gpt-4o",
        "compression": {
          "aggressiveness": 0.7
        },
        "message_templates": [
          {
            "role": "system",
            "content": "You are a helpful assistant with deep expertise in world history, geopolitics, and international relations. Your primary responsibility is to provide detailed, engaging answers..."
          }
        ]
      }
    }]
  }
}
```

| Parameter        | Type    | Range   | Default | Description                                                                           |
| ---------------- | ------- | ------- | ------- | ------------------------------------------------------------------------------------- |
| `aggressiveness` | `float` | 0.0–1.0 | 0.5     | How aggressively to compress. Higher values save more tokens but may reduce fidelity. |

<Note>
  Variant-level compression applies only to the router's message templates, not to messages the user sends in the request. An individual template can override the variant-level aggressiveness by setting its own `compression` field.
</Note>
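
The override rule in the note can be pictured as a simple precedence lookup. The following is a minimal sketch of that rule, not the router's actual implementation: the function name `effective_aggressiveness` is hypothetical, and it assumes that a `compression` object without an `aggressiveness` key falls back to the documented default of 0.5.

```python
from typing import Optional

DEFAULT_AGGRESSIVENESS = 0.5  # default from the parameter table


def effective_aggressiveness(variant: dict, template: dict) -> Optional[float]:
    """Return the aggressiveness applied to one template, or None if uncompressed."""
    # A template's own `compression` field takes precedence over the variant's.
    for scope in (template, variant):
        compression = scope.get("compression")
        if compression is not None:
            # Assumption: a missing key means the documented default applies.
            value = compression.get("aggressiveness", DEFAULT_AGGRESSIVENESS)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"aggressiveness must be in [0.0, 1.0], got {value}")
            return value
    return None


variant = {
    "variant_id": "v1",
    "compression": {"aggressiveness": 0.7},
    "message_templates": [
        {"role": "system", "content": "Long background instructions..."},
        {
            "role": "system",
            "content": "Strict output rules...",
            "compression": {"aggressiveness": 0.2},
        },
    ],
}
resolved = [effective_aggressiveness(variant, t) for t in variant["message_templates"]]
print(resolved)  # [0.7, 0.2]
```

Here the first template inherits 0.7 from the variant, while the second keeps its own, gentler 0.2.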

## Message-level compression

You can compress specific messages in a chat completion request by adding a `compression` field to any message. Consecutive messages with the same aggressiveness are batched and compressed together. Non-consecutive compressed messages are processed separately.

```bash theme={"system"}
curl -X POST https://api.inworld.ai/v1/chat/completions \
  -H 'Authorization: Bearer <your-api-key>' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "openai/gpt-4o",
    "messages": [
      {
        "role": "system",
        "content": "You are an expert assistant with extensive knowledge of...",
        "compression": { "aggressiveness": 0.5 }
      },
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```

Only messages with the `compression` field are compressed. Other messages are sent as-is.
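
To see how the batching rule plays out for a given message list, the grouping can be reproduced locally. This is an illustrative sketch only: `compression_batches` is a hypothetical helper, the service's internal batching is not documented beyond the description above, and the fallback to 0.5 when `aggressiveness` is omitted is an assumption based on the documented default.

```python
from itertools import groupby


def compression_batches(messages: list[dict]) -> list[tuple[float, list[int]]]:
    """Group the indices of consecutive messages that share one aggressiveness."""

    def key(item):
        _, message = item
        compression = message.get("compression")
        # None marks messages that are sent as-is.
        return None if compression is None else compression.get("aggressiveness", 0.5)

    batches = []
    for aggressiveness, run in groupby(enumerate(messages), key=key):
        if aggressiveness is not None:
            batches.append((aggressiveness, [index for index, _ in run]))
    return batches


messages = [
    {"role": "system", "content": "A...", "compression": {"aggressiveness": 0.5}},
    {"role": "system", "content": "B...", "compression": {"aggressiveness": 0.5}},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "system", "content": "C...", "compression": {"aggressiveness": 0.5}},
    {"role": "system", "content": "D...", "compression": {"aggressiveness": 0.8}},
]
batches = compression_batches(messages)
print(batches)  # [(0.5, [0, 1]), (0.5, [3]), (0.8, [4])]
```

Messages 0 and 1 form one batch. Message 3 uses the same aggressiveness but is separated from them by an uncompressed message, so it is processed on its own, and message 4 differs in aggressiveness and forms a third batch.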

## Response

The response includes compression stats in `metadata.compression`:

```json theme={"system"}
{
  "metadata": {
    "compression": {
      "original_tokens": 190,
      "compressed_tokens": 102,
      "saved_tokens": 88
    }
  }
}
```

| Field               | Description                                          |
| ------------------- | ---------------------------------------------------- |
| `original_tokens`   | Total tokens before compression                      |
| `compressed_tokens` | Total tokens after compression                       |
| `saved_tokens`      | Tokens saved (`original_tokens - compressed_tokens`) |

If no messages were compressed, the `compression` field is omitted from the response.
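
A client can turn these stats into a savings ratio while tolerating the omitted field. The sketch below assumes the response body has already been parsed into a Python dict; `compression_savings` is a hypothetical helper name.

```python
from typing import Optional


def compression_savings(response: dict) -> Optional[float]:
    """Return the fraction of input tokens saved, or None if nothing was compressed."""
    stats = response.get("metadata", {}).get("compression")
    # The field is omitted when no messages were compressed.
    if not stats or stats["original_tokens"] == 0:
        return None
    return stats["saved_tokens"] / stats["original_tokens"]


response = {
    "metadata": {
        "compression": {
            "original_tokens": 190,
            "compressed_tokens": 102,
            "saved_tokens": 88,
        }
    }
}
ratio = compression_savings(response)
print(f"{ratio:.1%}")  # 46.3%
```

With the sample values above, 88 of 190 tokens were saved, or about 46.3%.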

## Warnings

When compression is enabled but skipped or fails for specific messages, the response includes warnings in `metadata.attempts[].warnings`:

```json theme={"system"}
{
  "metadata": {
    "attempts": [{
      "model": "openai/gpt-4o",
      "success": true,
      "warnings": [
        "Prompt compression skipped for message #0 (system: 'You are a helpful...') — text too short (min 250 tokens)"
      ]
    }]
  }
}
```

Warnings appear when:

* A message has compression enabled but is too short (below 250 tokens)
* The compression service fails for a message (the message is sent uncompressed)
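
Because a skipped or failed compression does not fail the request, it can be useful to surface these warnings in logs. The following sketch collects every warning from `metadata.attempts[].warnings` together with the model of the attempt. `collect_warnings` is a hypothetical helper, and the warning text in the sample is a placeholder rather than real API output.

```python
def collect_warnings(response: dict) -> list[tuple[str, str]]:
    """Return (model, warning) pairs from every attempt in the response metadata."""
    pairs = []
    for attempt in response.get("metadata", {}).get("attempts", []):
        model = attempt.get("model", "unknown")
        # `warnings` may be absent when an attempt produced none.
        for warning in attempt.get("warnings", []):
            pairs.append((model, warning))
    return pairs


sample = {
    "metadata": {
        "attempts": [
            {
                "model": "openai/gpt-4o",
                "success": True,
                "warnings": ["placeholder warning text"],
            },
            {"model": "openai/gpt-4o", "success": True},
        ]
    }
}
found = collect_warnings(sample)
for model, warning in found:
    print(f"[{model}] {warning}")
```

The helper does not filter by message text, since only the "too short" wording is shown above and the format of other warnings is not specified.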
