Caching - Routeway Docs

Routeway supports two independent caching systems. Understanding which one to reach for will save you time and money:

Prompt Caching

Provider-side KV cache. Repeated prompt prefixes are processed cheaper and faster — you still pay for inference, just at a reduced token rate.

Response Caching

Gateway-level full-response cache. Identical requests are returned instantly from Routeway with no upstream call at all — zero inference cost.

Prompt Caching

Prompt caching lets you reuse the KV (key-value) cache from a previous request when the beginning of your prompt is identical. If the model has already processed a long system prompt or document, cached tokens cost significantly less and return faster.

Routeway automatically passes through cache control headers and prompt caching parameters to underlying providers that support them, including OpenAI and Anthropic.

How Prompt Caching Works

When you send a request, the model processes each token sequentially and stores intermediate computations in a KV cache. On a subsequent request that begins with the same prefix, the model can skip recomputing those tokens and read from the cache instead.

Request 1:  [System prompt: 2,000 tokens] + [User message: 50 tokens]
             └── All tokens computed, cache populated

Request 2:  [System prompt: 2,000 tokens] + [User message: 60 tokens]
             └── System prompt served from cache (cheap!)
             └── Only 60 new tokens computed

Cost reduction

Cached tokens are typically 75–90% cheaper than regular input tokens depending on the provider.

Latency reduction

Cache hits skip prompt processing, reducing time-to-first-token on long-context requests.
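
As a rough illustration of the arithmetic, the sketch below estimates the input cost of the two requests shown earlier. The rate of $2.50 per million input tokens and the 90% cached-token discount are placeholder assumptions, not Routeway or provider pricing; substitute the rates for your model.

```python
def input_cost(prompt_tokens: int, cached_tokens: int,
               price_per_million: float, cached_discount: float) -> float:
    """Estimate input cost in dollars when some prompt tokens are cache hits.

    cached_discount is the fraction saved on cached tokens (0.9 = 90% cheaper).
    """
    uncached = prompt_tokens - cached_tokens
    full_rate = price_per_million / 1_000_000
    return uncached * full_rate + cached_tokens * full_rate * (1 - cached_discount)


# Hypothetical rates: $2.50 per 1M input tokens, cached tokens 90% cheaper.
cold = input_cost(2050, 0, 2.50, 0.9)      # Request 1: nothing cached yet
warm = input_cost(2060, 2000, 2.50, 0.9)   # Request 2: system prompt cached
print(f"cold: ${cold:.6f}  warm: ${warm:.6f}")
```

Output tokens are billed separately and are not affected by prompt caching, so the function covers input cost only.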

Automatic Caching (OpenAI Models)

OpenAI models (gpt-4o, gpt-4o-mini, o3, o4-mini, etc.) cache automatically. No changes to your requests are needed: Routeway forwards them to OpenAI, where OpenAI’s own prompt cache applies. Cached tokens appear in the usage object:

{
  "usage": {
    "prompt_tokens": 2050,
    "completion_tokens": 120,
    "total_tokens": 2170,
    "prompt_tokens_details": {
      "cached_tokens": 2000,
      "audio_tokens": 0
    }
  }
}

The cache is keyed on the exact byte sequence of your prompt prefix. A single-character difference invalidates the cache from that point onward, so a change near the beginning of the prompt results in a near-total cache miss.
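
To make the prefix rule concrete, the sketch below measures how much of two prompts is shared from the start. It compares characters rather than tokens, so it is only an approximation of what a provider cache sees, and `shared_prefix_chars` is a hypothetical helper rather than part of any SDK.

```python
import os


def shared_prefix_chars(previous_prompt: str, current_prompt: str) -> int:
    """Return how many leading characters the two prompts have in common."""
    return len(os.path.commonprefix([previous_prompt, current_prompt]))


stable = "You are a legal reviewer. " * 50   # long, unchanging instructions

# Dynamic value placed first: the prompts diverge almost immediately.
bad_a = "Request ID: 1001\n" + stable
bad_b = "Request ID: 1002\n" + stable

# Dynamic value placed last: the whole stable block is a shared prefix.
good_a = stable + "Request ID: 1001\n"
good_b = stable + "Request ID: 1002\n"

print(shared_prefix_chars(bad_a, bad_b))    # small: only the start of the ID line
print(shared_prefix_chars(good_a, good_b))  # large: covers all of `stable`
```

Timestamps, request IDs, and user names at the top of a system prompt are common causes of the first pattern.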

Maximizing Cache Hits for OpenAI Models

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.routeway.ai/v1",
    api_key=os.getenv("ROUTEWAY_API_KEY")
)

# Put stable, long content at the START of the messages array
SYSTEM_PROMPT = """
You are an expert legal document reviewer with 20 years of experience...
[2,000+ tokens of stable instructions and reference material]
"""

def review_clause(clause_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},  # cached after first call
            {"role": "user", "content": f"Review this clause:\n\n{clause_text}"}  # changes per request
        ],
    )

    usage = response.usage
    # cached_tokens may be None when the provider reports no cache details
    cached = (usage.prompt_tokens_details.cached_tokens or 0) if usage.prompt_tokens_details else 0
    print(f"Cached tokens: {cached} / {usage.prompt_tokens}")

    return response.choices[0].message.content

Caching for Claude Models

Routeway automatically enables prompt caching for all Claude models when the request meets Anthropic’s minimum cacheable threshold (~2,048 tokens). You do not need to add cache_control blocks manually: Routeway injects them on eligible content (system prompts, long user messages, and tool definitions) before forwarding the request to Anthropic.

This is the default because most Claude requests above the minimum threshold benefit from caching, yet many developers missed the savings simply because they were unaware of the cache_control parameter. With automatic caching, eligible requests get up to 90% cheaper input token pricing on cache hits, with zero code changes on your side.

Automatic caching applies cache_control breakpoints to the largest stable content blocks in your request. If you already include explicit cache_control blocks, your configuration takes precedence — Routeway will not override or duplicate them.

You can still use explicit cache_control blocks for fine-grained control over exactly which content is cached:

Two Python variants follow: the Anthropic SDK first, then the OpenAI SDK.

import os
import anthropic

client = anthropic.Anthropic(
    base_url="https://api.routeway.ai",
    api_key=os.getenv("ROUTEWAY_API_KEY")
)

# Load a large document once
with open("legal_document.txt") as f:
    document = f.read()

response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a legal document analyst. Answer questions about the provided document.",
        },
        {
            "type": "text",
            "text": document,
            "cache_control": {"type": "ephemeral"}  # mark for caching
        }
    ],
    messages=[
        {"role": "user", "content": "What are the key termination clauses?"}
    ]
)

# Check cache usage
print(response.usage)
# cache_creation_input_tokens: 8500  (first request — writing to cache)
# cache_read_input_tokens: 0

On subsequent requests with the same document and cache_control:

response2 = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a legal document analyst. Answer questions about the provided document.",
        },
        {
            "type": "text",
            "text": document,
            "cache_control": {"type": "ephemeral"}  # same content = cache hit
        }
    ],
    messages=[
        {"role": "user", "content": "What are the payment terms?"}  # different question
    ]
)

print(response2.usage)
# cache_creation_input_tokens: 0
# cache_read_input_tokens: 8500  (cache hit — 90% cheaper!)
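
The usage fields above can be condensed into a single hit ratio. The sketch below is a hypothetical helper that reads the three input-token counters reported by the Anthropic Messages API; it assumes Routeway returns them unchanged, as the example output suggests. A `SimpleNamespace` stands in for a real `response.usage` object.

```python
from types import SimpleNamespace


def summarize_anthropic_cache(usage) -> dict:
    """Summarize cache activity from an Anthropic-style usage object."""
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    # input_tokens counts only the tokens that were neither written to
    # nor read from the cache.
    uncached = getattr(usage, "input_tokens", 0) or 0
    total = written + read + uncached
    return {
        "cache_written": written,
        "cache_read": read,
        "uncached_input": uncached,
        "hit_ratio": read / total if total else 0.0,
    }


# Stand-in for response2.usage; the token counts are illustrative.
example = SimpleNamespace(
    input_tokens=30,
    cache_creation_input_tokens=0,
    cache_read_input_tokens=8500,
)
print(summarize_anthropic_cache(example))
```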

Python (OpenAI SDK):

import os
from openai import OpenAI

# Use the OpenAI SDK with Anthropic models via Routeway
client = OpenAI(
    base_url="https://api.routeway.ai/v1",
    api_key=os.getenv("ROUTEWAY_API_KEY")
)

with open("legal_document.txt") as f:
    document = f.read()

response = client.chat.completions.create(
    model="claude-opus-4-5",
    messages=[
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are a legal document analyst."},
                {
                    "type": "text",
                    "text": document,
                    "cache_control": {"type": "ephemeral"}
                }
            ]
        },
        {"role": "user", "content": "What are the key termination clauses?"}
    ]
)

What to Cache

Not everything is worth caching. Focus on content that is:

Large and stable

System prompts, reference documents, codebases, and knowledge bases that don’t change between requests are ideal cache candidates.

Minimum size: caching only kicks in above ~1,024 tokens (OpenAI) or ~2,048 tokens (Anthropic).

At the beginning of the prompt

The cache matches on prefixes. Stable content must come before the dynamic parts (user messages, session data, etc.).

✓ [Large system prompt] → [User message]
✗ [User message] → [Large system prompt]

Reused across multiple requests

Caching is most valuable when many requests share the same prefix. Single-use prompts don’t benefit from caching.
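
A quick way to apply the size criterion is to estimate whether a stable prefix clears the minimum. The sketch below uses the approximate thresholds quoted on this page and a crude assumption of about four characters per token for English text; use a real tokenizer when accuracy matters. All names are hypothetical.

```python
# Approximate minimums quoted on this page; actual thresholds may vary by model.
MIN_CACHEABLE_TOKENS = {"openai": 1024, "anthropic": 2048}

# Assumption: roughly 4 characters per token for English text.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return len(text) // CHARS_PER_TOKEN


def is_probably_cacheable(stable_prefix: str, provider: str) -> bool:
    """True when the estimated prefix size meets the provider's minimum."""
    return estimate_tokens(stable_prefix) >= MIN_CACHEABLE_TOKENS[provider]


short_prompt = "You are a helpful assistant."
long_prompt = "Clause 1. The parties agree to the following terms. " * 200

print(is_probably_cacheable(short_prompt, "openai"))     # far below the minimum
print(is_probably_cacheable(long_prompt, "anthropic"))   # about 2,600 estimated tokens
```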

Cache Lifetime

Provider	Cache TTL	Notes
OpenAI	~5–10 minutes	Auto-managed, no manual control
Anthropic	~5 minutes	Refreshed on each cache hit

Caches expire during periods of low traffic. For long-running workflows, periodically “warm” the cache by making a request that exercises the cached prefix.
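
One way to decide when a warm-up request is due is to track the time of the last request that used the prefix. The sketch below assumes the ~5 minute TTL from the table above and an arbitrary 60-second safety margin; `needs_warming` is a hypothetical helper, and sending the warm-up request itself is left to your client code.

```python
import time
from typing import Optional

# Assumptions: ~5 minute TTL (see the table above) and a 60-second margin.
CACHE_TTL_SECONDS = 300
SAFETY_MARGIN_SECONDS = 60


def needs_warming(last_request_at: float, now: Optional[float] = None) -> bool:
    """True when the cached prefix is close enough to expiry to refresh it."""
    now = time.time() if now is None else now
    idle = now - last_request_at
    return idle >= CACHE_TTL_SECONDS - SAFETY_MARGIN_SECONDS


last_request_at = 1_000.0
print(needs_warming(last_request_at, now=1_100.0))  # idle for 100 seconds
print(needs_warming(last_request_at, now=1_250.0))  # idle for 250 seconds
```

In a long-running workflow you might call this on a timer and, when it returns True, send a small request that reuses the same prefix. Each warm-up is still a billed request, so this only pays off when the cached prefix is large.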

Checking Cache Usage

Always log cache usage during development to verify your caching strategy is working:

def log_cache_stats(usage):
    prompt_tokens = usage.prompt_tokens
    cached_tokens = 0

    if hasattr(usage, "prompt_tokens_details") and usage.prompt_tokens_details:
        cached_tokens = usage.prompt_tokens_details.cached_tokens or 0

    cache_ratio = cached_tokens / prompt_tokens if prompt_tokens > 0 else 0
    print(f"Tokens: {prompt_tokens} total, {cached_tokens} cached ({cache_ratio:.0%})")

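# Assumes `client` is the OpenAI client configured earlier on this page
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)
log_cache_stats(response.usage)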

Response Caching

Response caching operates at the Routeway gateway level — before a request ever reaches an upstream provider. When a cache entry exists for an identical request, Routeway returns the stored response immediately. No inference runs, no provider is charged. Repeated identical calls cost $0.

Response caching is separate from prompt caching. Prompt caching reduces the cost of processing a request; response caching eliminates the upstream call entirely.

Zero inference cost

Cache hits bypass the provider completely. Repeated identical requests are free — no tokens consumed, no provider charge.

Instant responses

Cached responses return in milliseconds, with no model latency at all.

Best for deterministic workloads

Classification, batch jobs, FAQ lookups, and retries all benefit most — inputs are predictable and repeat frequently.

Per-key configuration

Caching is enabled and configured per API key, so different use cases can have different TTL policies.

Enabling Response Caching

Response caching is configured per API key. When creating or editing a key in the Dashboard, scroll to the Cache Responses section and toggle it on.

Open API key settings

Go to Dashboard → API Keys and create a new key or click Edit on an existing one.

Enable Cache Responses

Scroll to the Cache Responses section and enable it.

Set the default TTL

Enter how long cached responses should be kept, in seconds.


Minimum	`300` (5 minutes)
Maximum	`86400` (24 hours)
Default	`3600` (1 hour)

Cache Key

Cache entries are scoped to:

API key + model + exact request body

Any change to the request — messages, temperature, max_tokens, or any other field — produces a cache miss and triggers a fresh upstream call.
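
The scoping rule can be pictured as a hash over the three components. The sketch below is purely illustrative: Routeway's actual key derivation is not documented here, and the serialization with sorted keys is an assumption made so that the example is deterministic.

```python
import hashlib
import json


def illustrative_cache_key(api_key: str, body: dict) -> str:
    """Hypothetical key derivation: hash of API key + model + serialized body."""
    serialized = json.dumps(body, sort_keys=True, separators=(",", ":"))
    material = "\n".join([api_key, body.get("model", ""), serialized])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


base = {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Hi"}]}
tweaked = {**base, "temperature": 0.2}

key_a = illustrative_cache_key("key-123", base)
key_b = illustrative_cache_key("key-123", dict(base))
key_c = illustrative_cache_key("key-123", tweaked)
key_d = illustrative_cache_key("key-456", base)

print(key_a == key_b)  # identical request: same key
print(key_a == key_c)  # extra field in the body: different key
print(key_a == key_d)  # different API key: different key
```

The practical consequence is to keep request construction deterministic: build the body the same way every time, and avoid per-call values such as timestamps or random seeds unless you want a fresh response.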

Per-Request Headers

You can override the key-level cache settings on individual requests using request headers:

Header	Values	Description
`X-Cache`	`true` / `false`	Enable or disable caching for this request, overriding the key default
`X-Cache-TTL`	`<seconds>`	Override the TTL for this request (clamped to 300–86400)
`X-Cache-Clear`	`true`	Invalidate the cached entry for this exact request and force a fresh upstream call
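
If you set these headers from several places in your code, a small builder keeps them consistent. The sketch below is a hypothetical helper, not part of any SDK; it clamps the TTL on the client side to the 300–86400 range from the table, mirroring what the gateway is described as doing.

```python
from typing import Optional

MIN_TTL, MAX_TTL = 300, 86400  # bounds listed in the table above


def cache_headers(enabled: Optional[bool] = None,
                  ttl_seconds: Optional[int] = None,
                  clear: bool = False) -> dict:
    """Build per-request cache headers; omitted options use the key defaults."""
    headers = {}
    if enabled is not None:
        headers["X-Cache"] = "true" if enabled else "false"
    if ttl_seconds is not None:
        clamped = max(MIN_TTL, min(MAX_TTL, ttl_seconds))
        headers["X-Cache-TTL"] = str(clamped)
    if clear:
        headers["X-Cache-Clear"] = "true"
    return headers


print(cache_headers(enabled=True, ttl_seconds=600))
print(cache_headers(ttl_seconds=60))   # below the minimum, raised to 300
print(cache_headers(clear=True))
```

The resulting dictionary can be passed as `extra_headers` in the OpenAI Python SDK.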

The same request in Python, Node.js, and cURL:

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.routeway.ai/v1",
    api_key=os.getenv("ROUTEWAY_API_KEY")
)

# Cache this request for 10 minutes instead of the key default
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Classify this as spam or not: 'Congratulations, you won!'"}],
    extra_headers={
        "X-Cache": "true",
        "X-Cache-TTL": "600",
    },
)

print(response.choices[0].message.content)

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.routeway.ai/v1",
  apiKey: process.env.ROUTEWAY_API_KEY,
});

// Cache this request for 10 minutes instead of the key default
const response = await client.chat.completions.create(
  {
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "Classify this as spam or not: 'Congratulations, you won!'" }],
  },
  {
    headers: {
      "X-Cache": "true",
      "X-Cache-TTL": "600",
    },
  }
);

console.log(response.choices[0].message.content);

curl https://api.routeway.ai/v1/chat/completions \
  -H "Authorization: Bearer $ROUTEWAY_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Cache: true" \
  -H "X-Cache-TTL: 600" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {"role": "user", "content": "Classify this as spam or not: 'Congratulations, you won!'"}
    ]
  }'

Forcing a cache refresh

Send X-Cache-Clear: true to bust the cached entry and force a live upstream call. The fresh response is then stored under the same cache key.

curl https://api.routeway.ai/v1/chat/completions \
  -H "Authorization: Bearer $ROUTEWAY_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Cache-Clear: true" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
  }'

Response Headers

Every response includes headers that tell you what the cache did:

Header	Values	Description
`X-Cache-Status`	`HIT` / `MISS`	Whether the response came from cache
`X-Cache-Age`	`<seconds>`	How old the cached response is (only on `HIT`)
`X-Cache-TTL`	`<seconds>`	Remaining time before the cache entry expires

import httpx
import os

# Use httpx directly to inspect response headers
# httpx defaults to a 5-second timeout, which a model call can exceed on a cache miss
with httpx.Client(timeout=60.0) as http:
    r = http.post(
        "https://api.routeway.ai/v1/chat/completions",
        headers={
            "Authorization": f"Bearer {os.getenv('ROUTEWAY_API_KEY')}",
            "Content-Type": "application/json",
            "X-Cache": "true",
        },
        json={
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": "What is the capital of France?"}],
        },
    )

print(r.headers.get("X-Cache-Status"))  # HIT or MISS
print(r.headers.get("X-Cache-Age"))     # e.g. 42 (seconds since cached)
print(r.headers.get("X-Cache-TTL"))     # e.g. 3558 (seconds remaining)
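
For logging or metrics, the three headers can be folded into one structured record. The sketch below is a hypothetical helper that accepts any mapping of header names to values. Note that a plain `dict` is case-sensitive, whereas `httpx` response headers are not, so the names here must match the table exactly when you pass a `dict`.

```python
from typing import Mapping, Optional


def parse_cache_headers(headers: Mapping[str, str]) -> dict:
    """Turn Routeway cache response headers into a small structured record."""

    def to_int(value: Optional[str]) -> Optional[int]:
        # Missing or non-numeric values become None instead of raising.
        return int(value) if value is not None and value.isdigit() else None

    return {
        "hit": headers.get("X-Cache-Status") == "HIT",
        "age_seconds": to_int(headers.get("X-Cache-Age")),
        "ttl_remaining_seconds": to_int(headers.get("X-Cache-TTL")),
    }


hit = parse_cache_headers(
    {"X-Cache-Status": "HIT", "X-Cache-Age": "42", "X-Cache-TTL": "3558"}
)
miss = parse_cache_headers({"X-Cache-Status": "MISS", "X-Cache-TTL": "3600"})
print(hit)
print(miss)
```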

When to Use Response Caching

Classification and labelling

Spam detection, sentiment analysis, intent routing, and content moderation often receive duplicate inputs. Cache the result the first time and serve it instantly on repeats.

Batch jobs with retries

If a batch pipeline re-processes records or retries failed items, identical inputs will hit the cache instead of re-running inference and being billed again.

FAQ and knowledge-base lookups

A fixed set of questions (product FAQs, support macros, help-centre entries) will cache on the first request and return for free until the TTL expires.

Development and testing

Avoid burning credits on repeated test calls. Enable caching during dev and set a long TTL so identical prompts are free after the first run.

Response caching is not suited for free-form conversational chat. Multi-turn conversations almost never produce identical request bodies, so cache hit rates will be near zero and the overhead adds no value.
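
To gauge whether a workload suits response caching, you can count how many requests in a sample repeat an earlier, identical body. The sketch below is a hypothetical estimate: it assumes every request uses the same API key, that entries stay cached for the whole sample (no TTL expiry), and that bodies are compared after serialization with sorted keys.

```python
import json
from collections import Counter


def expected_hit_rate(request_bodies: list) -> float:
    """Fraction of requests that repeat an earlier, identical body."""
    if not request_bodies:
        return 0.0
    counts = Counter(json.dumps(body, sort_keys=True) for body in request_bodies)
    # The first occurrence of each body is a miss; every repeat is a hit.
    repeats = sum(n - 1 for n in counts.values())
    return repeats / len(request_bodies)


def classify(text: str) -> dict:
    """Build a sample classification request body."""
    return {
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": f"Spam or not: {text}"}],
    }


sample = [classify("You won!"), classify("Meeting at 3"),
          classify("You won!"), classify("You won!")]
print(f"{expected_hit_rate(sample):.0%}")
```

A multi-turn chat log run through the same function would typically score close to zero, which matches the guidance above.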
