Routeway supports two independent caching systems. Understanding which one to reach for will save you time and money:

Prompt Caching

Provider-side KV cache. Repeated prompt prefixes are processed faster and at lower cost: you still pay for inference, but at a reduced token rate.

Response Caching

Gateway-level full-response cache. Identical requests are returned instantly from Routeway with no upstream call at all — zero inference cost.

Prompt Caching

Prompt caching lets you reuse the KV (key-value) cache from a previous request when the beginning of your prompt is identical. If the model has already processed a long system prompt or document, the cached tokens cost significantly less and are processed faster.
Routeway automatically passes through cache control headers and prompt caching parameters to underlying providers that support them, including OpenAI and Anthropic.

How Prompt Caching Works

When you send a request, the model processes each token sequentially and stores intermediate computations in a KV cache. On a subsequent request that begins with the same prefix, the model can skip recomputing those tokens and read from the cache instead.
Request 1:  [System prompt: 2,000 tokens] + [User message: 50 tokens]
             └── All tokens computed, cache populated

Request 2:  [System prompt: 2,000 tokens] + [User message: 60 tokens]
             └── System prompt served from cache (cheap!)
             └── Only 60 new tokens computed

Cost reduction

Cached tokens are typically 75–90% cheaper than regular input tokens depending on the provider.

Latency reduction

Cache hits skip prompt processing, reducing time-to-first-token on long-context requests.
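To make the saving concrete, the sketch below prices the input side of the two requests from the diagram above. The price per million tokens and the 75% discount are placeholder values chosen for illustration, not Routeway or provider rates, and `estimate_input_cost` is a hypothetical helper.

```python
def estimate_input_cost(prompt_tokens: int, cached_tokens: int,
                        price_per_million: float, cached_discount: float) -> float:
    """Return the input cost in dollars for one request.

    cached_discount is the fraction taken off cached tokens (0.75 = 75% cheaper).
    """
    uncached_tokens = prompt_tokens - cached_tokens
    cached_price = price_per_million * (1 - cached_discount)
    return (uncached_tokens * price_per_million + cached_tokens * cached_price) / 1_000_000


# Placeholder pricing (assumption): $2.50 per million input tokens, cached tokens 75% cheaper.
PRICE = 2.50
DISCOUNT = 0.75

cold = estimate_input_cost(2050, 0, PRICE, DISCOUNT)     # Request 1: nothing cached yet
warm = estimate_input_cost(2060, 2000, PRICE, DISCOUNT)  # Request 2: system prompt cached

print(f"Request 1 input cost: ${cold:.6f}")
print(f"Request 2 input cost: ${warm:.6f}")
```

Swap in the real rates for your model to see how the saving scales with the length of the shared prefix.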

Automatic Caching (OpenAI Models)

OpenAI models (gpt-4o, gpt-4o-mini, o3, o4-mini, etc.) cache automatically. No changes to your requests are needed: Routeway forwards them to OpenAI, where they go through OpenAI’s own caching infrastructure. Cached tokens appear in the usage object:
{
  "usage": {
    "prompt_tokens": 2050,
    "completion_tokens": 120,
    "total_tokens": 2170,
    "prompt_tokens_details": {
      "cached_tokens": 2000,
      "audio_tokens": 0
    }
  }
}
The cache is keyed on the exact byte sequence of your prompt prefix. Even a single character difference at the beginning of a prompt will result in a cache miss.
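A common way to break the prefix by accident is to put a per-request value, such as a timestamp, at the top of the prompt. The sketch below illustrates the effect by comparing prompts character by character. This is only a stand-in, since real caches match on tokens rather than characters, and `shared_prefix_length` is a hypothetical helper.

```python
def shared_prefix_length(a: str, b: str) -> int:
    """Count how many leading characters two prompts have in common."""
    length = 0
    for char_a, char_b in zip(a, b):
        if char_a != char_b:
            break
        length += 1
    return length


INSTRUCTIONS = "You are an expert legal document reviewer. " * 50

# Anti-pattern: a per-request value at the very start changes the prefix.
bad_1 = "Request time: 09:00:01\n" + INSTRUCTIONS
bad_2 = "Request time: 09:00:02\n" + INSTRUCTIONS

# Better: stable text first, per-request value last.
good_1 = INSTRUCTIONS + "\nRequest time: 09:00:01"
good_2 = INSTRUCTIONS + "\nRequest time: 09:00:02"

print(shared_prefix_length(bad_1, bad_2))    # short: the prompts diverge inside the timestamp
print(shared_prefix_length(good_1, good_2))  # long: the whole instruction block is shared
```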

Maximizing Cache Hits for OpenAI Models

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.routeway.ai/v1",
    api_key=os.getenv("ROUTEWAY_API_KEY")
)

# Put stable, long content at the START of the messages array
SYSTEM_PROMPT = """
You are an expert legal document reviewer with 20 years of experience...
[2,000+ tokens of stable instructions and reference material]
"""

def review_clause(clause_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},  # cached after first call
            {"role": "user", "content": f"Review this clause:\n\n{clause_text}"}  # changes per request
        ],
    )

    usage = response.usage
    cached = usage.prompt_tokens_details.cached_tokens if usage.prompt_tokens_details else 0
    print(f"Cached tokens: {cached} / {usage.prompt_tokens}")

    return response.choices[0].message.content

Caching for Claude Models

Routeway automatically enables prompt caching for all Claude models when the request meets Anthropic’s minimum cacheable threshold (~2,048 tokens). You no longer need to add cache_control blocks manually: Routeway injects them on eligible content (system prompts, long user messages, and tool definitions) before forwarding the request to Anthropic.
We made this the default because most Claude requests above the minimum token threshold benefit from caching, yet many developers were missing out on significant cost savings simply because they weren’t aware of the cache_control parameter. With automatic caching, every eligible request gets up to 90% cheaper input token pricing with zero code changes on your side.
Automatic caching applies cache_control breakpoints to the largest stable content blocks in your request. If you already include explicit cache_control blocks, your configuration takes precedence — Routeway will not override or duplicate them.
You can still use explicit cache_control blocks for fine-grained control over exactly which content is cached:
import os
import anthropic

client = anthropic.Anthropic(
    base_url="https://api.routeway.ai",
    api_key=os.getenv("ROUTEWAY_API_KEY")
)

# Load a large document once
with open("legal_document.txt") as f:
    document = f.read()

response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a legal document analyst. Answer questions about the provided document.",
        },
        {
            "type": "text",
            "text": document,
            "cache_control": {"type": "ephemeral"}  # mark for caching
        }
    ],
    messages=[
        {"role": "user", "content": "What are the key termination clauses?"}
    ]
)

# Check cache usage
print(response.usage)
# cache_creation_input_tokens: 8500  (first request — writing to cache)
# cache_read_input_tokens: 0
On subsequent requests with the same document and cache_control:
response2 = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a legal document analyst. Answer questions about the provided document.",
        },
        {
            "type": "text",
            "text": document,
            "cache_control": {"type": "ephemeral"}  # same content = cache hit
        }
    ],
    messages=[
        {"role": "user", "content": "What are the payment terms?"}  # different question
    ]
)

print(response2.usage)
# cache_creation_input_tokens: 0
# cache_read_input_tokens: 8500  (cache hit — 90% cheaper!)
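When comparing the two responses, it can help to reduce the usage object to a single hit ratio. The sketch below assumes Routeway passes Anthropic's usage fields through unchanged and that `input_tokens` counts only the uncached part of the prompt, as it does in Anthropic's API. `summarize_claude_cache` is a hypothetical helper, and the numbers are illustrative.

```python
from types import SimpleNamespace


def summarize_claude_cache(usage) -> dict:
    """Summarize cache activity from an Anthropic-style usage object."""
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    uncached = getattr(usage, "input_tokens", 0) or 0
    total_input = read + written + uncached
    return {
        "total_input_tokens": total_input,
        "cache_read_tokens": read,
        "cache_write_tokens": written,
        "hit_ratio": read / total_input if total_input else 0.0,
    }


# Stand-in for response2.usage from the example above (illustrative numbers).
example_usage = SimpleNamespace(
    input_tokens=25,
    cache_creation_input_tokens=0,
    cache_read_input_tokens=8500,
    output_tokens=300,
)
print(summarize_claude_cache(example_usage))
```

In real code, pass `response.usage` or `response2.usage` directly instead of the `SimpleNamespace` stand-in.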

What to Cache

Not everything is worth caching. Focus on content that meets these criteria:
System prompts, reference documents, codebases, and knowledge bases that don’t change between requests are ideal cache candidates.
Minimum size: caching only kicks in above ~1,024 tokens (OpenAI) or ~2,048 tokens (Anthropic).
The cache matches on prefixes. Stable content must come before the dynamic parts (user messages, session data, etc.).
✓ [Large system prompt] → [User message]
✗ [User message] → [Large system prompt]
Caching is most valuable when many requests share the same prefix. Single-use prompts don’t benefit from caching.
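The criteria above can be turned into a quick pre-flight check. The sketch below is a rough heuristic, not part of Routeway: it uses the thresholds quoted in this section, estimates tokens at about four characters each (an assumption that only loosely holds for English text), and `worth_caching` is a hypothetical helper.

```python
MIN_CACHEABLE_TOKENS = {"openai": 1024, "anthropic": 2048}  # thresholds quoted above


def rough_token_count(text: str) -> int:
    """Very rough estimate: about 4 characters per token for English text."""
    return len(text) // 4


def worth_caching(stable_prefix: str, provider: str, expected_reuses: int) -> bool:
    """Return True when the prefix is long enough and will be reused at least once."""
    long_enough = rough_token_count(stable_prefix) >= MIN_CACHEABLE_TOKENS[provider]
    reused = expected_reuses >= 1
    return long_enough and reused


short_prompt = "You are a helpful assistant."
long_prompt = "Reference material. " * 500  # 10,000 characters, roughly 2,500 tokens

print(worth_caching(short_prompt, "openai", expected_reuses=100))    # False: too short
print(worth_caching(long_prompt, "anthropic", expected_reuses=100))  # True
print(worth_caching(long_prompt, "anthropic", expected_reuses=0))    # False: single use
```

For an exact count, replace `rough_token_count` with the tokenizer for your model.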

Cache Lifetime

Provider | Cache TTL | Notes
OpenAI | ~5–10 minutes | Auto-managed, no manual control
Anthropic | ~5 minutes | Refreshed on each cache hit
Caches expire during periods of low traffic. For long-running workflows, periodically “warm” the cache by making a request that exercises the cached prefix.
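One way to keep a prefix warm is to track when it was last used and re-send it shortly before the TTL would lapse. The sketch below is a minimal illustration under assumptions: the 300-second TTL comes from the table above, the safety margin is arbitrary, and `CacheWarmer` and its `send_request` callback are hypothetical names.

```python
import time
from typing import Callable, Optional


class CacheWarmer:
    """Track the last request for a prefix and re-send it before the cache TTL lapses."""

    def __init__(self, send_request: Callable[[], None], ttl_seconds: float = 300,
                 safety_margin: float = 60,
                 clock: Callable[[], float] = time.monotonic):
        self.send_request = send_request
        self.refresh_after = ttl_seconds - safety_margin
        self.clock = clock
        self.last_used: Optional[float] = None

    def mark_used(self) -> None:
        """Call after any real request that used the cached prefix."""
        self.last_used = self.clock()

    def warm_if_needed(self) -> bool:
        """Send a warming request if the prefix has been idle too long."""
        now = self.clock()
        if self.last_used is not None and now - self.last_used < self.refresh_after:
            return False  # recent traffic already refreshed the cache
        self.send_request()
        self.last_used = now
        return True
```

Call `warm_if_needed()` from a scheduler or background loop, with `send_request` wrapping a cheap call that reuses the same prefix (for example, a request with a very small `max_tokens`). Warming requests are billed like any other request, so this only pays off when the prefix is large and real traffic is bursty.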

Checking Cache Usage

Always log cache usage during development to verify your caching strategy is working:
def log_cache_stats(usage):
    prompt_tokens = usage.prompt_tokens
    cached_tokens = 0

    if hasattr(usage, "prompt_tokens_details") and usage.prompt_tokens_details:
        cached_tokens = usage.prompt_tokens_details.cached_tokens or 0

    cache_ratio = cached_tokens / prompt_tokens if prompt_tokens > 0 else 0
    print(f"Tokens: {prompt_tokens} total, {cached_tokens} cached ({cache_ratio:.0%})")

response = client.chat.completions.create(...)
log_cache_stats(response.usage)

Response Caching

Response caching operates at the Routeway gateway level — before a request ever reaches an upstream provider. When a cache entry exists for an identical request, Routeway returns the stored response immediately. No inference runs, no provider is charged. Repeated identical calls cost $0.
Response caching is separate from prompt caching. Prompt caching reduces the cost of processing a request; response caching eliminates the upstream call entirely.

Zero inference cost

Cache hits bypass the provider completely. Repeated identical requests are free — no tokens consumed, no provider charge.

Instant responses

Cached responses return in milliseconds, with no model latency at all.

Best for deterministic workloads

Classification, batch jobs, FAQ lookups, and retries all benefit most — inputs are predictable and repeat frequently.

Per-key configuration

Caching is enabled and configured per API key, so different use cases can have different TTL policies.

Enabling Response Caching

Response caching is configured per API key. When creating or editing a key in the Dashboard, scroll to the Cache Responses section and toggle it on.
1. Open API key settings
Go to Dashboard → API Keys and create a new key or click Edit on an existing one.

2. Enable Cache Responses
Scroll to the Cache Responses section and enable it.

3. Set the default TTL
Enter how long cached responses should be kept, in seconds.
Minimum | 300 (5 minutes)
Maximum | 86400 (24 hours)
Default | 3600 (1 hour)

Cache Key

Cache entries are scoped to:
API key + model + exact request body
Any change to the request — messages, temperature, max_tokens, or any other field — produces a cache miss and triggers a fresh upstream call.
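The scoping rule can be pictured as hashing the three components together. The sketch below is illustrative only: Routeway's actual key derivation is not documented here, and hashing a canonical JSON serialization with SHA-256 is an assumption. `response_cache_key` is a hypothetical helper.

```python
import hashlib
import json


def response_cache_key(api_key: str, body: dict) -> str:
    """Derive a deterministic key from the API key, the model, and the full request body."""
    canonical_body = json.dumps(body, sort_keys=True, separators=(",", ":"))
    material = f"{api_key}\n{body.get('model', '')}\n{canonical_body}"
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


base = {
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "temperature": 0,
}
same = dict(base)
hotter = {**base, "temperature": 0.7}

print(response_cache_key("key-A", base) == response_cache_key("key-A", same))    # True
print(response_cache_key("key-A", base) == response_cache_key("key-A", hotter))  # False
print(response_cache_key("key-A", base) == response_cache_key("key-B", base))    # False
```

Because the gateway may compare the body byte for byte, it is safest to build requests through one code path so that field order and formatting stay stable.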

Per-Request Headers

You can override the key-level cache settings on individual requests using request headers:
Header | Values | Description
X-Cache | true / false | Enable or disable caching for this request, overriding the key default
X-Cache-TTL | <seconds> | Override the TTL for this request (clamped to 300–86400)
X-Cache-Clear | true | Invalidate the cached entry for this exact request and force a fresh upstream call
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.routeway.ai/v1",
    api_key=os.getenv("ROUTEWAY_API_KEY")
)

# Cache this request for 10 minutes instead of the key default
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Classify this as spam or not: 'Congratulations, you won!'"}],
    extra_headers={
        "X-Cache": "true",
        "X-Cache-TTL": "600",
    },
)

print(response.choices[0].message.content)
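Since the gateway clamps out-of-range TTL values, it can be useful to apply the same clamp on the client so the effective TTL is visible in your own code. The sketch below is a hypothetical helper (`cache_headers` is not part of any SDK) that builds the override headers for `extra_headers`.

```python
from typing import Dict, Optional

MIN_TTL_SECONDS = 300    # 5 minutes
MAX_TTL_SECONDS = 86400  # 24 hours


def cache_headers(enabled: bool = True, ttl_seconds: Optional[int] = None) -> Dict[str, str]:
    """Build per-request cache headers, clamping the TTL to the allowed range."""
    headers = {"X-Cache": "true" if enabled else "false"}
    if enabled and ttl_seconds is not None:
        clamped = max(MIN_TTL_SECONDS, min(MAX_TTL_SECONDS, ttl_seconds))
        headers["X-Cache-TTL"] = str(clamped)
    return headers


print(cache_headers(ttl_seconds=600))    # {'X-Cache': 'true', 'X-Cache-TTL': '600'}
print(cache_headers(ttl_seconds=60))     # {'X-Cache': 'true', 'X-Cache-TTL': '300'}
print(cache_headers(enabled=False))      # {'X-Cache': 'false'}
```

Pass the result as `extra_headers=cache_headers(ttl_seconds=600)` in the call shown above.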

Forcing a cache refresh

Send X-Cache-Clear: true to bust the cached entry and force a live upstream call. The fresh response is then stored under the same cache key.
curl https://api.routeway.ai/v1/chat/completions \
  -H "Authorization: Bearer $ROUTEWAY_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Cache-Clear: true" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
  }'

Response Headers

Every response includes headers that tell you what the cache did:
Header | Values | Description
X-Cache-Status | HIT / MISS | Whether the response came from cache
X-Cache-Age | <seconds> | How old the cached response is (only on HIT)
X-Cache-TTL | <seconds> | Remaining time before the cache entry expires
import httpx
import os

# Use httpx directly to inspect response headers
with httpx.Client() as http:
    r = http.post(
        "https://api.routeway.ai/v1/chat/completions",
        headers={
            "Authorization": f"Bearer {os.getenv('ROUTEWAY_API_KEY')}",
            "Content-Type": "application/json",
            "X-Cache": "true",
        },
        json={
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": "What is the capital of France?"}],
        },
    )

print(r.headers.get("X-Cache-Status"))  # HIT or MISS
print(r.headers.get("X-Cache-Age"))     # e.g. 42 (seconds since cached)
print(r.headers.get("X-Cache-TTL"))     # e.g. 3558 (seconds remaining)
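For logging or metrics, the three headers can be folded into one small structure. The sketch below is a hypothetical helper (`CacheInfo` and `parse_cache_headers` are illustrative names) that accepts any mapping of header names to values.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class CacheInfo:
    hit: bool
    age_seconds: Optional[int]
    ttl_remaining_seconds: Optional[int]


def _to_int(value: Optional[str]) -> Optional[int]:
    """Convert a header value to int, or None when it is missing or not numeric."""
    return int(value) if value is not None and value.strip().isdigit() else None


def parse_cache_headers(headers: Mapping[str, str]) -> CacheInfo:
    """Read X-Cache-Status, X-Cache-Age and X-Cache-TTL from a response."""
    status = (headers.get("X-Cache-Status") or "").upper()
    return CacheInfo(
        hit=status == "HIT",
        age_seconds=_to_int(headers.get("X-Cache-Age")),
        ttl_remaining_seconds=_to_int(headers.get("X-Cache-TTL")),
    )


sample = {"X-Cache-Status": "HIT", "X-Cache-Age": "42", "X-Cache-TTL": "3558"}
print(parse_cache_headers(sample))
```

With httpx you can pass `r.headers` directly, because its header mapping is case-insensitive. A plain dict, as in the sample, must use the exact header casing.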

When to Use Response Caching

Classification: spam detection, sentiment analysis, intent routing, and content moderation often receive duplicate inputs. Cache the result the first time and serve it instantly on repeats.
Batch jobs and retries: if a batch pipeline re-processes records or retries failed items, identical inputs hit the cache instead of re-running inference and being billed again.
FAQ lookups: a fixed set of questions (product FAQs, support macros, help-centre entries) is cached on the first request and then served for free until the TTL expires.
Development and testing: avoid burning credits on repeated test calls. Enable caching during development and set a long TTL so identical prompts are free after the first run.
Response caching is not suited for free-form conversational chat. Multi-turn conversations almost never produce identical request bodies, so cache hit rates will be near zero and the overhead adds no value.
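For the workloads that do fit, hit rates depend on equivalent inputs producing byte-identical request bodies. One approach, sketched below under assumptions, is to build every request through a single function that normalizes the input and fixes all other fields. `build_classification_request` and its system prompt are hypothetical.

```python
import json


def build_classification_request(text: str) -> dict:
    """Build a request body that is identical for equivalent inputs."""
    normalized = " ".join(text.split())  # collapse runs of whitespace and trim the ends
    return {
        "model": "gpt-4o-mini",
        "temperature": 0,
        "messages": [
            {"role": "system", "content": "Classify the message as spam or not_spam."},
            {"role": "user", "content": normalized},
        ],
    }


a = build_classification_request("Congratulations, you won!")
b = build_classification_request("  Congratulations,   you won!\n")
print(json.dumps(a, sort_keys=True) == json.dumps(b, sort_keys=True))  # True
```

Only normalize in ways that cannot change the correct answer. Whitespace is usually safe for classification, but case folding or punctuation stripping may not be.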