Prompt Caching
Provider-side KV cache. Repeated prompt prefixes are processed cheaper and faster — you still pay for inference, just at a reduced token rate.
Response Caching
Gateway-level full-response cache. Identical requests are returned instantly from Routeway with no upstream call at all — zero inference cost.
Prompt Caching
Prompt caching lets you reuse the KV (key-value) cache from a previous request when the beginning of your prompt is identical. If the model has already processed a long system prompt or document, cached tokens cost significantly less and return faster.Routeway automatically passes through cache control headers and prompt caching parameters to underlying providers that support them, including OpenAI and Anthropic.
How Prompt Caching Works
When you send a request, the model processes each token sequentially and stores intermediate computations in a KV cache. On a subsequent request that begins with the same prefix, the model can skip recomputing those tokens and read from the cache instead.Cost reduction
Cached tokens are typically 75–90% cheaper than regular input tokens depending on the provider.
Latency reduction
Cache hits skip prompt processing, reducing time-to-first-token on long-context requests.
Automatic Caching (OpenAI Models)
OpenAI models (gpt-4o, gpt-4o-mini, o3, o4-mini, etc.) cache automatically. No changes to your requests are needed — Routeway forwards the same cache infrastructure OpenAI uses.
Cached tokens appear in the usage object:
Maximizing Cache Hits for OpenAI Models
Caching for Claude Models
Routeway automatically enables prompt caching for all Claude models when the request meets Anthropic’s minimum cacheable threshold (~2,048 tokens). You no longer need to manually addcache_control blocks — Routeway injects them on eligible content (system prompts, long user messages, and tool definitions) before forwarding the request to Anthropic.
We made this the default because most Claude requests that exceed the minimum token threshold benefit from caching, yet many developers were missing out on significant cost savings simply because they weren’t aware of the cache_control parameter. By enabling it automatically, every eligible request gets up to 90% cheaper input token pricing with zero code changes on your side.
Automatic caching applies
cache_control breakpoints to the largest stable content blocks in your request. If you already include explicit cache_control blocks, your configuration takes precedence — Routeway will not override or duplicate them.cache_control blocks for fine-grained control over exactly which content is cached:
- Python (Anthropic SDK)
- Python (OpenAI SDK)
cache_control:What to Cache
Not everything is worth caching. Focus on content that is:Large and stable
Large and stable
System prompts, reference documents, codebases, and knowledge bases that don’t change between requests are ideal cache candidates.Minimum size: Cache only kicks in above ~1,024 tokens (OpenAI) or ~2,048 tokens (Anthropic).
At the beginning of the prompt
At the beginning of the prompt
The cache matches on prefixes. Stable content must come before the dynamic parts (user messages, session data, etc.).
Reused across multiple requests
Reused across multiple requests
Caching is most valuable when many requests share the same prefix. Single-use prompts don’t benefit from caching.
Cache Lifetime
| Provider | Cache TTL | Notes |
|---|---|---|
| OpenAI | ~5–10 minutes | Auto-managed, no manual control |
| Anthropic | ~5 minutes | Refreshed on each cache hit |
Checking Cache Usage
Always log cache usage during development to verify your caching strategy is working:Response Caching
Response caching operates at the Routeway gateway level — before a request ever reaches an upstream provider. When a cache entry exists for an identical request, Routeway returns the stored response immediately. No inference runs, no provider is charged. Repeated identical calls cost $0.Response caching is separate from prompt caching. Prompt caching reduces the cost of processing a request; response caching eliminates the upstream call entirely.
Zero inference cost
Cache hits bypass the provider completely. Repeated identical requests are free — no tokens consumed, no provider charge.
Instant responses
Cached responses return in milliseconds, with no model latency at all.
Best for deterministic workloads
Classification, batch jobs, FAQ lookups, and retries all benefit most — inputs are predictable and repeat frequently.
Per-key configuration
Caching is enabled and configured per API key, so different use cases can have different TTL policies.
Enabling Response Caching
Response caching is configured per API key. When creating or editing a key in the Dashboard, scroll to the Cache Responses section and toggle it on.Open API key settings
Go to Dashboard → API Keys and create a new key or click Edit on an existing one.
Cache Key
Cache entries are scoped to:temperature, max_tokens, or any other field — produces a cache miss and triggers a fresh upstream call.
Per-Request Headers
You can override the key-level cache settings on individual requests using request headers:| Header | Values | Description |
|---|---|---|
X-Cache | true / false | Enable or disable caching for this request, overriding the key default |
X-Cache-TTL | <seconds> | Override the TTL for this request (clamped to 300–86400) |
X-Cache-Clear | true | Invalidate the cached entry for this exact request and force a fresh upstream call |
- Python
- Node.js
- cURL
Forcing a cache refresh
SendX-Cache-Clear: true to bust the cached entry and force a live upstream call. The fresh response is then stored under the same cache key.
Response Headers
Every response includes headers that tell you what the cache did:| Header | Values | Description |
|---|---|---|
X-Cache-Status | HIT / MISS | Whether the response came from cache |
X-Cache-Age | <seconds> | How old the cached response is (only on HIT) |
X-Cache-TTL | <seconds> | Remaining time before the cache entry expires |
When to Use Response Caching
Classification and labelling
Classification and labelling
Spam detection, sentiment analysis, intent routing, and content moderation often receive duplicate inputs. Cache the result the first time and serve it instantly on repeats.
Batch jobs with retries
Batch jobs with retries
If a batch pipeline re-processes records or retries failed items, identical inputs will hit the cache instead of re-running inference and being billed again.
FAQ and knowledge-base lookups
FAQ and knowledge-base lookups
A fixed set of questions (product FAQs, support macros, help-centre entries) will cache on the first request and return for free indefinitely within the TTL.
Development and testing
Development and testing
Avoid burning credits on repeated test calls. Enable caching during dev and set a long TTL so identical prompts are free after the first run.