Tokens & Context - Routeway Docs

Every piece of text you send to or receive from a model is measured in tokens. Understanding tokens helps you estimate costs, avoid context-window errors, and build efficient pipelines.

What Is a Token?

A token is a chunk of text — roughly 3–4 characters or about 0.75 words in English. Tokenization is not simply splitting on spaces; punctuation, subwords, and whitespace each contribute their own tokens.

Text	Approximate tokens
`"Hello, world!"`	4
`"Explain quantum entanglement."`	5
A typical paragraph (100 words)	~130
A full A4 page of text	~500–700
A 10,000-word document	~13,000

Token counts vary slightly by model because each provider uses a different tokenizer. The figures above are approximate. The usage field in every API response gives you the exact counts for that call.
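
The rules of thumb above can be turned into a quick pre-flight estimator. This is a rough sketch only, assuming English text at about 4 characters per token and 0.75 words per token; `estimate_tokens` is a hypothetical helper name, not part of any SDK.

```python
import math

CHARS_PER_TOKEN = 4      # assumption: upper end of the 3-4 character rule of thumb
WORDS_PER_TOKEN = 0.75   # assumption: English prose


def estimate_tokens(text: str) -> int:
    """Return a rough token estimate for English text.

    Takes the larger of a character-based and a word-based estimate,
    so the result errs on the high side.
    """
    if not text:
        return 0
    by_chars = math.ceil(len(text) / CHARS_PER_TOKEN)
    by_words = math.ceil(len(text.split()) / WORDS_PER_TOKEN)
    return max(by_chars, by_words)
```

For `"Hello, world!"` this returns 4, which matches the table. Treat the result as a budgeting aid, not a billing figure.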

Token Types in a Response

The usage object returned with every completion breaks down token usage:

{
  "usage": {
    "prompt_tokens": 312,
    "completion_tokens": 87,
    "total_tokens": 399
  }
}

Field	What it counts
`prompt_tokens`	All tokens in your `messages` array, including system prompts
`completion_tokens`	Tokens generated by the model in this response
`total_tokens`	Sum of the above

On reasoning models, completion_tokens includes internal reasoning tokens. Some responses include a completion_tokens_details breakdown.
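
As an illustration, the sketch below reads these fields from a response that has been parsed into a plain dict. The `reasoning_tokens` key inside `completion_tokens_details` is an assumption based on the OpenAI-style response shape, and `summarize_usage` is a hypothetical helper; check an actual response for the fields your model returns.

```python
def summarize_usage(response: dict) -> dict:
    """Split completion tokens into reasoning and visible output."""
    usage = response.get("usage", {})
    completion = usage.get("completion_tokens", 0)
    # The details object is optional; treat a missing or null value as empty.
    details = usage.get("completion_tokens_details") or {}
    reasoning = details.get("reasoning_tokens", 0)  # assumed field name
    return {
        "prompt": usage.get("prompt_tokens", 0),
        "completion": completion,
        "reasoning": reasoning,
        "visible_output": completion - reasoning,
    }
```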

Context Windows

A model’s context window is the maximum number of tokens it can process in a single request: prompt_tokens plus completion_tokens.

Model	Context window
`gpt-4o-mini`	128,000 tokens
`gpt-4o`	128,000 tokens
`o3`, `o4-mini`	200,000 tokens
`claude-opus-4-5`	200,000 tokens
`gemini-2.5-pro`	1,000,000 tokens

Sending a request that exceeds the context window returns a 400 error. Always leave headroom for the model’s output — if the window is 128K and your prompt is 127K tokens, the model has almost no room to respond.
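
The headroom check can be sketched as a small pre-flight function. The window sizes are copied from the table above; `fits_context` and `max_output_budget` are hypothetical helper names, and the prompt token count is assumed to come from a local estimate.

```python
# Context window sizes, copied from the table above.
CONTEXT_WINDOWS = {
    "gpt-4o-mini": 128_000,
    "gpt-4o": 128_000,
    "o3": 200_000,
    "o4-mini": 200_000,
    "claude-opus-4-5": 200_000,
    "gemini-2.5-pro": 1_000_000,
}


def fits_context(model: str, prompt_tokens: int, max_output_tokens: int) -> bool:
    """Return True if the prompt plus the requested output fits in the window."""
    return prompt_tokens + max_output_tokens <= CONTEXT_WINDOWS[model]


def max_output_budget(model: str, prompt_tokens: int) -> int:
    """Return how many tokens remain for output after the prompt."""
    return max(0, CONTEXT_WINDOWS[model] - prompt_tokens)
```

With a 127,000-token prompt on `gpt-4o`, `max_output_budget` returns 1000, so a request asking for 2,000 output tokens would not fit.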

Controlling Output Length

Use max_tokens to cap how many tokens the model generates. This bounds the cost of each response, but it is a hard cutoff rather than a length target: the model does not plan its answer around the limit.

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize the French Revolution."}],
    max_tokens=200,  # hard cap: generation stops at 200 output tokens
)

If the model hits max_tokens before finishing, finish_reason will be "length" instead of "stop". The response is truncated — check finish_reason in production to detect this.
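
A minimal truncation check might look like the sketch below. It assumes the response has been parsed into a plain dict (for example, from the raw JSON body); with an SDK response object you would read `response.choices[0].finish_reason` instead. `was_truncated` is a hypothetical helper name.

```python
def was_truncated(response: dict) -> bool:
    """Return True if the first choice stopped because it hit max_tokens."""
    choice = response["choices"][0]
    return choice.get("finish_reason") == "length"
```

A caller could log a warning, retry with a larger max_tokens, or ask the model to continue when this returns True.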

Managing Long Conversations

Because the full messages array is sent on every request, conversation costs grow with each turn. For long sessions, use one of these strategies:

Sliding window

Keep only the last N turns in the messages array, always preserving the system message at the start.

MAX_TURNS = 10  # keep last 10 user+assistant pairs

def trim_messages(messages):
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-MAX_TURNS * 2:]  # 2 messages per turn

Summarization

When history gets long, ask the model to summarize it, then replace the old messages with the summary.

summary_response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Summarize this conversation in 3-5 sentences, preserving key facts."},
        *old_messages
    ]
)
summary = summary_response.choices[0].message.content

# Replace history with summary
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "assistant", "content": f"[Conversation summary: {summary}]"},
]

Prompt caching

If your system prompt or context is large and stable across many requests, enable Prompt Caching. The first request processes and caches the prefix; subsequent requests with the same prefix pay a fraction of the cost.

Estimating Costs Before Sending

You can estimate token usage before making a request by counting tokens locally. The tiktoken library implements OpenAI’s tokenizer:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

def count_tokens(messages):
    total = 0
    for msg in messages:
        total += 4  # overhead per message
        for value in msg.values():
            total += len(enc.encode(str(value)))
    total += 2  # priming tokens
    return total

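# Example input; in practice, pass the messages you are about to send
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the French Revolution."},
]

tokens = count_tokens(messages)
print(f"Estimated prompt tokens: {tokens}")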

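This is an approximation: the per-message overhead varies by model, and tiktoken only covers OpenAI tokenizers, so counts for other providers’ models will be further off. For billing purposes, always use the usage values returned in the actual API response.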

Token Cost Summary

Token type	Billing
Prompt tokens	Charged per model’s input rate
Completion tokens	Charged per model’s output rate
Cached prompt tokens	Discounted (typically 50–75% off input rate)
Reasoning tokens	Billed as output tokens on most models
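
The billing rules in this table can be combined into a rough per-request cost estimate. In the sketch below, rates are expressed in dollars per million tokens and are supplied by the caller; the 50% default cache discount is an assumption taken from the low end of the range above, and `estimate_cost` is a hypothetical helper name.

```python
def estimate_cost(
    prompt_tokens: int,
    completion_tokens: int,
    input_rate: float,
    output_rate: float,
    cached_tokens: int = 0,
    cache_discount: float = 0.5,  # assumption: 50% off the input rate
) -> float:
    """Estimate the cost of one request in dollars.

    Rates are per one million tokens. cached_tokens is the part of the
    prompt served from cache. Reasoning tokens are assumed to be counted
    in completion_tokens already.
    """
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached = prompt_tokens - cached_tokens
    cost = (
        uncached * input_rate
        + cached_tokens * input_rate * (1 - cache_discount)
        + completion_tokens * output_rate
    )
    return cost / 1_000_000
```

Pass the real per-model rates from the Models page; the function does not look them up.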

See the Models page for per-model rates and the Billing page to check your current balance and usage.
