Every piece of text you send to or receive from a model is measured in tokens. Understanding tokens helps you estimate costs, avoid context-window errors, and build efficient pipelines.

What Is a Token?

A token is a chunk of text — roughly 3–4 characters or about 0.75 words in English. Tokenization is not simply splitting on spaces; punctuation, subwords, and whitespace each contribute their own tokens.
| Text | Approximate tokens |
| --- | --- |
| "Hello, world!" | 4 |
| "Explain quantum entanglement." | 5 |
| A typical paragraph (100 words) | ~130 |
| A full A4 page of text | ~500–700 |
| A 10,000-word document | ~13,000 |
Token counts vary slightly by model because each provider uses a different tokenizer. The figures above are approximate. The usage field in every API response gives you the exact counts for that call.
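For a quick sanity check before reaching for a real tokenizer, the rule of thumb above can be sketched in a few lines. The function name `rough_token_estimate` and the choice to average the two heuristics are assumptions made for illustration, not part of any API, and the result only applies to English-like text.

```python
def rough_token_estimate(text: str) -> int:
    """Rough estimate from ~4 characters per token and ~0.75 words per token."""
    by_chars = len(text) / 4              # upper end of the 3 to 4 characters rule
    by_words = len(text.split()) / 0.75   # about 0.75 words per token
    # Average the two heuristics; neither replaces a real tokenizer.
    return round((by_chars + by_words) / 2)


paragraph = " ".join(["token"] * 100)  # a synthetic 100-word paragraph
estimate = rough_token_estimate(paragraph)
```

Treat the result as a ballpark figure for budgeting only; the usage field remains the source of truth.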

Token Types in a Response

The usage object returned with every completion breaks down token usage:
{
  "usage": {
    "prompt_tokens": 312,
    "completion_tokens": 87,
    "total_tokens": 399
  }
}
| Field | What it counts |
| --- | --- |
| prompt_tokens | All tokens in your messages array, including system prompts |
| completion_tokens | Tokens generated by the model in this response |
| total_tokens | Sum of the above |
On reasoning models, completion_tokens includes internal reasoning tokens. Some responses include a completion_tokens_details breakdown.
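A minimal sketch of reading these fields from a parsed response body follows. It works on a plain dictionary shaped like the example above; the helper name `summarize_usage` is hypothetical, and the `reasoning_tokens` key inside `completion_tokens_details` is an assumption that may not hold for every model or provider.

```python
def summarize_usage(usage: dict) -> dict:
    """Extract token counts from a usage object and sanity-check the total."""
    prompt = usage["prompt_tokens"]
    completion = usage["completion_tokens"]
    total = usage.get("total_tokens", prompt + completion)
    # The breakdown is optional; "reasoning_tokens" is an assumed key name.
    details = usage.get("completion_tokens_details") or {}
    reasoning = details.get("reasoning_tokens", 0)
    return {
        "prompt": prompt,
        "completion": completion,
        "total": total,
        "reasoning": reasoning,
        "visible_completion": completion - reasoning,
        "consistent": total == prompt + completion,
    }


usage = {"prompt_tokens": 312, "completion_tokens": 87, "total_tokens": 399}
report = summarize_usage(usage)
```

Logging a summary like this per request makes it easier to spot calls where reasoning tokens dominate the output count.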

Context Windows

A model’s context window is the maximum number of tokens it can process in a single request: prompt_tokens plus completion_tokens.
| Model | Context window |
| --- | --- |
| gpt-4o-mini | 128,000 tokens |
| gpt-4o | 128,000 tokens |
| o3, o4-mini | 200,000 tokens |
| claude-opus-4-5 | 200,000 tokens |
| gemini-2.5-pro | 1,000,000 tokens |
Sending a request that exceeds the context window returns a 400 error. Always leave headroom for the model’s output — if the window is 128K and your prompt is 127K tokens, the model has almost no room to respond.
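The headroom check described above can be sketched as a pre-flight test. The dictionary copies the window sizes from the table in this section; the names `CONTEXT_WINDOWS`, `output_headroom`, and `fits` are hypothetical helpers, not part of any SDK.

```python
# Context window sizes copied from the table in this section.
CONTEXT_WINDOWS = {
    "gpt-4o-mini": 128_000,
    "gpt-4o": 128_000,
    "o3": 200_000,
    "o4-mini": 200_000,
    "claude-opus-4-5": 200_000,
    "gemini-2.5-pro": 1_000_000,
}


def output_headroom(model: str, prompt_tokens: int) -> int:
    """Tokens left for the model's output once the prompt is counted."""
    return CONTEXT_WINDOWS[model] - prompt_tokens


def fits(model: str, prompt_tokens: int, max_tokens: int) -> bool:
    """True if the prompt plus the requested output cap fits in the window."""
    return prompt_tokens + max_tokens <= CONTEXT_WINDOWS[model]


remaining = output_headroom("gpt-4o", 127_000)
```

Running a check like this before sending lets you trim or summarize the prompt instead of handling a 400 error after the fact.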

Controlling Output Length

Use max_tokens to cap how many tokens the model generates. This prevents runaway costs and enforces response length for your use case.
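from openai import OpenAI

# Assumes the OpenAI Python SDK; set base_url and api_key for your provider.
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize the French Revolution."}],
    max_tokens=200,  # hard cap: generation stops once 200 output tokens are produced
)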
If the model hits max_tokens before finishing, finish_reason will be "length" instead of "stop". The response is truncated — check finish_reason in production to detect this.
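One way to detect truncation is a small helper that inspects finish_reason on the first choice. So that the sketch runs offline, it uses a stand-in object built with `SimpleNamespace` in place of a real response; the helper name `was_truncated` and the sample data are assumptions for illustration.

```python
from types import SimpleNamespace


def was_truncated(response) -> bool:
    """True if the first choice stopped because it hit max_tokens."""
    return response.choices[0].finish_reason == "length"


# Stand-in for a real API response (hypothetical data).
fake_response = SimpleNamespace(
    choices=[
        SimpleNamespace(
            finish_reason="length",
            message=SimpleNamespace(content="The French Revolution began in"),
        )
    ]
)

if was_truncated(fake_response):
    print("Response was cut off; consider raising max_tokens or retrying.")
```

In production, the same check could decide whether to retry with a larger cap or to mark the output as incomplete for the user.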

Managing Long Conversations

Because the full messages array is sent on every request, conversation costs grow with each turn. For long sessions, use one of these strategies:
Keep only the last N turns in the messages array, always preserving the system message at the start.
MAX_TURNS = 10  # keep last 10 user+assistant pairs

def trim_messages(messages):
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-MAX_TURNS * 2:]  # 2 messages per turn
When history gets long, ask the model to summarize it, then replace the old messages with the summary.
summary_response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Summarize this conversation in 3-5 sentences, preserving key facts."},
        *old_messages
    ]
)
summary = summary_response.choices[0].message.content

# Replace history with summary
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "assistant", "content": f"[Conversation summary: {summary}]"},
]
If your system prompt or context is large and stable across many requests, enable Prompt Caching. The first request processes and caches the prefix; subsequent requests with the same prefix pay a fraction of the cost.
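To see why these strategies matter, here is a back-of-the-envelope sketch of how prompt tokens accumulate over a session. It assumes, purely for illustration, that every turn adds the same number of tokens and that the system prompt is negligible; the function name and its parameters are hypothetical.

```python
def cumulative_prompt_tokens(turns, tokens_per_turn, max_turns=None):
    """Total prompt tokens sent across a whole session.

    On turn k the request carries k turns of history, or at most
    max_turns turns when a sliding window is applied.
    """
    total = 0
    for k in range(1, turns + 1):
        history = k if max_turns is None else min(k, max_turns)
        total += history * tokens_per_turn
    return total


full = cumulative_prompt_tokens(50, 200)                     # no trimming
windowed = cumulative_prompt_tokens(50, 200, max_turns=10)   # 10-turn window
```

Under these assumptions, a 50-turn session sends 255,000 prompt tokens in total without trimming and 91,000 with a 10-turn window. Without trimming the total grows quadratically with the number of turns; with a window it grows linearly once the window is full.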

Estimating Costs Before Sending

You can estimate token usage before making a request by counting tokens locally. The tiktoken library implements OpenAI’s tokenizer:
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

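def count_tokens(messages):
    """Approximate the prompt token count for a list of chat messages."""
    total = 0
    for msg in messages:
        total += 3  # approximate framing overhead per message on recent chat models
        for value in msg.values():
            # Counts both the role and the content; assumes plain-text values
            total += len(enc.encode(str(value)))
    total += 3  # approximate tokens that prime the assistant's reply
    return total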

tokens = count_tokens(messages)
print(f"Estimated prompt tokens: {tokens}")
This is an approximation. For billing purposes, always use the usage values returned in the actual API response.

Token Cost Summary

| Token type | Billing |
| --- | --- |
| Prompt tokens | Charged per model’s input rate |
| Completion tokens | Charged per model’s output rate |
| Cached prompt tokens | Discounted (typically 50–75% off input rate) |
| Reasoning tokens | Billed as output tokens on most models |
See the Models page for per-model rates and the Billing page to check your current balance and usage.
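The billing rules in the table can be combined into a rough cost estimate. The sketch below assumes rates quoted per 1M tokens and a flat cache discount; the function name, its parameters, and the rates in the example are all hypothetical placeholders, so substitute the real figures from the Models page.

```python
def estimate_cost(prompt_tokens, completion_tokens, input_rate, output_rate,
                  cached_tokens=0, cache_discount=0.5):
    """Estimated cost in the same currency unit as the rates.

    Rates are per 1M tokens. cached_tokens is the part of the prompt
    served from cache, billed at (1 - cache_discount) of the input rate.
    Reasoning tokens need no separate term here because they are already
    included in completion_tokens.
    """
    uncached = prompt_tokens - cached_tokens
    input_cost = uncached * input_rate
    cached_cost = cached_tokens * input_rate * (1 - cache_discount)
    output_cost = completion_tokens * output_rate
    return (input_cost + cached_cost + output_cost) / 1_000_000


# Hypothetical rates: 1.00 per 1M input tokens, 4.00 per 1M output tokens.
cost = estimate_cost(312, 87, input_rate=1.00, output_rate=4.00)
```

Because output rates are usually higher than input rates, capping max_tokens often has a larger effect on cost than shaving a few tokens off the prompt.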