Prompt Caching Cost Controls for AI Agents [Guide]
Bottom Line
Prompt caching is not a generic CDN for prompts. Treat it as a prefix-stability contract: keep shared agent instructions identical, isolate tenant-specific context, and verify savings with cached-token telemetry.
Key Takeaways
- Cache hits require identical prompt prefixes; put tenant-specific data at the end.
- OpenAI caching starts at 1024+ prompt tokens and reports cached_tokens in usage.
- Use prompt_cache_key granularity to group shared prefixes without mixing tenant context.
- Track hit rate, cached-token ratio, and per-tenant savings before changing prompts.
Prompt caching can turn repetitive agent context from a recurring tax into a measurable cost control, but multi-tenant systems need stricter boundaries than single-user demos. As of June 18, 2026, OpenAI documents automatic caching for eligible prompts, Anthropic exposes explicit cache breakpoints, and Gemini supports implicit and explicit context caching. This tutorial builds a provider-aware pattern for tenant-safe cache keys, stable prompt prefixes, cost telemetry, and regression checks.
Prerequisites
Bottom Line
Caching only pays when the beginning of the prompt stays byte-for-byte stable. Put shared agent scaffolding first, push tenant and user variability last, and prove savings from usage fields before counting them in margin forecasts.
Prerequisites box
- A multi-tenant agent service with per-tenant billing or budgets.
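- An LLM provider that reports cached-token usage, such as OpenAI cached_tokens (under usage.prompt_tokens_details in Chat Completions, or usage.input_tokens_details in the Responses API) or Anthropic cache_read_input_tokens.
- A stable system prompt, tool schema, policy block, or retrieval summary long enough to reach the provider's caching threshold (1024 prompt tokens for OpenAI automatic caching).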
- Server-side logging for tenant ID, model, prompt version, latency, and token usage.
Official docs are the source of truth for provider behavior. OpenAI states that Prompt Caching works automatically on recent models, starts at 1024 or more prompt tokens, and can reduce cached input costs by up to 90%. Anthropic requires or supports cache_control breakpoints depending on mode, and Gemini separates implicit caching from explicit caches with a TTL.
1. Design Cache Boundaries
In a single-tenant agent, you can often cache the entire policy, tool list, and workspace context. In a multi-tenant agent, the first question is not "what can be cached?" It is "what can be reused without crossing tenant boundaries?"
Split the prompt into three zones
- Global stable prefix: agent role, safety policy, response format, tool schemas, and examples that are identical for every tenant.
- Tenant stable prefix: tenant plan, enabled tools, organization policy, and approved knowledge summary that are stable for one tenant only.
- Request suffix: user message, current timestamp, trace ID, retrieved private records, and any short-lived context.
The cache-friendly structure is simple: global first, tenant-stable second, request-specific last. Before moving production data into prompts, run representative payloads through a privacy pass. TechBytes' Data Masking Tool is useful for scrubbing example logs before they become prompt fixtures or test cases.
2. Implement Cached Agent Calls
The core implementation is a prompt builder that returns stable strings in stable order. Do not let application objects serialize themselves unpredictably. Sort keys, version your prompt, and keep the cache key coarse enough to hit but narrow enough to avoid tenant confusion.
Create a stable prompt builder
// Recursively rebuild objects with sorted keys so serialization is deterministic.
// Array order is preserved, so pass lists such as enabledTools in a fixed order.
function stableJson(value) {
  if (Array.isArray(value)) return value.map(stableJson);
  if (value && typeof value === 'object') {
    return Object.fromEntries(
      Object.keys(value).sort().map((key) => [key, stableJson(value[key])])
    );
  }
  return value;
}

function buildAgentInput({ tenant, userMessage, retrievedFacts }) {
  // Zone 1: identical for every tenant.
  const globalPrefix = [
    'You are Acme Support Agent v4.',
    'Follow the enterprise support policy exactly.',
    'Return JSON with fields: answer, citations, escalation_required.',
    'Tool schema version: support-tools-2026-06-18.'
  ].join('\n');

  // Zone 2: stable for one tenant only.
  const tenantPrefix = JSON.stringify(stableJson({
    tenant_id: tenant.id,
    plan: tenant.plan,
    enabled_tools: tenant.enabledTools,
    policy_version: tenant.policyVersion
  }));

  // Zone 3: changes on every request, so it goes last.
  const requestSuffix = JSON.stringify(stableJson({
    user_message: userMessage,
    retrieved_facts: retrievedFacts
  }));

  return `${globalPrefix}\n\nTENANT_CONTEXT:\n${tenantPrefix}\n\nREQUEST:\n${requestSuffix}`;
}

Send a cache-aware request
For OpenAI-style automatic prompt caching, the important controls are prompt shape, a consistent prompt_cache_key, and optional prompt_cache_retention where supported. The following server-side call uses Responses API fields documented by OpenAI.
async function callAgent({ tenant, userMessage, retrievedFacts }) {
  const input = buildAgentInput({ tenant, userMessage, retrievedFacts });
  const response = await fetch('https://api.openai.com/v1/responses', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      model: 'gpt-5',
      input,
      prompt_cache_key: `agent:v4:tenant:${tenant.id}:policy:${tenant.policyVersion}`,
      prompt_cache_retention: '24h'
    })
  });
  if (!response.ok) throw new Error(await response.text());
  return response.json();
}

If your model or organization defaults differ, remove prompt_cache_retention and rely on provider defaults. For providers with explicit cache controls, place the breakpoint at the end of the last stable block, not on the user message.
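For a provider with explicit breakpoints, the placement rule can be sketched as follows. This is a minimal sketch, assuming Anthropic's Messages API request shape, where system is a list of text blocks and cache_control with type 'ephemeral' marks the end of a cacheable prefix. The helper name buildAnthropicBody, its parameters, and the maxTokens default are illustrative, and the model name is left to the caller.

```javascript
// Hypothetical helper: builds a Messages API request body with a single cache
// breakpoint on the last stable block (the tenant prefix).
function buildAnthropicBody({ model, globalPrefix, tenantPrefix, requestSuffix, maxTokens = 1024 }) {
  return {
    model,
    max_tokens: maxTokens,
    system: [
      { type: 'text', text: globalPrefix },
      {
        type: 'text',
        text: `TENANT_CONTEXT:\n${tenantPrefix}`,
        // The prefix up to and including this block is what the breakpoint covers.
        cache_control: { type: 'ephemeral' }
      }
    ],
    // Request-specific content sits after the breakpoint and is never marked.
    messages: [{ role: 'user', content: requestSuffix }]
  };
}
```

The body would be posted to the provider's messages endpoint in the same way callAgent posts to OpenAI. Explicit caching has its own minimum cacheable length and retention rules, so check the provider documentation before relying on a hit.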
3. Verification And Expected Output
Run the same tenant and prompt version twice, changing only the request suffix. The first request should create or miss the cache. The second should show cached prompt tokens if the prefix is long enough and still resident.
Log the cache telemetry
function cacheStats(providerResponse) {
  const usage = providerResponse.usage || {};
  // The Responses API reports input_tokens and input_tokens_details;
  // Chat Completions reports prompt_tokens and prompt_tokens_details.
  const promptTokens = usage.input_tokens ?? usage.prompt_tokens ?? 0;
  const details = usage.input_tokens_details || usage.prompt_tokens_details || {};
  const cachedTokens = details.cached_tokens || 0;
  return {
    prompt_tokens: promptTokens,
    cached_tokens: cachedTokens,
    total_tokens: usage.total_tokens,
    cache_ratio: promptTokens
      ? Number((cachedTokens / promptTokens).toFixed(3))
      : 0
  };
}

console.log(cacheStats(response));

Expected output
// First request after a prompt version change
{ prompt_tokens: 2450, cached_tokens: 0, total_tokens: 2740, cache_ratio: 0 }

// Second request with the same stable prefix
{ prompt_tokens: 2488, cached_tokens: 1920, total_tokens: 2801, cache_ratio: 0.772 }

In production, track cache behavior as a first-class cost metric:
- Cache hit ratio: percentage of eligible calls with cached tokens greater than zero.
- Cached-token ratio: cached prompt tokens divided by total prompt tokens.
- Tenant savings: baseline uncached input cost minus actual cached input cost.
- Latency delta: median and p95 latency for hits versus misses.
- Prompt churn: number of prompt version changes per deployment.
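The first three metrics in this list can be rolled up per tenant from logged calls. The sketch below assumes each record has the shape { tenantId, prompt_tokens, cached_tokens } and that records are already filtered to cache-eligible calls. The function name summarizeCache and the pricing parameters inputPricePerMTok and cachedDiscount are hypothetical; take real values from your provider's price list.

```javascript
// Hypothetical rollup: per-tenant hit ratio, cached-token ratio, and savings.
function summarizeCache(records, { inputPricePerMTok, cachedDiscount }) {
  const byTenant = {};
  for (const r of records) {
    const t = byTenant[r.tenantId] || { calls: 0, hits: 0, prompt: 0, cached: 0 };
    t.calls += 1;
    if (r.cached_tokens > 0) t.hits += 1;
    t.prompt += r.prompt_tokens;
    t.cached += r.cached_tokens;
    byTenant[r.tenantId] = t;
  }

  const perToken = inputPricePerMTok / 1e6;
  return Object.fromEntries(Object.entries(byTenant).map(([tenantId, t]) => {
    // Baseline: every prompt token billed at the full input price.
    const baselineCost = t.prompt * perToken;
    // Actual: cached tokens billed at (1 - cachedDiscount) of the input price.
    const actualCost =
      (t.prompt - t.cached) * perToken + t.cached * perToken * (1 - cachedDiscount);
    return [tenantId, {
      hit_ratio: t.hits / t.calls,
      cached_token_ratio: t.prompt ? t.cached / t.prompt : 0,
      savings: baselineCost - actualCost
    }];
  }));
}
```

Keeping the discount as a parameter avoids hard-coding a provider's pricing into the metric, which matters once more than one provider or model is in use.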
Do not expect cached tokens to reduce output-token charges or provider rate-limit accounting. OpenAI documentation notes that prompt caching does not change generated output and cached prompts still contribute to token-per-minute limits.
4. Troubleshooting Top 3
1. cached_tokens stays at zero
- Confirm the full prompt is at least 1024 tokens for OpenAI automatic caching.
- Check that stable content is at the beginning, not after the user request.
- Compare two serialized prompts and find the first differing character.
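The prompt comparison in the last check can be automated with a small helper. This is a sketch: the names firstDiffIndex and describeDiff are illustrative, and the comparison works on characters, which only approximates the token-level prefix match the provider performs.

```javascript
// Returns the index of the first differing character, or -1 if the strings match.
function firstDiffIndex(a, b) {
  const limit = Math.min(a.length, b.length);
  for (let i = 0; i < limit; i++) {
    if (a[i] !== b[i]) return i;
  }
  // One string is a strict prefix of the other, or they are identical.
  return a.length === b.length ? -1 : limit;
}

// Hypothetical helper: returns a short window around the divergence for logs.
function describeDiff(a, b, context = 20) {
  const index = firstDiffIndex(a, b);
  if (index === -1) return null;
  const start = Math.max(0, index - context);
  return {
    index,
    left: a.slice(start, index + context),
    right: b.slice(start, index + context)
  };
}
```

If the reported index falls inside the global or tenant prefix, something volatile has leaked into a zone that is supposed to be stable.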
2. Hit rate drops after a deploy
- Verify that the prompt version, tool schema order, and response schema are unchanged.
- Look for new timestamps, feature flags, request IDs, or unsorted JSON in the prefix.
- Roll cache metrics up by prompt version so expected misses after releases are visible.
3. One tenant saves money, another does not
- Check traffic cadence; short-lived in-memory caches need repeated requests within the retention window.
- Group by prompt_cache_key to see whether keys are too granular.
- Validate tenant-specific policy size; small tenants may not reach the provider's cache threshold.
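The key-granularity check can be sketched as a small report over logged calls. It assumes each record stores the prompt_cache_key sent with the request and the cached_tokens the provider reported. The function name keyGranularity and the output field names are hypothetical.

```javascript
// Hypothetical report: how thinly traffic is spread across cache keys.
function keyGranularity(records) {
  const perKey = new Map();
  for (const r of records) {
    const entry = perKey.get(r.prompt_cache_key) || { calls: 0, hits: 0 };
    entry.calls += 1;
    if (r.cached_tokens > 0) entry.hits += 1;
    perKey.set(r.prompt_cache_key, entry);
  }

  const entries = [...perKey.values()];
  return {
    distinct_keys: perKey.size,
    calls_per_key: perKey.size ? records.length / perKey.size : 0,
    // Keys used by a single call had no later request that could reuse them.
    single_use_keys: entries.filter((e) => e.calls === 1).length,
    keys_without_hits: entries.filter((e) => e.hits === 0).length
  };
}
```

A high share of single-use keys suggests the key includes something that changes per request or per session, and is a candidate for coarsening.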
What's Next
After basic caching works, turn it into a budget guardrail. Add CI checks that diff the first 1500 tokens of generated prompts, alert when cached-token ratio falls below a tenant-specific threshold, and expose savings in your internal billing dashboard.
The next engineering step is provider abstraction, but keep it thin. Normalize telemetry fields into cached_input_tokens, cache_write_tokens, cache_read_tokens, and cache_retention. Keep provider-specific behavior in adapters because OpenAI automatic routing, Anthropic breakpoints, and Gemini explicit TTLs are not the same system.
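A thin normalization layer along these lines might look like the sketch below. The adapter registry and the normalizeUsage name are hypothetical. The provider-side field names are assumptions based on each provider's documented usage object, and cache_retention is omitted because it is request configuration rather than something read from the response.

```javascript
// Hypothetical adapters: map provider usage objects to normalized fields.
const usageAdapters = {
  openai(usage) {
    // Responses API uses input_tokens_details; Chat Completions uses prompt_tokens_details.
    const details = usage.input_tokens_details || usage.prompt_tokens_details || {};
    const cached = details.cached_tokens || 0;
    return {
      input_tokens: usage.input_tokens ?? usage.prompt_tokens ?? 0,
      cached_input_tokens: cached,
      cache_read_tokens: cached,
      cache_write_tokens: null // assumed: cache writes are not reported separately
    };
  },
  anthropic(usage) {
    const read = usage.cache_read_input_tokens || 0;
    const write = usage.cache_creation_input_tokens || 0;
    return {
      // Assumed: input_tokens excludes cache reads and writes, so add them back.
      input_tokens: (usage.input_tokens || 0) + read + write,
      cached_input_tokens: read,
      cache_read_tokens: read,
      cache_write_tokens: write
    };
  }
};

function normalizeUsage(provider, usage) {
  if (!Object.prototype.hasOwnProperty.call(usageAdapters, provider)) {
    throw new Error(`No usage adapter for provider: ${provider}`);
  }
  return usageAdapters[provider](usage || {});
}
```

Unknown providers fail loudly rather than silently reporting zero cached tokens, which would otherwise look like a cache regression on a dashboard.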
Finally, treat cached prefixes like production API contracts. Review them for privacy, version them deliberately, and test them with realistic tenant fixtures before every agent release.
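Frequently Asked Questions
How do I make prompt caching work for multi-tenant AI agents?
Order the prompt as global prefix, tenant-stable prefix, then request suffix, and use a prompt_cache_key or equivalent routing key so reusable prefixes are grouped without sharing private tenant context.
Why are cached_tokens zero even when I repeat the same prompt?
The prompt may be under the 1024-token threshold for OpenAI automatic caching, or the prefix may differ between calls. Compare the two serialized prompts, fix the first differing character, and re-check cached_tokens.
Can cached prompts leak data between tenants?
The layout in this guide is meant to prevent that: only content identical for every tenant belongs in the global prefix, tenant-stable context is keyed per tenant, and private records stay in the request suffix. Review cached prefixes for privacy before each release.
Does prompt caching reduce output token cost?
No. Caching discounts cached input tokens only; output-token charges and provider rate-limit accounting are unchanged.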