AI Engineering

Prompt Caching Cost Controls for AI Agents [Guide]

Dillip Chowdary
Tech Entrepreneur & Innovator · June 18, 2026 · 6 min read

Bottom Line

Prompt caching is not a generic CDN for prompts. Treat it as a prefix-stability contract: keep shared agent instructions identical, isolate tenant-specific context, and verify savings with cached-token telemetry.

Key Takeaways

  • Cache hits require identical prompt prefixes; put tenant-specific data at the end.
  • OpenAI caching starts at 1024+ prompt tokens and reports cached_tokens in usage.
  • Use prompt_cache_key granularity to group shared prefixes without mixing tenant context.
  • Track hit rate, cached-token ratio, and per-tenant savings before changing prompts.

Prompt caching can turn repetitive agent context from a recurring tax into a measurable cost control, but multi-tenant systems need stricter boundaries than single-user demos. As of June 18, 2026, OpenAI documents automatic caching for eligible prompts, Anthropic exposes explicit cache breakpoints, and Gemini supports implicit and explicit context caching. This tutorial builds a provider-aware pattern for tenant-safe cache keys, stable prompt prefixes, cost telemetry, and regression checks.

Prerequisites

Bottom Line

Caching only pays when the beginning of the prompt stays byte-for-byte stable. Put shared agent scaffolding first, push tenant and user variability last, and prove savings from usage fields before counting them in margin forecasts.


  • A multi-tenant agent service with per-tenant billing or budgets.
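  • An LLM provider that reports cached-token usage, such as OpenAI cached_tokens (under usage.prompt_tokens_details in Chat Completions, or usage.input_tokens_details in the Responses API) or Anthropic cache_read_input_tokens.
  • A stable system prompt, tool schema, policy block, or retrieval summary long enough that the full prompt clears the provider's minimum cacheable size (1024 tokens for OpenAI).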
  • Server-side logging for tenant ID, model, prompt version, latency, and token usage.

Official docs are the source of truth for provider behavior. OpenAI states that Prompt Caching works automatically on recent models, starts at 1024 or more prompt tokens, and can reduce cached input costs by up to 90%. Anthropic requires or supports cache_control breakpoints depending on mode, and Gemini separates implicit caching from explicit caches with a TTL.

1. Design Cache Boundaries

In a single-tenant agent, you can often cache the entire policy, tool list, and workspace context. In a multi-tenant agent, the first question is not "what can be cached?" It is "what can be reused without crossing tenant boundaries?"

Split the prompt into three zones

  1. Global stable prefix: agent role, safety policy, response format, tool schemas, and examples that are identical for every tenant.
  2. Tenant stable prefix: tenant plan, enabled tools, organization policy, and approved knowledge summary that are stable for one tenant only.
  3. Request suffix: user message, current timestamp, trace ID, retrieved private records, and any short-lived context.

The cache-friendly structure is simple: global first, tenant-stable second, request-specific last. Before moving production data into prompts, run representative payloads through a privacy pass. TechBytes' Data Masking Tool is useful for scrubbing example logs before they become prompt fixtures or test cases.

Watch out: A timestamp, nonce, random JSON key order, or tenant-specific value near the top of the prompt can invalidate the entire reusable prefix.
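One way to catch this class of regression early is a small lint over the prefix before it ships. The sketch below is a hypothetical helper (the name findVolatileValues and the three patterns are assumptions, not part of any provider SDK), and the patterns are illustrative rather than exhaustive.

```javascript
// Hypothetical lint: flag values in a prompt prefix that usually change per request.
// The pattern list is an illustrative assumption; extend it for your own payloads.
const VOLATILE_PATTERNS = [
  { name: 'iso_timestamp', regex: /\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}/ },
  { name: 'uuid', regex: /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/i },
  { name: 'epoch_millis', regex: /\b1\d{12}\b/ }
];

function findVolatileValues(prefix) {
  const findings = [];
  for (const { name, regex } of VOLATILE_PATTERNS) {
    // Regexes have no global flag, so exec() is stateless and returns the first match.
    const match = regex.exec(prefix);
    if (match) findings.push({ name, index: match.index, value: match[0] });
  }
  return findings;
}
```

Running it against the global and tenant-stable zones only, and failing the build on any finding, keeps request-scoped values in the suffix where they belong.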

2. Implement Cached Agent Calls

The core implementation is a prompt builder that returns stable strings in stable order. Do not let application objects serialize themselves unpredictably. Sort keys, version your prompt, and keep the cache key coarse enough to hit but narrow enough to avoid tenant confusion.

Create a stable prompt builder

function stableJson(value) {
  if (Array.isArray(value)) return value.map(stableJson);
  if (value && typeof value === 'object') {
    return Object.fromEntries(
      Object.keys(value).sort().map((key) => [key, stableJson(value[key])])
    );
  }
  return value;
}

function buildAgentInput({ tenant, userMessage, retrievedFacts }) {
  const globalPrefix = [
    'You are Acme Support Agent v4.',
    'Follow the enterprise support policy exactly.',
    'Return JSON with fields: answer, citations, escalation_required.',
    'Tool schema version: support-tools-2026-06-18.'
  ].join('\n');

  const tenantPrefix = JSON.stringify(stableJson({
    tenant_id: tenant.id,
    plan: tenant.plan,
    enabled_tools: tenant.enabledTools,
    policy_version: tenant.policyVersion
  }));

  const requestSuffix = JSON.stringify(stableJson({
    user_message: userMessage,
    retrieved_facts: retrievedFacts
  }));

  return `${globalPrefix}\n\nTENANT_CONTEXT:\n${tenantPrefix}\n\nREQUEST:\n${requestSuffix}`;
}

Send a cache-aware request

For OpenAI-style automatic prompt caching, the important controls are prompt shape, a consistent prompt_cache_key, and optional prompt_cache_retention where supported. The following server-side call uses Responses API fields documented by OpenAI.

async function callAgent({ tenant, userMessage, retrievedFacts }) {
  const input = buildAgentInput({ tenant, userMessage, retrievedFacts });

  const response = await fetch('https://api.openai.com/v1/responses', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      model: 'gpt-5',
      input,
      prompt_cache_key: `agent:v4:tenant:${tenant.id}:policy:${tenant.policyVersion}`,
      prompt_cache_retention: '24h'
    })
  });

  if (!response.ok) throw new Error(await response.text());
  return response.json();
}

If your model or organization defaults differ, remove prompt_cache_retention and rely on provider defaults. For providers with explicit cache controls, place the breakpoint at the end of the last stable block, not on the user message.
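For the explicit-breakpoint case, a minimal sketch of an Anthropic Messages API request body might look like the following. It assumes the cache_control field with type 'ephemeral' on a system text block, which is how Anthropic's prompt caching documentation describes breakpoints at the time of writing; buildAnthropicBody and its parameter names are hypothetical, and the model ID is left to the caller. Nothing here touches the network.

```javascript
// Sketch: build an Anthropic-style request body with a single cache breakpoint.
// Field names are assumed from Anthropic's prompt caching docs; verify before use.
function buildAnthropicBody({ model, globalPrefix, tenantPrefix, requestSuffix, maxTokens = 1024 }) {
  return {
    model,
    max_tokens: maxTokens,
    system: [
      { type: 'text', text: globalPrefix },
      {
        type: 'text',
        text: `TENANT_CONTEXT:\n${tenantPrefix}`,
        // Breakpoint on the last stable block: the prefix up to here is cacheable.
        cache_control: { type: 'ephemeral' }
      }
    ],
    // Request-specific content sits after the breakpoint and is never marked.
    messages: [{ role: 'user', content: requestSuffix }]
  };
}
```

Keeping the breakpoint on the tenant block means a change to the user message never invalidates the cached portion.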

3. Verification And Expected Output

Run the same tenant and prompt version twice, changing only the request suffix. The first request should create or miss the cache. The second should show cached prompt tokens if the prefix is long enough and still resident.

Log the cache telemetry

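function cacheStats(providerResponse) {
  const usage = providerResponse.usage || {};
  // The Responses API reports input_tokens and input_tokens_details;
  // Chat Completions reports prompt_tokens and prompt_tokens_details.
  const promptTokens = usage.input_tokens ?? usage.prompt_tokens ?? 0;
  const details = usage.input_tokens_details || usage.prompt_tokens_details || {};
  const cachedTokens = details.cached_tokens || 0;

  return {
    prompt_tokens: promptTokens,
    cached_tokens: cachedTokens,
    total_tokens: usage.total_tokens,
    cache_ratio: promptTokens
      ? Number((cachedTokens / promptTokens).toFixed(3))
      : 0
  };
}

// `response` is the parsed JSON body returned by callAgent(), not the raw fetch Response.
console.log(cacheStats(response));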

Expected output

// First request after a prompt version change
{ prompt_tokens: 2450, cached_tokens: 0, total_tokens: 2740, cache_ratio: 0 }

// Second request with the same stable prefix
{ prompt_tokens: 2488, cached_tokens: 1920, total_tokens: 2801, cache_ratio: 0.772 }

In production, track cache behavior as a first-class cost metric:

  • Cache hit ratio: percentage of eligible calls with cached tokens greater than zero.
  • Cached-token ratio: cached prompt tokens divided by total prompt tokens.
  • Tenant savings: baseline uncached input cost minus actual cached input cost.
  • Latency delta: median and p95 latency for hits versus misses.
  • Prompt churn: number of prompt version changes per deployment.
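The first three metrics above can be sketched as a single aggregation over logged cacheStats() records. This is an illustration only: summarizeCache is a hypothetical name, and inputPricePerMTok and cachedDiscount are placeholder inputs that should come from your provider's current price sheet rather than from this example.

```javascript
// Sketch: summarize cache metrics from records shaped like cacheStats() output.
// inputPricePerMTok and cachedDiscount are hypothetical inputs, not real prices.
function summarizeCache(records, { inputPricePerMTok, cachedDiscount, minTokens = 1024 }) {
  // Only prompts at or above the provider threshold can hit the cache at all.
  const eligible = records.filter((r) => r.prompt_tokens >= minTokens);
  const hits = eligible.filter((r) => r.cached_tokens > 0).length;
  const promptTokens = records.reduce((sum, r) => sum + r.prompt_tokens, 0);
  const cachedTokens = records.reduce((sum, r) => sum + r.cached_tokens, 0);

  // Baseline: every input token billed at the full rate.
  const baselineCost = (promptTokens / 1e6) * inputPricePerMTok;
  // Actual: uncached tokens at full rate, cached tokens at the discounted rate.
  const actualCost =
    ((promptTokens - cachedTokens) / 1e6) * inputPricePerMTok +
    (cachedTokens / 1e6) * inputPricePerMTok * (1 - cachedDiscount);

  return {
    hit_ratio: eligible.length ? hits / eligible.length : 0,
    cached_token_ratio: promptTokens ? cachedTokens / promptTokens : 0,
    savings: baselineCost - actualCost
  };
}
```

Computing the summary per tenant, rather than globally, is what makes the savings figure usable for per-tenant budgets.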

Do not expect cached tokens to reduce output-token charges or provider rate-limit accounting. OpenAI documentation notes that prompt caching does not change generated output and cached prompts still contribute to token-per-minute limits.

4. Troubleshooting Top 3

1. cached_tokens stays at zero

  • Confirm the full prompt is at least 1024 tokens for OpenAI automatic caching.
  • Check that stable content is at the beginning, not after the user request.
  • Compare two serialized prompts and find the first differing character.
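The last check can be sketched as a small helper that reports where two serialized prompts first diverge. firstDivergence is a hypothetical name, and it compares characters rather than tokens, so treat the index as a pointer to the offending text, not an exact token position.

```javascript
// Sketch: locate the first character where two serialized prompts differ.
function firstDivergence(a, b, context = 20) {
  const limit = Math.min(a.length, b.length);
  let index = 0;
  while (index < limit && a[index] === b[index]) index += 1;
  // Both strings ended at the same point with no mismatch: they are identical.
  if (index === a.length && index === b.length) return null;
  const start = Math.max(0, index - context);
  return {
    index,
    left: a.slice(start, index + context),
    right: b.slice(start, index + context)
  };
}
```

A divergence index that falls inside the global or tenant-stable zone points at the change that is breaking cache reuse.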

2. Hit rate drops after a deploy

  • Verify that the prompt version, tool schema order, and response schema are unchanged.
  • Look for new timestamps, feature flags, request IDs, or unsorted JSON in the prefix.
  • Roll cache metrics up by prompt version so expected misses after releases are visible.

3. One tenant saves money, another does not

  • Check traffic cadence; short-lived in-memory caches need repeated requests within the retention window.
  • Group by prompt_cache_key to see whether keys are too granular.
  • Validate tenant-specific policy size; small tenants may not reach the provider's cache threshold.
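To see whether keys are too granular, a rough roll-up of call logs by cache key is often enough. The sketch below assumes each log record looks like { cache_key, cached_tokens }; that shape and the rollUpByKey name are assumptions for illustration.

```javascript
// Sketch: group call logs by cache key to spot keys that rarely repeat.
// Each log is assumed to look like { cache_key, cached_tokens }.
function rollUpByKey(logs) {
  const groups = new Map();
  for (const log of logs) {
    const group = groups.get(log.cache_key) || { cache_key: log.cache_key, calls: 0, hits: 0 };
    group.calls += 1;
    if (log.cached_tokens > 0) group.hits += 1;
    groups.set(log.cache_key, group);
  }
  // Least-used keys first: these are the candidates for coarser grouping.
  return [...groups.values()].sort((x, y) => x.calls - y.calls);
}
```

Keys that show only one or two calls per retention window are unlikely to ever produce a hit, whatever the prompt shape.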

What's Next

After basic caching works, turn it into a budget guardrail. Add CI checks that diff the first 1500 tokens of generated prompts, alert when cached-token ratio falls below a tenant-specific threshold, and expose savings in your internal billing dashboard.

The next engineering step is provider abstraction, but keep it thin. Normalize telemetry fields into cached_input_tokens, cache_write_tokens, cache_read_tokens, and cache_retention. Keep provider-specific behavior in adapters because OpenAI automatic routing, Anthropic breakpoints, and Gemini explicit TTLs are not the same system.
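A thin adapter for the usage fields could look like the sketch below. The output names follow the normalized fields listed above; the input names (cached_tokens, cache_read_input_tokens, cache_creation_input_tokens, cachedContentTokenCount) reflect each provider's usage object as documented at the time of writing and should be checked against current docs. cache_retention is left out because it comes from request configuration, not from the usage payload, and reporting zero write tokens for OpenAI and Gemini is an assumption of this sketch.

```javascript
// Sketch of a thin telemetry adapter; normalizeUsage is a hypothetical name.
// Provider field names are assumptions drawn from public docs, not guaranteed stable.
function normalizeUsage(provider, usage = {}) {
  switch (provider) {
    case 'openai': {
      // Responses API uses input_tokens_details; Chat Completions uses prompt_tokens_details.
      const details = usage.input_tokens_details || usage.prompt_tokens_details || {};
      const cached = details.cached_tokens || 0;
      return { cached_input_tokens: cached, cache_read_tokens: cached, cache_write_tokens: 0 };
    }
    case 'anthropic': {
      const read = usage.cache_read_input_tokens || 0;
      return {
        cached_input_tokens: read,
        cache_read_tokens: read,
        cache_write_tokens: usage.cache_creation_input_tokens || 0
      };
    }
    case 'gemini': {
      const cached = usage.cachedContentTokenCount || 0;
      return { cached_input_tokens: cached, cache_read_tokens: cached, cache_write_tokens: 0 };
    }
    default:
      throw new Error(`Unknown provider: ${provider}`);
  }
}
```

Throwing on an unknown provider, rather than returning zeros, keeps a missing adapter from silently reading as a zero hit rate.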

Finally, treat cached prefixes like production API contracts. Review them for privacy, version them deliberately, and test them with realistic tenant fixtures before every agent release.

Frequently Asked Questions

How do I make prompt caching work for multi-tenant AI agents?
Put identical global instructions first, tenant-stable context second, and request-specific data last. Use a tenant-scoped prompt_cache_key or equivalent routing key so reusable prefixes are grouped without sharing private tenant context.

Why are cached_tokens zero even when I repeat the same prompt?
The prompt may be below the provider's minimum cacheable size, the cache may have expired, or the prefix may not be identical. For OpenAI, caching is available for prompts containing 1024 tokens or more, and the usage object reports cached_tokens.

Can cached prompts leak data between tenants?
Provider caches are isolated by provider account boundaries, but your application still controls what goes into a reusable prefix. Do not place tenant secrets in a global prefix, and use tenant-scoped prompt versions and cache keys for tenant-specific context.

Does prompt caching reduce output token cost?
No. Prompt caching affects repeated input or prompt processing, not model generation. You still pay for output tokens and should separately control response length, tool loops, and retry behavior.
