Prompt Caching Cost Controls for Multi-Tenant Agents
Bottom Line
Prompt caching only controls cost when your agent keeps stable, reusable prompt prefixes while isolating tenant data. Treat cached tokens as a budget metric, not an invisible provider optimization.
Key Takeaways
- Cache stable system, policy, and tool schemas before dynamic tenant input.
- Track cached_tokens or provider cache-read fields per tenant and route.
- Never put secrets or mutable tenant state in a shared reusable prefix.
- ›Fail builds when cache hit rate drops after prompt or tool-schema changes.
Prompt caching can cut repeated-input cost and latency for AI agents, but multi-tenant systems need stricter controls than a single-user chatbot. The practical goal is not simply to “turn caching on.” It is to keep long, stable prompt prefixes reusable while making sure tenant-specific data, secrets, policy versions, and budget accounting stay isolated. This tutorial builds a small provider-neutral pattern you can drop into an agent gateway.
Prerequisites
- A Node.js service that sends requests through one agent gateway or API wrapper.
- Access to provider usage metadata. For OpenAI, check usage.prompt_tokens_details.cached_tokens; for Anthropic, check cache creation and cache read token fields.
- A tenant identifier available before model invocation.
- A stable place to version system prompts, tool schemas, and retrieval policies.
- Basic redaction rules for logs. For sample test data, use TechBytes' Data Masking Tool before pasting payloads into fixtures.
Bottom Line
The cheapest multi-tenant agent is not the one with the shortest prompt; it is the one with a deterministic reusable prefix, tenant-scoped telemetry, and budget gates that react when cache hits disappear.
Design the Cache Contract
OpenAI applies prompt caching automatically on supported models and reports cached prompt tokens in usage metadata. Its docs describe caching the longest previously seen prompt prefix, starting at 1,024 tokens and growing in 128-token increments. Anthropic's documentation uses explicit cache_control blocks, including ephemeral caches with a default 5-minute lifetime and optional longer TTLs where supported. The implementation details differ, but the application contract is the same: place stable material first and volatile material later.
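As a rough illustration of "stable first, volatile later" on a provider with explicit cache markers, the sketch below builds an Anthropic-style Messages request body. The helper name buildAnthropicRequest, its parameters, and the placeholder model ID are assumptions for illustration, and the function makes no network call; check the provider's current documentation before relying on the exact field shape.

```javascript
// Hypothetical helper: builds a request body only and sends nothing.
function buildAnthropicRequest({ model, maxTokens, stablePrefix, dynamicSuffix }) {
  return {
    model,
    max_tokens: maxTokens,
    // Stable material goes first and carries the cache marker.
    system: [
      { type: 'text', text: stablePrefix, cache_control: { type: 'ephemeral' } }
    ],
    // Volatile tenant material goes after the cached block.
    messages: [{ role: 'user', content: dynamicSuffix }]
  };
}

const body = buildAnthropicRequest({
  model: 'MODEL_ID_PLACEHOLDER', // assumption: replace with a real model ID
  maxTokens: 512,
  stablePrefix: 'prompt_version=v1\n\nShared policy text...',
  dynamicSuffix: 'tenant_id=acme\n\nUser message:\nWhere is my invoice?'
});

console.log(JSON.stringify(body.system[0].cache_control)); // {"type":"ephemeral"}
```

With automatic caching providers, the same ordering applies, but no marker is needed: the gateway only has to keep the leading text byte-identical across requests.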
Put these in the stable prefix
- Global system instructions that change only with a prompt version.
- Tool names, JSON schemas, and routing rules that are identical across tenants.
- Compliance policy text that is shared by product tier or region.
- Few-shot examples that do not contain tenant data.
Keep these out of the shared prefix
- Raw customer documents, tickets, transcripts, or private user profile fields.
- Tenant secrets, API keys, database names, and internal account notes.
- Frequently changing retrieval results or tool outputs.
- Random timestamps, trace IDs, and nonce values that would invalidate the prefix.
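One way to enforce the two lists above is a small lint that scans the rendered stable prefix for content that should never appear there. This is a minimal sketch: the function name findVolatileContent and the pattern list are illustrative assumptions, not an exhaustive detector, and should be tuned to your own prompt conventions.

```javascript
// Illustrative patterns only; none of them uses the global flag, so test() is stateless.
const VOLATILE_PATTERNS = [
  { name: 'iso_timestamp', re: /\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}/ },
  { name: 'uuid', re: /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/i },
  { name: 'tenant_field', re: /\btenant_id\s*[=:]/i },
  { name: 'bearer_token', re: /\bBearer\s+[A-Za-z0-9._-]{16,}/ }
];

// Returns the names of every pattern found in the prefix, in list order.
function findVolatileContent(prefix) {
  return VOLATILE_PATTERNS.filter(({ re }) => re.test(prefix)).map(({ name }) => name);
}

const clean = 'prompt_version=v7\n\nYou are the support automation agent.';
const dirty = `${clean}\n\ngenerated_at=2024-05-01T12:00:00Z\ntenant_id=acme`;

console.log(findVolatileContent(clean)); // []
console.log(findVolatileContent(dirty)); // [ 'iso_timestamp', 'tenant_field' ]
```

Running a check like this in CI against the rendered prefix catches the most common accidental cache busters before they reach production traffic.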
Build the Cost Control Path
- Create a deterministic prompt layout.
- Compute a cache key for telemetry, not for provider reuse.
- Call the model through one gateway function.
- Record cached tokens, uncached input tokens, and output tokens by tenant.
- Enforce a budget before allowing expensive uncached calls.
Step 1: define stable and dynamic blocks
const crypto = require('node:crypto');
// Shared, tenant-free material. Changes only when promptVersion changes.
function stablePrefix({ productPolicy, toolSchemas, promptVersion }) {
  return [
    `prompt_version=${promptVersion}`,
    'You are the support automation agent for a B2B SaaS platform.',
    productPolicy,
    JSON.stringify(toolSchemas)
  ].join('\n\n');
}
// Tenant- and request-specific material. Always rendered after the stable prefix.
function dynamicSuffix({ tenantId, userId, retrievalContext, userMessage }) {
  return [
    `tenant_id=${tenantId}`,
    `user_id=${userId}`,
    'Retrieved context:',
    retrievalContext,
    'User message:',
    userMessage
  ].join('\n\n');
}
// Short, stable identifier for the prefix, used in logs and dashboards.
function prefixFingerprint(prefix) {
  return crypto.createHash('sha256').update(prefix).digest('hex').slice(0, 16);
}
The prefixFingerprint is for your logs and dashboards. It does not force provider caching, but it makes cache regressions visible when a prompt edit, schema change, or whitespace churn alters the reusable prefix.
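Because the stable prefix serializes tool schemas with JSON.stringify, two logically identical schemas with different key order produce different prefixes and different fingerprints. The sketch below shows one possible mitigation: sorting object keys recursively before serializing. The canonicalize and fingerprint helpers and the sample schema are assumptions introduced here for illustration.

```javascript
const crypto = require('node:crypto');

// Recursively sort object keys so logically equal schemas serialize identically.
// Array order is preserved because it can be meaningful.
function canonicalize(value) {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === 'object') {
    const sorted = {};
    for (const key of Object.keys(value).sort()) {
      sorted[key] = canonicalize(value[key]);
    }
    return sorted;
  }
  return value;
}

function fingerprint(text) {
  return crypto.createHash('sha256').update(text).digest('hex').slice(0, 16);
}

// Same schema, different key order.
const a = [{ name: 'lookup_order', parameters: { type: 'object', required: ['id'] } }];
const b = [{ parameters: { required: ['id'], type: 'object' }, name: 'lookup_order' }];

const rawMatch = fingerprint(JSON.stringify(a)) === fingerprint(JSON.stringify(b));
const canonicalMatch =
  fingerprint(JSON.stringify(canonicalize(a))) === fingerprint(JSON.stringify(canonicalize(b)));

console.log(rawMatch); // false
console.log(canonicalMatch); // true
```

Canonicalizing at render time means a refactor that rebuilds schema objects in a different order does not silently turn warm requests into cold ones.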
Step 2: add a tenant budget gate
function estimateUncachedRisk({ inputTokens, cachedTokens }) {
  return Math.max(inputTokens - cachedTokens, 0);
}
function assertTenantBudget({ tenantId, estimatedUncachedTokens, monthlyRemaining }) {
  if (estimatedUncachedTokens > monthlyRemaining) {
    throw new Error(
      `Tenant ${tenantId} would exceed uncached-token budget: ` +
      `${estimatedUncachedTokens} requested, ${monthlyRemaining} remaining`
    );
  }
}
Budgeting against total tokens hides cache misses. Budgeting against estimated uncached input tokens catches the painful failure mode: a small prompt refactor that turns every request into a cold request.
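Before a call is made, the actual cached token count is not yet known, so a gate has to work from an estimate. One possible approach, sketched below, is to assume the recent cache hit rate for the same prefix fingerprint will hold. The function names, the recentHitRate input, and the token figures are hypothetical and chosen only to make the arithmetic easy to follow.

```javascript
// Hypothetical pre-call estimate: assume the recent hit rate for this prefix holds.
function estimateUncachedTokens({ estimatedInputTokens, recentHitRate }) {
  const rate = Math.min(Math.max(recentHitRate, 0), 1); // clamp to [0, 1]
  return Math.ceil(estimatedInputTokens * (1 - rate));
}

// Non-throwing variant of a budget check, convenient for logging the decision.
function checkBudget({ tenantId, estimatedUncachedTokens, monthlyRemaining }) {
  return {
    tenantId,
    allowed: estimatedUncachedTokens <= monthlyRemaining,
    estimatedUncachedTokens,
    monthlyRemaining
  };
}

const warm = estimateUncachedTokens({ estimatedInputTokens: 6000, recentHitRate: 0.75 });
const cold = estimateUncachedTokens({ estimatedInputTokens: 6000, recentHitRate: 0 });
console.log(warm, cold); // 1500 6000

const decision = checkBudget({
  tenantId: 'acme',
  estimatedUncachedTokens: cold,
  monthlyRemaining: 4000
});
console.log(decision.allowed); // false
```

The same request that fits the budget when the prefix is warm is rejected when the hit rate collapses to zero, which is the failure mode the gate is meant to surface.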
Step 3: normalize provider usage
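function normalizeOpenAIUsage(response) {
  const usage = response.usage || {};
  // Chat Completions reports prompt_tokens_details; the Responses API reports input_tokens_details.
  const details = usage.prompt_tokens_details || usage.input_tokens_details || {};
  return {
    inputTokens: usage.input_tokens || usage.prompt_tokens || 0,
    outputTokens: usage.output_tokens || usage.completion_tokens || 0,
    cachedInputTokens: details.cached_tokens || 0
  };
}
function normalizeAnthropicUsage(response) {
  const usage = response.usage || {};
  const cacheRead = usage.cache_read_input_tokens || 0;
  const cacheCreation = usage.cache_creation_input_tokens || 0;
  return {
    // Anthropic's input_tokens excludes cache reads and cache writes,
    // so add them back to get a total comparable to OpenAI's input count.
    inputTokens: (usage.input_tokens || 0) + cacheRead + cacheCreation,
    outputTokens: usage.output_tokens || 0,
    cachedInputTokens: cacheRead,
    cacheCreationTokens: cacheCreation
  };
}
Step 4: log the cost event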
async function recordAgentUsage(db, event) {
  await db.agent_usage.insert({
    tenant_id: event.tenantId,
    route: event.route,
    provider: event.provider,
    prefix_fingerprint: event.prefixFingerprint,
    input_tokens: event.inputTokens,
    cached_input_tokens: event.cachedInputTokens,
    output_tokens: event.outputTokens,
    // Clamp at zero in case a provider reports cached tokens separately from input tokens.
    uncached_input_tokens: Math.max(event.inputTokens - event.cachedInputTokens, 0),
    created_at: new Date().toISOString()
  });
}
This schema supports three practical controls: per-tenant budget enforcement, prompt-version regression detection, and route-level optimization. If one route never crosses the provider's minimum cacheable prefix size, do not spend engineering time tuning it.
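To turn a recorded usage event into a budget number, the token counts can be weighted by price. The sketch below is only an illustration: estimateCostUsd is a hypothetical helper, and every price is a made-up placeholder per million tokens, not a real provider rate. Substitute the published rates for your provider and model.

```javascript
// All prices are placeholders per million tokens, not real provider rates.
function estimateCostUsd(event, prices) {
  const uncached = Math.max(event.inputTokens - event.cachedInputTokens, 0);
  const total =
    uncached * prices.inputPerMTok +
    event.cachedInputTokens * prices.cachedInputPerMTok +
    event.outputTokens * prices.outputPerMTok;
  return total / 1_000_000;
}

const placeholderPrices = { inputPerMTok: 2, cachedInputPerMTok: 1, outputPerMTok: 8 };

const warmCost = estimateCostUsd(
  { inputTokens: 5700, cachedInputTokens: 4096, outputTokens: 300 },
  placeholderPrices
);
const coldCost = estimateCostUsd(
  { inputTokens: 5700, cachedInputTokens: 0, outputTokens: 300 },
  placeholderPrices
);

console.log(warmCost.toFixed(6)); // 0.009704
console.log(coldCost.toFixed(6)); // 0.013800
```

Storing token counts rather than computed cost keeps the usage table valid when prices change; the weighting can be applied at query or dashboard time.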
Verification and Expected Output
Run two requests for the same route that share the stable prefix and differ only in the user message. Then query your usage table by prefix_fingerprint.
select
  tenant_id,
  route,
  prefix_fingerprint,
  count(*) as calls,
  sum(input_tokens) as input_tokens,
  sum(cached_input_tokens) as cached_input_tokens,
  round(100.0 * sum(cached_input_tokens) / nullif(sum(input_tokens), 0), 2) as cache_hit_pct
from agent_usage
where created_at > now() - interval '15 minutes'
group by tenant_id, route, prefix_fingerprint
order by cache_hit_pct asc;
Expected output for a healthy long-prefix route looks like this:
tenant_id | route        | prefix_fingerprint | calls | input_tokens | cached_input_tokens | cache_hit_pct
----------+--------------+--------------------+-------+--------------+---------------------+--------------
acme      | ticket.reply | 9f3a21c88b17e104   |    25 |       142500 |              104960 |         73.66
- 0% cache hit can be valid for a new prefix, a short prompt, or a provider/model that does not support caching for that request.
- 40-80% often means the stable prefix is large and dynamic retrieval is still significant.
- 90%+ is realistic only when the reusable prefix dominates the request.
Troubleshooting Top 3
1. Cache hits stay at zero
- Check whether the reusable prefix is large enough for the provider's documented minimum.
- Remove timestamps, request IDs, randomized examples, and reordered JSON from the prefix.
- Confirm that usage metadata is being read from the current response shape.
2. One tenant spends far more than others
- Group spend by tenant_id, route, and prefix_fingerprint.
- Look for tenant-specific policies or custom tools inserted before the shared prefix.
- Move tenant customization after the cacheable block unless it must govern all later instructions.
3. A prompt release doubled cost
- Compare cache hit percentage before and after the prompt_version change.
- Diff the stable prefix with a formatter before blaming traffic. TechBytes' Code Formatter is useful for normalizing JSON tool schemas.
- Roll back prompt layout changes separately from instruction wording changes.
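The before-and-after comparison from the third troubleshooting item can be automated as a release gate. This is a minimal sketch under stated assumptions: the row shape mirrors the usage fields recorded earlier, while the checkCacheRegression name and the 10-point default threshold are hypothetical choices, not recommendations from any provider.

```javascript
// Aggregate hit rate across rows, as a fraction between 0 and 1.
function hitRate(rows) {
  const input = rows.reduce((sum, r) => sum + r.inputTokens, 0);
  const cached = rows.reduce((sum, r) => sum + r.cachedInputTokens, 0);
  return input === 0 ? 0 : cached / input;
}

// Hypothetical gate: fail when the hit rate drops by more than maxDropPoints
// percentage points between the baseline and candidate prompt versions.
function checkCacheRegression({ baselineRows, candidateRows, maxDropPoints = 10 }) {
  const baseline = hitRate(baselineRows) * 100;
  const candidate = hitRate(candidateRows) * 100;
  const drop = baseline - candidate;
  return { baseline, candidate, drop, ok: drop <= maxDropPoints };
}

const result = checkCacheRegression({
  baselineRows: [{ inputTokens: 5700, cachedInputTokens: 4096 }],
  candidateRows: [{ inputTokens: 5700, cachedInputTokens: 0 }]
});

console.log(result.ok); // false
// In a CI script, a failing result could set a non-zero exit code.
```

Comparing rows grouped by prefix fingerprint rather than by time window keeps traffic changes from being mistaken for prompt regressions.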
What's Next
Once the gateway records cache-aware usage, add automated controls around it. Start with alerts, then move to release gates.
- Add a CI fixture that renders the stable prefix and fails when accidental dynamic fields appear.
- Create a dashboard for cache_hit_pct, uncached_input_tokens, and cost per tenant.
- Review official provider docs during model migrations: OpenAI prompt caching and Anthropic prompt caching expose different controls and usage fields.
- For sensitive tenants, run a privacy review that treats prompt prefixes as production data, even when the provider says caching is scoped and temporary.
Frequently Asked Questions
How do I measure prompt caching savings in a multi-tenant agent?
For OpenAI, read usage.prompt_tokens_details.cached_tokens; for Anthropic, read cache read and cache creation input token fields. Store those values with tenant_id, route, and prompt version.
Should tenant-specific instructions go before or after the cached prefix?
Why did my cached token count drop after changing tool schemas?
Is prompt caching enough for SaaS cost control?