Tenant-Aware Rate Limiting for LLM Apps [Guide 2026]
Bottom Line
Rate limit LLM apps by tenant and estimated tokens, not just request count. Pair token buckets with 429 backpressure so clients slow down before expensive model calls begin.
Key Takeaways
- Use one Redis token bucket per tenant and model scope for shared fairness.
- Debit by estimated prompt plus output tokens before calling the LLM.
- Return 429 with Retry-After so clients back off predictably.
- Keep refill and debit atomic with a Lua script to avoid burst races.
Tenant-aware rate limiting is the difference between a predictable LLM platform and one noisy customer burning everyone else's latency and budget. This tutorial builds a Redis-backed token bucket for per-tenant request and token budgets, then adds HTTP backpressure so clients slow down instead of retrying into an outage. The implementation is intentionally small: one Lua script, one Express middleware, and clear metrics you can wire into production.
Prerequisites
Before you start
- Node.js application using Express or a similar middleware pipeline.
- Redis reachable from each API instance.
- A tenant identifier on every request, such as tenant_id from auth claims.
- A token estimate before the LLM call, ideally prompt_tokens + max_output_tokens.
- Basic observability for status codes, latency, and queue depth.
Bottom Line
Use one token bucket per tenant for fairness, charge it by estimated LLM tokens, and return 429 with Retry-After when capacity is gone. Backpressure is part of the API contract, not an afterthought.
Numbered Steps
1. Model the tenant limit
LLM traffic is not uniform. A single request might ask for a short classification or a multi-page synthesis, so request-per-minute limits alone are too blunt. Use two budgets, each with a burst allowance:
- Request budget: caps concurrency pressure and protects routing, auth, and orchestration.
- Token budget: caps provider spend and long-running generation load.
- Burst capacity: lets a tenant briefly exceed the steady refill rate without monopolizing the system.
A practical starting policy is one bucket for estimated tokens and, if needed, a second bucket for raw requests. This tutorial focuses on the token bucket because it maps directly to LLM cost.
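// Per-plan token budgets: capacity is the burst ceiling, refillPerSecond the steady rate.
const tenantPlans = {
  starter: { capacity: 20_000, refillPerSecond: 200 },
  pro: { capacity: 200_000, refillPerSecond: 2_000 },
  enterprise: { capacity: 2_000_000, refillPerSecond: 20_000 }
};

// Coerce a client-supplied value to a non-negative integer; anything else counts as 0.
// Without this, a non-numeric field would turn the cost into NaN.
function toTokenCount(value) {
  const n = Number(value);
  return Number.isFinite(n) && n > 0 ? Math.ceil(n) : 0;
}

// Estimated cost of a request: prompt tokens plus the output ceiling, never less than 1.
function estimateTokens(req) {
  const promptTokens = toTokenCount(req.body?.prompt_tokens_estimate);
  const maxOutputTokens = toTokenCount(req.body?.max_output_tokens);
  return Math.max(1, promptTokens + maxOutputTokens);
}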
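To see how capacity and refillPerSecond interact, the following sketch simulates a bucket in memory using the starter plan's numbers. The helper names refill and secondsUntilAffordable are illustrative assumptions, not part of the tutorial's code, and the real state lives in Redis rather than in a local variable.

```javascript
// In-memory illustration only; helper names are hypothetical.
const starter = { capacity: 20_000, refillPerSecond: 200 };

// Tokens available after `elapsedSeconds`, capped at the plan capacity.
function refill(tokens, elapsedSeconds, plan) {
  return Math.min(plan.capacity, tokens + elapsedSeconds * plan.refillPerSecond);
}

// Seconds to wait until `cost` fits, or Infinity if it can never fit.
function secondsUntilAffordable(tokens, cost, plan) {
  if (cost > plan.capacity) return Infinity;
  if (tokens >= cost) return 0;
  return Math.ceil((cost - tokens) / plan.refillPerSecond);
}

// A full bucket absorbs a 20,000-token burst at once...
let tokens = starter.capacity - 20_000; // 0 left after the burst
// ...then recovers at the steady rate: 10 s later, 2,000 tokens are back.
tokens = refill(tokens, 10, starter);
console.log(tokens); // 2000
console.log(secondsUntilAffordable(tokens, 5_000, starter)); // 15
```

Capacity sets how large a burst can be; refillPerSecond sets the sustained throughput once the burst is spent.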
2. Create the Redis token bucket
Redis is a good fit because every API instance can share the same tenant state. The key detail is atomicity: read, refill, debit, and write must happen as one operation. Redis Lua scripts execute atomically, which avoids double-spending tokens during concurrent requests.
const TOKEN_BUCKET_LUA = `
-- KEYS[1]: bucket key. ARGV: now (ms), capacity, refill per ms, cost, TTL (ms).
local key = KEYS[1]
local now = tonumber(ARGV[1])
local capacity = tonumber(ARGV[2])
local refill_per_ms = tonumber(ARGV[3])
local cost = tonumber(ARGV[4])
local ttl_ms = tonumber(ARGV[5])

-- A missing hash means a new bucket: start it full.
local bucket = redis.call('HMGET', key, 'tokens', 'updated_at')
local tokens = tonumber(bucket[1]) or capacity
local updated_at = tonumber(bucket[2]) or now

-- Refill for the time elapsed since the last update, capped at capacity.
local elapsed = math.max(0, now - updated_at)
tokens = math.min(capacity, tokens + (elapsed * refill_per_ms))

local allowed = 0
local retry_after_ms = 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
else
  -- Time until the shortfall has been refilled.
  retry_after_ms = math.ceil((cost - tokens) / refill_per_ms)
end

redis.call('HSET', key, 'tokens', tokens, 'updated_at', now)
redis.call('PEXPIRE', key, ttl_ms)
return { allowed, math.floor(tokens), retry_after_ms }
`;
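For unit tests that should not depend on a Redis server, the script's arithmetic can be mirrored in plain JavaScript. This is a sketch under assumptions: the function name takeTokens and the shape of its state argument are hypothetical, and it does not replace the Lua script, which remains the only atomic path. The example uses the pro plan's numbers (2 tokens per millisecond).

```javascript
// Plain-JavaScript mirror of the script's arithmetic, for unit tests only.
// `state` stands in for the Redis hash; missing fields mean "new bucket".
function takeTokens(state, { now, capacity, refillPerMs, cost }) {
  const previous = state.tokens ?? capacity;    // new buckets start full
  const updatedAt = state.updatedAt ?? now;
  const elapsed = Math.max(0, now - updatedAt); // ignore a clock that went backwards
  let tokens = Math.min(capacity, previous + elapsed * refillPerMs);

  let allowed = 0;
  let retryAfterMs = 0;
  if (tokens >= cost) {
    tokens -= cost;
    allowed = 1;
  } else {
    retryAfterMs = Math.ceil((cost - tokens) / refillPerMs);
  }
  return {
    state: { tokens, updatedAt: now },
    reply: [allowed, Math.floor(tokens), retryAfterMs],
  };
}

const proPlan = { capacity: 200_000, refillPerMs: 2 };
const first = takeTokens({}, { now: 0, ...proPlan, cost: 1_600 });
console.log(first.reply); // [ 1, 198400, 0 ]

// 500 ms later the bucket has regained 1,000 tokens, 400 short of the cost.
const second = takeTokens(first.state, { now: 500, ...proPlan, cost: 199_800 });
console.log(second.reply); // [ 0, 199400, 200 ]
```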
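The key should include tenant and limit scope. Keep user-level or route-level dimensions separate only when they are real product requirements. Because the model name comes from the request body and becomes part of the key, validate it against the models you actually serve; otherwise a client can obtain a fresh, full bucket by sending a new model name.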
function bucketKey({ tenantId, model }) {
  return `rl:tenant:${tenantId}:model:${model}:tokens`;
}
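The key builder interpolates its inputs directly, so two concerns are worth sketching: a model name taken from the request body could be arbitrary, and a value containing ":" could make two different tenant and model pairs produce the same key. The version below is one possible hardening, not the tutorial's method. ALLOWED_MODELS and safeBucketKey are hypothetical names, and the allow-list contents are placeholders.

```javascript
// Hypothetical allow-list; replace with the models the service actually exposes.
const ALLOWED_MODELS = new Set(['default', 'support-bot']);

// Builds a bucket key from validated, escaped parts.
// Unknown models share the 'default' bucket instead of getting a new one.
function safeBucketKey({ tenantId, model }) {
  const scope = ALLOWED_MODELS.has(model) ? model : 'default';
  // encodeURIComponent escapes ':' as %3A, so parts cannot bleed into each other.
  const tenantPart = encodeURIComponent(String(tenantId));
  const modelPart = encodeURIComponent(scope);
  return `rl:tenant:${tenantPart}:model:${modelPart}:tokens`;
}

console.log(safeBucketKey({ tenantId: 'acme', model: 'support-bot' }));
// rl:tenant:acme:model:support-bot:tokens
console.log(safeBucketKey({ tenantId: 'acme', model: 'made-up-model' }));
// rl:tenant:acme:model:default:tokens
```

Folding unknown models into the default bucket is a design choice; rejecting them with a 400 before the limiter runs would work as well.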
3. Add backpressure middleware
Backpressure means the service gives callers a clear signal before work enters the expensive path. For public APIs, the most interoperable response is 429 plus Retry-After. For internal workers, you can also pause a queue, reduce concurrency, or shed low-priority jobs.
import express from 'express';
import { createClient } from 'redis';

const redis = createClient({ url: process.env.REDIS_URL });
await redis.connect();

function tenantRateLimit() {
  return async function rateLimitMiddleware(req, res, next) {
    // req.auth is assumed to be populated by an upstream auth middleware.
    const tenantId = req.auth?.tenant_id;
    const planName = req.auth?.plan || 'starter';
    const model = req.body?.model || 'default';
    if (!tenantId) {
      return res.status(401).json({ error: 'missing_tenant' });
    }

    const plan = tenantPlans[planName] || tenantPlans.starter;
    const cost = estimateTokens(req);
    const now = Date.now();
    const refillPerMs = plan.refillPerSecond / 1000;
    // Expire idle buckets after twice the time a full refill takes.
    const ttlMs = Math.ceil((plan.capacity / plan.refillPerSecond) * 2000);

    let reply;
    try {
      reply = await redis.eval(TOKEN_BUCKET_LUA, {
        keys: [bucketKey({ tenantId, model })],
        arguments: [
          String(now),
          String(plan.capacity),
          String(refillPerMs),
          String(cost),
          String(ttlMs)
        ]
      });
    } catch (err) {
      // Express 4 does not catch rejections from async middleware, so forward them.
      return next(err);
    }
    const [allowed, remaining, retryAfterMs] = reply;

    res.setHeader('X-RateLimit-Limit-Tokens', String(plan.capacity));
    res.setHeader('X-RateLimit-Remaining-Tokens', String(remaining));
    if (allowed !== 1) {
      // Retry-After is expressed in whole seconds; never advertise less than 1.
      const retryAfterSeconds = Math.max(1, Math.ceil(Number(retryAfterMs) / 1000));
      res.setHeader('Retry-After', String(retryAfterSeconds));
      return res.status(429).json({
        error: 'rate_limited',
        retry_after_seconds: retryAfterSeconds,
        estimated_token_cost: cost
      });
    }
    return next();
  };
}

const app = express();
app.use(express.json());
app.post('/v1/chat', tenantRateLimit(), async (req, res) => {
  // The LLM call would go here; this stub only confirms the request was admitted.
  res.json({ ok: true, accepted: true });
});

// Port 3000 matches the curl examples in the verification section.
app.listen(3000);
Verification and Expected Output
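Install the minimum dependencies and run your API with a Redis URL configured. The middleware reads req.auth, so an auth layer must set tenant_id (and optionally plan) before the limiter runs; without it, every request returns 401 with missing_tenant. The file also has to run as an ES module (for example, with "type": "module" in package.json) because it uses import and top-level await.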
npm install express redis
REDIS_URL=redis://localhost:6379 node server.js
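The curl commands in this section carry no credentials, so for a local smoke test something has to populate req.auth. The stub below is a development-only assumption: devAuthStub is a hypothetical helper and the tenant ID is a placeholder. A real deployment would derive req.auth from a verified token.

```javascript
// Development-only stub: attaches a fixed identity to every request.
// Never use this in production; tenant IDs must come from verified auth claims.
function devAuthStub({ tenantId = 'tenant-dev', plan = 'starter' } = {}) {
  return function attachAuth(req, _res, next) {
    req.auth = { tenant_id: tenantId, plan };
    next();
  };
}

// Usage, registered before the rate limiter:
// app.use(devAuthStub({ tenantId: 'acme', plan: 'starter' }));
```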
Send a request that fits inside the tenant bucket:
curl -i -X POST http://localhost:3000/v1/chat -H 'Content-Type: application/json' -d '{"model":"support-bot","prompt_tokens_estimate":1200,"max_output_tokens":400}'
Expected response for a tenant on the starter plan (other headers omitted):
HTTP/1.1 200 OK
X-RateLimit-Limit-Tokens: 20000
X-RateLimit-Remaining-Tokens: 18400
{"ok":true,"accepted":true}
Then send a request larger than the available bucket:
curl -i -X POST http://localhost:3000/v1/chat -H 'Content-Type: application/json' -d '{"model":"support-bot","prompt_tokens_estimate":50000,"max_output_tokens":50000}'
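Expected backpressure response, assuming the bucket is full. The bucket never holds more than its 20,000-token capacity, so this 100,000-token request is rejected on every attempt. The Retry-After value is only the script's arithmetic, (100,000 - 20,000) / 200 tokens per second = 400 seconds, not a time at which a retry will succeed: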
HTTP/1.1 429 Too Many Requests
Retry-After: 400
X-RateLimit-Limit-Tokens: 20000
X-RateLimit-Remaining-Tokens: 20000
{"error":"rate_limited","retry_after_seconds":400,"estimated_token_cost":100000}
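A request whose estimated cost is larger than the plan's capacity can never be admitted, so telling the client to retry only produces more 429s. One option is a guard that runs before the Redis call and rejects such requests outright. This is a sketch, not part of the tutorial's middleware: checkAdmissible and the request_too_large error code are hypothetical, and 413 is one defensible status choice among several (400 or 422 would also be reasonable).

```javascript
// Hypothetical pre-check, intended to run before the token bucket is consulted.
function checkAdmissible(cost, plan) {
  if (cost > plan.capacity) {
    return {
      ok: false,
      status: 413, // assumption: 400 or 422 would also be defensible
      body: {
        error: 'request_too_large',
        max_tokens: plan.capacity,
        estimated_token_cost: cost,
      },
    };
  }
  return { ok: true };
}

const starterPlan = { capacity: 20_000, refillPerSecond: 200 };
console.log(checkAdmissible(1_600, starterPlan).ok);       // true
console.log(checkAdmissible(100_000, starterPlan).status); // 413
```

In the middleware, a failed check would translate into res.status(result.status).json(result.body) with no Retry-After header, since waiting does not help.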
In production, verify these metrics before raising limits:
- 429 rate by tenant: identifies tenants that need plan changes or client fixes.
- Retry-After distribution: shows whether limits are merely smoothing bursts or blocking normal work.
- LLM provider latency: should stabilize when overload is shifted out of the hot path.
- Redis eval latency: should remain small compared with LLM latency.
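The first two metrics in this list can be prototyped with a small in-process tally before wiring up a real metrics backend. The class below is an illustrative assumption (LimiterStats is not part of the tutorial's code); it keeps everything in memory without bounds, so it suits local experiments rather than production.

```javascript
// In-process tally, for illustration; a real service would export these
// numbers to its metrics system instead of holding them in memory.
class LimiterStats {
  constructor() {
    this.byTenant = new Map();
  }

  // Record one limiter decision for a tenant.
  record(tenantId, allowed, retryAfterSeconds = 0) {
    const entry = this.byTenant.get(tenantId) ?? { allowed: 0, rejected: 0, retryAfter: [] };
    if (allowed) {
      entry.allowed += 1;
    } else {
      entry.rejected += 1;
      entry.retryAfter.push(retryAfterSeconds);
    }
    this.byTenant.set(tenantId, entry);
  }

  // Fraction of this tenant's requests that received a 429.
  rejectionRate(tenantId) {
    const entry = this.byTenant.get(tenantId);
    if (!entry) return 0;
    const total = entry.allowed + entry.rejected;
    return total === 0 ? 0 : entry.rejected / total;
  }

  // Largest Retry-After handed to this tenant, in seconds.
  maxRetryAfter(tenantId) {
    const entry = this.byTenant.get(tenantId);
    return entry && entry.retryAfter.length ? Math.max(...entry.retryAfter) : 0;
  }
}

const stats = new LimiterStats();
stats.record('acme', true);
stats.record('acme', true);
stats.record('acme', true);
stats.record('acme', false, 12);
console.log(stats.rejectionRate('acme')); // 0.25
console.log(stats.maxRetryAfter('acme')); // 12
```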
Troubleshooting Top 3
1. Every request is rate limited
- Check that refillPerSecond is greater than zero.
- Confirm token estimates are realistic and not accidentally measured in characters.
- Inspect the Redis key for stale values from a previous test policy.
2. Tenants bypass limits during traffic spikes
- Make sure all API instances use the same Redis deployment.
- Keep the script atomic; do not split refill and debit into separate client calls.
- Use tenant IDs from verified auth claims, not user-supplied request fields.
3. Clients retry too aggressively
- Return Retry-After on every 429.
- Document exponential backoff with jitter for SDK users.
- For batch jobs, queue deferred work instead of retrying immediately inside request handlers.
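On the client side, the advice above can be sketched as a delay calculation that treats Retry-After as a floor and adds jittered exponential backoff on top, so that clients rejected at the same moment do not all return together. The function name retryDelayMs and its default base and cap are assumptions chosen for illustration.

```javascript
// Delay before retry number `attempt` (0-based), in milliseconds.
// `retryAfterSeconds` is the numeric Retry-After value from the 429 response;
// an HTTP-date form is not handled here and is treated as 0.
function retryDelayMs(
  attempt,
  retryAfterSeconds,
  { baseMs = 500, capMs = 60_000, random = Math.random } = {}
) {
  const exponential = Math.min(capMs, baseMs * 2 ** attempt);
  const jitter = random() * exponential; // "full jitter": uniform in [0, exponential)
  const serverFloorMs = (Number(retryAfterSeconds) || 0) * 1000;
  // Never retry sooner than the server asked; jitter spreads clients out beyond that.
  return serverFloorMs + jitter;
}

// With a fixed random source the result is deterministic:
const half = { random: () => 0.5 };
console.log(retryDelayMs(2, 3, half));  // 4000  (3000 floor + half of 2000)
console.log(retryDelayMs(10, 0, half)); // 30000 (half of the 60000 cap)
```

Passing the random source as a parameter keeps the function testable; callers would normally leave it at Math.random.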
What's Next
This limiter is enough for a production first pass, but mature LLM platforms usually add policy layers around it:
- Post-call reconciliation: debit or refund the difference between estimated and actual tokens.
- Priority lanes: reserve capacity for interactive requests ahead of batch summarization.
- Model-specific pricing: charge expensive models with a higher token cost multiplier.
- Tenant dashboards: expose usage, remaining budget, and backoff guidance without support tickets.
- Circuit breakers: combine tenant limits with provider-health checks when an upstream model slows down.
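Two of these layers, post-call reconciliation and model-specific pricing, reduce to small pieces of arithmetic. The sketch below shows them against an in-memory balance. All names and the multiplier values are hypothetical placeholders, and applying the delta to the Redis bucket would need its own atomic script so that refunds still respect the capacity cap.

```javascript
// Hypothetical multipliers; the values are placeholders, not real prices.
const MODEL_MULTIPLIER = new Map([
  ['default', 1],
  ['large-model', 4],
]);

// Token cost weighted by model; unknown models fall back to 'default'.
function weightedCost(tokens, model) {
  const multiplier = MODEL_MULTIPLIER.get(model) ?? MODEL_MULTIPLIER.get('default');
  return Math.ceil(tokens * multiplier);
}

// Positive delta: the call used more than estimated, so debit the extra.
// Negative delta: it used less, so refund the difference.
function reconciliationDelta(estimatedTokens, actualTokens) {
  return actualTokens - estimatedTokens;
}

// Apply a delta to a bucket balance, staying within [0, capacity].
function applyDelta(tokens, delta, capacity) {
  return Math.min(capacity, Math.max(0, tokens - delta));
}

const delta = reconciliationDelta(1_600, 900);   // -700: refund 700 tokens
console.log(applyDelta(18_400, delta, 20_000));  // 19100
console.log(weightedCost(1_600, 'large-model')); // 6400
```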
The design rule is simple: apply fairness before the LLM call, communicate delay explicitly, and make the client part of the control loop.