Tenant-Aware Rate Limiting for LLM Apps [2026]
Bottom Line
A tenant-aware token bucket lets every customer spend from a fair, isolated budget. Backpressure turns overload into predictable waits or fast failures instead of runaway queues and surprise LLM bills.
Key Takeaways
- Limit by tenant, not only by API key or global process
- Use tokens for both request count and estimated LLM cost
- Return 429 for exhausted budgets and 503 for overloaded queues
- Keep queue depth, wait time, and token rejections in metrics
LLM applications fail differently from ordinary APIs: one noisy tenant can consume shared model capacity, increase latency for everyone, and create a real invoice before dashboards catch up. A tenant-aware limiter fixes that by charging each customer against an isolated budget before work enters the expensive path. In this tutorial, you will build a token-bucket limiter and add backpressure so overload becomes observable, bounded, and recoverable.
Prerequisites
Bottom Line
Use one token bucket per tenant and model tier, then apply queue backpressure after admission. Rate limiting protects customer fairness; backpressure protects your workers.
You need a basic Node.js HTTP service, a Redis-compatible datastore for shared counters, and a clear tenant identifier on every authenticated request. The examples use plain JavaScript so the mechanics are visible, but the same pattern works in TypeScript, Go, Python, or any stack that can execute an atomic Redis script.
- Tenant ID: a stable customer, workspace, or organization identifier.
- Capacity: maximum burst budget a tenant can spend immediately.
- Refill rate: how many cost tokens return per second.
- Cost estimate: predicted prompt plus completion token usage.
- Queue limit: maximum in-flight or waiting LLM jobs allowed globally.
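The terms above can be gathered into one settings object so the units stay explicit. This is a minimal sketch: the names `DEFAULT_LIMITS` and `toLimiterArgs` are hypothetical, and the numbers simply mirror the values used later in this tutorial.

```javascript
// Values mirror the ones used later in this tutorial; treat them as placeholders.
const DEFAULT_LIMITS = {
  capacityTokens: 120_000,        // maximum burst a tenant can spend at once
  refillTokensPerMinute: 60_000,  // sustained budget
  maxQueueDepth: 500,             // global cap on waiting LLM jobs
};

// Convert the human-friendly per-minute rate into the per-millisecond rate
// the limiter uses, and derive how long an empty bucket takes to refill.
function toLimiterArgs({ capacityTokens, refillTokensPerMinute }) {
  const refillPerMs = refillTokensPerMinute / 60_000;
  return {
    capacity: capacityTokens,
    refillPerMs,
    fullRefillMs: Math.ceil(capacityTokens / refillPerMs),
  };
}

console.log(toLimiterArgs(DEFAULT_LIMITS));
```

With these numbers the limiter refills 1 token per millisecond, so an empty bucket is full again after 120000 ms (two minutes). Doing the conversion in one place makes unit mistakes easier to spot.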
Build the Limiter
1. Model cost as limiter tokens
A request-count limiter treats a 100-token chat and a 100,000-token summarization job as equal. LLM systems need a cost-aware unit. Start with an estimate that is intentionally conservative: prompt tokens you already know, plus a configured maximum completion size.
function estimateCostTokens({ promptTokens, maxOutputTokens, modelMultiplier = 1 }) {
  return Math.ceil((promptTokens + maxOutputTokens) * modelMultiplier);
}
const cost = estimateCostTokens({
  promptTokens: 1800,
  maxOutputTokens: 600,
  modelMultiplier: 1
});
Keep the model multiplier in configuration. Expensive model tiers can charge more limiter tokens without changing the rest of the system.
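One way to keep the multiplier in configuration is a lookup table keyed by model class. The sketch below is illustrative only: `MODEL_MULTIPLIERS`, `multiplierFor`, and `estimateCostFromConfig` are hypothetical names, and the values (1 for standard, 4 for premium) are the ones this tutorial's middleware uses, not provider pricing.

```javascript
// Hypothetical configuration table; in practice this might be loaded from a file or service.
const MODEL_MULTIPLIERS = new Map([
  ['standard', 1],
  ['premium', 4],
]);

// Fail closed: an unknown model class is charged at the highest configured rate.
function multiplierFor(modelClass) {
  return MODEL_MULTIPLIERS.get(modelClass) ?? Math.max(...MODEL_MULTIPLIERS.values());
}

function estimateCostFromConfig({ promptTokens, maxOutputTokens, modelClass }) {
  // Reject missing or malformed counts instead of silently producing NaN.
  const inputs = [promptTokens, maxOutputTokens];
  if (!inputs.every((n) => Number.isFinite(n) && n >= 0)) {
    throw new RangeError('promptTokens and maxOutputTokens must be non-negative numbers');
  }
  return Math.ceil((promptTokens + maxOutputTokens) * multiplierFor(modelClass));
}

console.log(estimateCostFromConfig({ promptTokens: 1800, maxOutputTokens: 600, modelClass: 'premium' }));
```

Charging unknown classes at the highest rate is a design choice, assumed here so that a typo in a model name cannot make requests cheaper.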
2. Store one bucket per tenant
The bucket key should include the tenant and, when useful, the model class. This prevents a batch job for one tenant from draining another tenant's interactive chat allowance.
function bucketKey({ tenantId, modelClass }) {
  return `rl:${tenantId}:${modelClass}`;
}
3. Use an atomic token-bucket script
The limiter must be atomic because multiple app instances may admit requests at the same time. This Redis Lua script refills the bucket based on elapsed time, charges the request cost, and returns whether the request is allowed.
const TOKEN_BUCKET_LUA = `
local key = KEYS[1]
local now = tonumber(ARGV[1])
local capacity = tonumber(ARGV[2])
local refill_per_ms = tonumber(ARGV[3])
local cost = tonumber(ARGV[4])

-- A missing key means a new tenant: start with a full bucket.
local bucket = redis.call('HMGET', key, 'tokens', 'updated_at')
local tokens = tonumber(bucket[1]) or capacity
local updated_at = tonumber(bucket[2]) or now

-- Refill for the elapsed time, never above capacity.
local elapsed = math.max(0, now - updated_at)
tokens = math.min(capacity, tokens + elapsed * refill_per_ms)

if tokens < cost then
  -- Not enough budget: report how long until the shortfall refills.
  local needed = cost - tokens
  local retry_ms = math.ceil(needed / refill_per_ms)
  redis.call('HMSET', key, 'tokens', tokens, 'updated_at', now)
  redis.call('PEXPIRE', key, math.ceil(capacity / refill_per_ms))
  return {0, math.floor(tokens), retry_ms}
end

-- Charge the request. The key expires once an idle bucket would be full again.
tokens = tokens - cost
redis.call('HMSET', key, 'tokens', tokens, 'updated_at', now)
redis.call('PEXPIRE', key, math.ceil(capacity / refill_per_ms))
return {1, math.floor(tokens), 0}
`;
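To unit-test the refill arithmetic without a Redis instance, the same steps can be mirrored in plain JavaScript. This is a sketch for tests only, not a replacement for the script: `takeTokens` is a hypothetical helper, it is not atomic across processes, and it takes the current time as an argument so tests can control the clock.

```javascript
// In-memory model of the Lua script's arithmetic, for unit tests only.
// The Redis script remains the source of truth in production.
function takeTokens(bucket, { now, capacity, refillPerMs, cost }) {
  const previousTokens = bucket.tokens ?? capacity; // new buckets start full
  const updatedAt = bucket.updatedAt ?? now;
  const elapsed = Math.max(0, now - updatedAt);     // ignore clocks that move backwards
  const tokens = Math.min(capacity, previousTokens + elapsed * refillPerMs);

  if (tokens < cost) {
    const retryMs = Math.ceil((cost - tokens) / refillPerMs);
    return { allowed: false, retryMs, bucket: { tokens, updatedAt: now } };
  }
  return { allowed: true, retryMs: 0, bucket: { tokens: tokens - cost, updatedAt: now } };
}

const limits = { capacity: 1000, refillPerMs: 1, cost: 800 };
let step = takeTokens({}, { now: 0, ...limits });            // allowed, 200 tokens left
step = takeTokens(step.bucket, { now: 100, ...limits });     // rejected, retryMs is 500
step = takeTokens(step.bucket, { now: 600, ...limits });     // allowed, 0 tokens left
console.log(step);
```

The walkthrough shows the three cases worth covering: a fresh bucket, a rejection with a retry hint, and admission once the hinted delay has passed.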
4. Wrap admission in middleware
The middleware estimates cost, checks the tenant bucket, and rejects exhausted tenants with 429. Include Retry-After so well-behaved clients can slow down.
// Assumes `redis` is an ioredis-style client: eval(script, numKeys, ...keys, ...args).
async function rateLimitTenant(req, res, next) {
  const tenantId = req.auth.tenantId;
  // Allowlist the model class: it is part of the bucket key, so an arbitrary
  // client-supplied value would create a fresh, full bucket on every request.
  const modelClass = req.body.modelClass === 'premium' ? 'premium' : 'standard';
  const cost = estimateCostTokens({
    promptTokens: req.body.promptTokens,
    maxOutputTokens: req.body.maxOutputTokens,
    modelMultiplier: modelClass === 'premium' ? 4 : 1
  });
  const capacity = 120000;
  const refillPerMinute = 60000;
  const refillPerMs = refillPerMinute / 60000; // the script works in milliseconds
  const [allowed, remaining, retryMs] = await redis.eval(
    TOKEN_BUCKET_LUA,
    1,
    bucketKey({ tenantId, modelClass }),
    Date.now(),
    capacity,
    refillPerMs,
    cost
  );
  res.setHeader('X-RateLimit-Remaining-Tokens', String(remaining));
  if (!allowed) {
    // Retry-After is expressed in whole seconds.
    res.setHeader('Retry-After', String(Math.ceil(retryMs / 1000)));
    return res.status(429).json({
      error: 'tenant_rate_limit_exceeded',
      retryMs
    });
  }
  req.limiterCost = cost;
  next();
}
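One edge case is worth handling before the script runs. The bucket never holds more than its capacity, so a request whose cost exceeds capacity can never be admitted, yet the script in step 3 would still return a finite retry delay. A possible pre-check is sketched below; `precheckCost`, the status codes, and the error names are assumptions for illustration, not part of the limiter above.

```javascript
// Pre-admission check, intended to run before the limiter script is called.
// Status codes and error names are illustrative assumptions.
function precheckCost(cost, capacity) {
  if (!Number.isFinite(cost) || cost <= 0) {
    // Missing or malformed token counts produce NaN in the estimate.
    return { ok: false, status: 400, error: 'invalid_cost_estimate' };
  }
  if (cost > capacity) {
    // Waiting cannot help: the bucket is capped at `capacity`.
    return { ok: false, status: 413, error: 'request_exceeds_tenant_capacity' };
  }
  return { ok: true };
}

console.log(precheckCost(2400, 120000));
console.log(precheckCost(200000, 120000));
```

Returning a distinct error for oversized requests keeps clients from retrying a call that will never succeed under the current plan.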
Add Backpressure
5. Separate tenant fairness from system saturation
Rate limiting answers, “Can this tenant spend more?” Backpressure answers, “Can the platform accept more work right now?” You need both, because a fair tenant can still arrive during a regional outage, a model provider slowdown, or a worker deployment.
- Use 429 when the tenant has exhausted its configured budget.
- Use 503 when the shared queue or worker pool is saturated.
- Track queue wait time separately from model latency.
- Prefer bounded queues over unbounded promises in memory.
const MAX_QUEUE_DEPTH = 500;
const MAX_WAIT_MS = 30_000;
const llmQueue = [];
function enqueueWithBackpressure(job, res) {
  if (llmQueue.length >= MAX_QUEUE_DEPTH) {
    return res.status(503).json({
      error: 'llm_queue_saturated',
      retryMs: 5000
    });
  }
  const queuedAt = Date.now();
  llmQueue.push({ job, res, queuedAt });
}
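// Resolve after `ms` milliseconds; used to idle when the queue is empty.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function workerLoop() {
  while (true) {
    const item = llmQueue.shift();
    if (!item) {
      await sleep(25);
      continue;
    }
    // Drop work that waited too long instead of spending model capacity on it.
    if (Date.now() - item.queuedAt > MAX_WAIT_MS) {
      item.res.status(503).json({ error: 'llm_queue_timeout' });
      continue;
    }
    try {
      const result = await callModelProvider(item.job);
      item.res.json(result);
    } catch (err) {
      // Without this, one provider failure would end the loop and strand the queue.
      // The 502 status and error name are assumptions; adapt them to your API.
      item.res.status(502).json({ error: 'llm_provider_error' });
    }
  }
}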
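The queue rules can also be pulled out of the HTTP layer so they are testable on their own. The sketch below is one possible shape, not the tutorial's implementation: `BoundedQueue` and its method names are hypothetical, and time is passed in explicitly so a test can simulate waiting.

```javascript
// Minimal bounded FIFO that separates "rejected because full" from
// "dropped because it waited too long". Names are illustrative.
class BoundedQueue {
  constructor({ maxDepth, maxWaitMs }) {
    this.maxDepth = maxDepth;
    this.maxWaitMs = maxWaitMs;
    this.items = [];
    this.stats = { rejected: 0, expired: 0 };
  }

  // Returns false instead of growing past maxDepth; the caller answers 503.
  tryPush(job, now = Date.now()) {
    if (this.items.length >= this.maxDepth) {
      this.stats.rejected += 1;
      return false;
    }
    this.items.push({ job, queuedAt: now });
    return true;
  }

  // Returns the next job that has not exceeded maxWaitMs, plus how long it waited.
  // Expired jobs are handed to onExpired so the caller can answer them.
  next(now = Date.now(), onExpired = () => {}) {
    while (this.items.length > 0) {
      const item = this.items.shift();
      const waitMs = now - item.queuedAt;
      if (waitMs > this.maxWaitMs) {
        this.stats.expired += 1;
        onExpired(item.job);
        continue;
      }
      return { job: item.job, waitMs };
    }
    return null;
  }

  get depth() {
    return this.items.length;
  }
}

const queue = new BoundedQueue({ maxDepth: 2, maxWaitMs: 1000 });
queue.tryPush('job-a', 0);
queue.tryPush('job-b', 500);
console.log(queue.tryPush('job-c', 600)); // false: the queue is full
console.log(queue.next(1200));            // job-a expired, so job-b is returned
```

The `waitMs` value returned by `next` is the queue wait time, which can be recorded separately from model latency, and `stats` gives the two rejection counts a dashboard would need.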
Verify the Behavior
6. Test one tenant at the edge
Send repeated requests for the same tenant until the bucket is empty. You should see successful responses first, then a structured 429 with a retry delay.
HTTP/1.1 429 Too Many Requests
Retry-After: 12
X-RateLimit-Remaining-Tokens: 318
Content-Type: application/json
{
  "error": "tenant_rate_limit_exceeded",
  "retryMs": 11842
}
7. Test noisy-neighbor isolation
Run the same burst for tenant_a and then immediately send a small request for tenant_b. The second tenant should still pass if its own bucket has capacity.
tenant_a large request: 429 tenant_rate_limit_exceeded
tenant_b small request: 200 ok
8. Test queue saturation
Pause or slow the worker loop and send enough admitted jobs to fill the queue. The expected result is 503, not rising memory usage and not a process crash.
HTTP/1.1 503 Service Unavailable
Content-Type: application/json
{
  "error": "llm_queue_saturated",
  "retryMs": 5000
}
For production dashboards, graph these metrics together:
- tenant_limiter_rejections_total: count of 429 responses by tenant tier.
- llm_queue_depth: current pending work.
- llm_queue_wait_ms: time from admission to worker start.
- estimated_cost_tokens: admitted cost before provider call.
- actual_usage_tokens: provider-reported usage after completion.
Troubleshooting
Top 3 failure modes
- Every tenant is throttled too early: check units. A refill rate expressed per millisecond but multiplied by elapsed seconds (for example, when the timestamp passed to the script is in seconds rather than Date.now() milliseconds) will make buckets refill far too slowly.
- One tenant still affects everyone: look for shared keys such as rl:global or missing tenant IDs in background jobs.
- Latency climbs without 503s: your queue is probably unbounded. Add a hard depth limit and a maximum wait time.
Also compare estimated and actual model usage. If estimates are consistently too low, tenants will be admitted cheaply and the provider bill will drift above plan. If estimates are too high, legitimate users will see unnecessary 429 responses. Tune with percentile-based production data, not a single average prompt.
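A percentile view of the actual-to-estimated ratio can be computed from logged usage pairs. The sketch below assumes a hypothetical sample shape of `{ estimated, actual }` and uses the nearest-rank percentile method; `percentile` and `estimateDrift` are illustrative names.

```javascript
// Nearest-rank percentile over a non-empty array of numbers.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// samples: [{ estimated, actual }], a hypothetical shape for logged usage pairs.
// A ratio above 1 means the request cost more than the limiter charged.
function estimateDrift(samples) {
  const ratios = samples.map(({ estimated, actual }) => actual / estimated);
  return { p50: percentile(ratios, 50), p95: percentile(ratios, 95) };
}

const drift = estimateDrift([
  { estimated: 1000, actual: 500 },
  { estimated: 1000, actual: 800 },
  { estimated: 1000, actual: 1000 },
  { estimated: 1000, actual: 1200 },
]);
console.log(drift); // { p50: 0.8, p95: 1.2 }
```

In this toy data the median request is overestimated while the tail is underestimated, which is the kind of pattern a single average would hide.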
What's Next
Once the basic limiter is stable, make it policy-driven. Store tenant plans in a configuration service, expose remaining budget in your customer dashboard, and add separate buckets for interactive, batch, and admin workflows.
- Add soft warnings when tenants reach 80% of their rolling budget.
- Charge premium model classes with a higher multiplier.
- Use idempotency keys so client retries do not double-charge admitted work.
- Feed actual provider usage back into billing and capacity planning.
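The idempotency item can be illustrated with a small wrapper around whatever function charges the bucket. This is a sketch under stated assumptions: clients send a unique key per logical request, `createIdempotentCharger` is a hypothetical helper, and the in-memory Map stands in for shared storage with a TTL, which a multi-instance deployment would need.

```javascript
// Wraps a charge function so a repeated idempotency key replays the first
// result instead of charging again. In-memory only; for illustration.
function createIdempotentCharger(charge) {
  const seen = new Map();
  return function chargeOnce(idempotencyKey, cost) {
    if (seen.has(idempotencyKey)) {
      return { ...seen.get(idempotencyKey), replayed: true };
    }
    const result = charge(cost);
    seen.set(idempotencyKey, result);
    return { ...result, replayed: false };
  };
}

// Stand-in for the real limiter call: it only tracks total spend.
let spentTokens = 0;
const chargeOnce = createIdempotentCharger((cost) => {
  spentTokens += cost;
  return { allowed: true };
});

chargeOnce('tenant_a:req-1', 2400);
chargeOnce('tenant_a:req-1', 2400); // client retry: not charged again
console.log(spentTokens);           // 2400
```

Prefixing the key with the tenant ID, as in the usage above, keeps one tenant's keys from colliding with another's.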
The end state is not just fewer outages. It is a system where product plans, customer fairness, queue health, and LLM cost all share the same control loop.
Frequently Asked Questions
Should LLM rate limits count requests or tokens?
What HTTP status should I return when a tenant exceeds its LLM quota?
Return 429 and include Retry-After so clients know when to try again, and keep that separate from 503 responses caused by platform saturation.
Why do I need backpressure if I already have rate limiting?
Should I refund rate-limit tokens when an LLM request fails?