Tenant-Aware Rate Limiting for LLM Apps [Guide 2026]

Dillip Chowdary
Tech Entrepreneur & Innovator · June 05, 2026 · 7 min read

Bottom Line

Rate limit LLM apps by tenant and estimated tokens, not just request count. Pair token buckets with 429 backpressure so clients slow down before expensive model calls begin.

Key Takeaways

  • Use one Redis token bucket per tenant and model scope for shared fairness.
  • Debit by estimated prompt plus output tokens before calling the LLM.
  • Return 429 with Retry-After so clients back off predictably.
  • Keep refill and debit atomic with a Lua script to avoid burst races.

Tenant-aware rate limiting is the difference between a predictable LLM platform and one noisy customer burning everyone else's latency and budget. This tutorial builds a Redis-backed token bucket for per-tenant request and token budgets, then adds HTTP backpressure so clients slow down instead of retrying into an outage. The implementation is intentionally small: one Lua script, one Express middleware, and clear metrics you can wire into production.

Prerequisites

Before you start

  • Node.js application using Express or a similar middleware pipeline.
  • Redis reachable from each API instance.
  • A tenant identifier on every request, such as tenant_id from auth claims.
  • A token estimate before the LLM call, ideally prompt_tokens + max_output_tokens.
  • Basic observability for status codes, latency, and queue depth.

Numbered Steps

1. Model the tenant limit

LLM traffic is not uniform. A single request might ask for a short classification or a multi-page synthesis, so requests-per-minute limits alone are too blunt. Use two budgets, each with a burst allowance:

  • Request budget: caps request rate and protects routing, auth, and orchestration.
  • Token budget: caps provider spend and long-running generation load.
  • Burst capacity: lets a tenant briefly exceed the steady refill rate without monopolizing the system.

A practical starting policy is one bucket for estimated tokens and, if needed, a second bucket for raw requests. This tutorial focuses on the token bucket because it maps directly to LLM cost.

const tenantPlans = {
  starter: { capacity: 20_000, refillPerSecond: 200 },
  pro: { capacity: 200_000, refillPerSecond: 2_000 },
  enterprise: { capacity: 2_000_000, refillPerSecond: 20_000 }
};

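// Estimate cost before the LLM call; non-numeric or negative inputs count as 0 so the cost is never NaN.
function estimateTokens(req) {
  const toCount = (v) => (Number.isFinite(Number(v)) && Number(v) > 0 ? Math.ceil(Number(v)) : 0);
  const promptTokens = toCount(req.body?.prompt_tokens_estimate);
  const maxOutputTokens = toCount(req.body?.max_output_tokens);
  return Math.max(1, promptTokens + maxOutputTokens);
}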
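
To get a feel for what each plan allows, the sketch below derives three numbers from a plan: how long a drained bucket takes to refill, the throughput a tenant can sustain, and the largest single request the bucket can admit. `describePlan` is a hypothetical helper, not part of the tutorial's code; the values passed in are the starter plan from `tenantPlans`.

```javascript
// Hypothetical helper: derive human-readable limits from a token-bucket plan.
function describePlan({ capacity, refillPerSecond }) {
  if (!(refillPerSecond > 0)) {
    throw new RangeError('refillPerSecond must be greater than zero');
  }
  return {
    // Seconds for a fully drained bucket to return to capacity.
    secondsToFullRefill: capacity / refillPerSecond,
    // Steady-state throughput once the initial burst has been spent.
    sustainedTokensPerMinute: refillPerSecond * 60,
    // Largest single request the bucket can ever admit.
    maxSingleRequestTokens: capacity
  };
}

const starter = describePlan({ capacity: 20_000, refillPerSecond: 200 });
console.log(starter);
// secondsToFullRefill: 100, sustainedTokensPerMinute: 12000, maxSingleRequestTokens: 20000
```

Reading a plan this way makes it easier to judge whether `capacity` is large enough for the biggest prompt a tenant is expected to send.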

2. Create the Redis token bucket

Redis is a good fit because every API instance can share the same tenant state. The key detail is atomicity: read, refill, debit, and write must happen as one operation. Redis Lua scripts execute atomically, which avoids double-spending tokens during concurrent requests.

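const TOKEN_BUCKET_LUA = `
-- KEYS[1]: bucket key. ARGV: now (ms), capacity, refill per ms, cost, TTL (ms).
local key = KEYS[1]
local now = tonumber(ARGV[1])
local capacity = tonumber(ARGV[2])
local refill_per_ms = tonumber(ARGV[3])
local cost = tonumber(ARGV[4])
local ttl_ms = tonumber(ARGV[5])

-- Load the stored state; a missing bucket starts full.
local bucket = redis.call('HMGET', key, 'tokens', 'updated_at')
local tokens = tonumber(bucket[1]) or capacity
local updated_at = tonumber(bucket[2]) or now

-- Refill for the elapsed time, capped at capacity. A clock that moves backwards adds nothing.
local elapsed = math.max(0, now - updated_at)
tokens = math.min(capacity, tokens + (elapsed * refill_per_ms))

local allowed = 0
local retry_after_ms = 0

if tokens >= cost then
  -- Enough budget: debit and admit.
  tokens = tokens - cost
  allowed = 1
else
  -- Not enough: report how long until the shortfall has refilled.
  retry_after_ms = math.ceil((cost - tokens) / refill_per_ms)
end

-- Persist the new state and let idle buckets expire.
redis.call('HSET', key, 'tokens', tokens, 'updated_at', now)
redis.call('PEXPIRE', key, ttl_ms)

-- Reply: allowed flag, whole tokens remaining, wait in milliseconds.
return { allowed, math.floor(tokens), retry_after_ms }
`;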
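
The refill-and-debit arithmetic can be exercised without Redis. The sketch below is a pure-JavaScript mirror of the script's logic, intended for unit tests only; `takeTokens` and its state shape are hypothetical names, and unlike the Lua version it offers no atomicity across processes. The numbers use the pro plan (200,000 capacity, 2,000 tokens per second, which is 2 per millisecond).

```javascript
// Pure-JavaScript mirror of the Lua script's arithmetic, for unit tests only.
// It is NOT atomic across processes; production traffic should still go through Redis.
function takeTokens(state, { now, capacity, refillPerMs, cost }) {
  // A missing bucket starts full, like the "or capacity" fallback in Lua.
  const previousTokens = state?.tokens ?? capacity;
  const updatedAt = state?.updatedAt ?? now;

  // Refill for the elapsed time, never exceeding capacity.
  const elapsed = Math.max(0, now - updatedAt);
  let tokens = Math.min(capacity, previousTokens + elapsed * refillPerMs);

  let allowed = false;
  let retryAfterMs = 0;
  if (tokens >= cost) {
    tokens -= cost;
    allowed = true;
  } else {
    // Time until the shortfall has been refilled.
    retryAfterMs = Math.ceil((cost - tokens) / refillPerMs);
  }

  return { allowed, retryAfterMs, state: { tokens, updatedAt: now } };
}

const proPlan = { capacity: 200_000, refillPerMs: 2 };
const first = takeTokens(null, { ...proPlan, now: 0, cost: 150_000 });
const second = takeTokens(first.state, { ...proPlan, now: 1_000, cost: 100_000 });
console.log(first.allowed, first.state.tokens);   // true 50000
console.log(second.allowed, second.retryAfterMs); // false 24000
```

The second call shows the refill at work: one second adds 2,000 tokens to the 50,000 left over, the 100,000-token request is still 48,000 short, and at 2 tokens per millisecond that shortfall takes 24,000 ms to refill.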

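The key should include the tenant and the limit scope. Keep user-level or route-level dimensions separate only when they are real product requirements. Because the model name in this tutorial comes from the request body, validate it against a known list before building the key; otherwise a tenant can obtain a fresh, full bucket simply by sending a new model string.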

function bucketKey({ tenantId, model }) {
  return `rl:tenant:${tenantId}:model:${model}:tokens`;
}
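
A slightly more defensive key builder might look like the sketch below. Everything in it is an assumption layered on top of the tutorial: `KNOWN_MODELS` is a hypothetical allowlist, the tenant ID pattern is an example policy, and the braces around the tenant ID are a Redis Cluster hash tag, which only matters if you run Redis Cluster and later want several buckets for one tenant in the same hash slot.

```javascript
// Hypothetical allowlist; replace with the models your platform actually serves.
const KNOWN_MODELS = new Set(['default', 'support-bot']);

function safeBucketKey({ tenantId, model, scope = 'tokens' }) {
  // Example policy (an assumption): short IDs made of letters, digits, "_" and "-".
  if (typeof tenantId !== 'string' || !/^[A-Za-z0-9_-]{1,64}$/.test(tenantId)) {
    throw new TypeError('tenantId must be 1-64 characters of [A-Za-z0-9_-]');
  }
  // Unknown model names share the default bucket instead of each getting a fresh one.
  const modelScope = KNOWN_MODELS.has(model) ? model : 'default';
  // The braces form a Redis Cluster hash tag: only the tenant ID is hashed,
  // so all of a tenant's buckets map to the same slot.
  return `rl:tenant:{${tenantId}}:model:${modelScope}:${scope}`;
}

console.log(safeBucketKey({ tenantId: 'acme', model: 'support-bot' }));
// rl:tenant:{acme}:model:support-bot:tokens
```

Folding unknown models into one bucket is a design choice; rejecting them with a 400 before the limiter runs would be equally reasonable.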

3. Add backpressure middleware

Backpressure means the service gives callers a clear signal before work enters the expensive path. For public APIs, the most interoperable response is 429 plus Retry-After. For internal workers, you can also pause a queue, reduce concurrency, or shed low-priority jobs.

import express from 'express';
import { createClient } from 'redis';

const redis = createClient({ url: process.env.REDIS_URL });
await redis.connect();

function tenantRateLimit() {
  return async function rateLimitMiddleware(req, res, next) {
    // Tenant and plan must come from verified auth claims, never from the request body.
    const tenantId = req.auth?.tenant_id;
    const planName = req.auth?.plan || 'starter';
    const model = req.body?.model || 'default';

    if (!tenantId) {
      return res.status(401).json({ error: 'missing_tenant' });
    }

    const plan = tenantPlans[planName] || tenantPlans.starter;
    const cost = estimateTokens(req);
    const now = Date.now();
    const refillPerMs = plan.refillPerSecond / 1000;
    // Expire idle buckets after twice the time a full refill takes.
    const ttlMs = Math.ceil((plan.capacity / plan.refillPerSecond) * 2000);

    let result;
    try {
      // Refill and debit in one atomic Redis call.
      result = await redis.eval(TOKEN_BUCKET_LUA, {
        keys: [bucketKey({ tenantId, model })],
        arguments: [
          String(now),
          String(plan.capacity),
          String(refillPerMs),
          String(cost),
          String(ttlMs)
        ]
      });
    } catch (err) {
      // Express 4 does not catch rejections from async middleware; forward the error explicitly.
      return next(err);
    }
    const [allowed, remaining, retryAfterMs] = result;

    res.setHeader('X-RateLimit-Limit-Tokens', String(plan.capacity));
    res.setHeader('X-RateLimit-Remaining-Tokens', String(remaining));

    if (Number(allowed) !== 1) {
      // Retry-After is expressed in whole seconds, so round up and never send 0.
      const retryAfterSeconds = Math.max(1, Math.ceil(Number(retryAfterMs) / 1000));
      res.setHeader('Retry-After', String(retryAfterSeconds));
      return res.status(429).json({
        error: 'rate_limited',
        retry_after_seconds: retryAfterSeconds,
        estimated_token_cost: cost
      });
    }

    return next();
  };
}

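const app = express();
app.use(express.json());
// An auth middleware that verifies credentials and sets req.auth = { tenant_id, plan } must run before the limiter.
app.post('/v1/chat', tenantRateLimit(), async (req, res) => {
  // The LLM call would go here; this stub only confirms admission.
  res.json({ ok: true, accepted: true });
});

app.listen(3000);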


Verification and Expected Output

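Install the minimum dependencies and run your API with a Redis URL configured. The sample uses ES module imports and top-level await, so either set "type": "module" in package.json or save the file as server.mjs.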

npm install express redis
REDIS_URL=redis://localhost:6379 node server.js

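Send a request that fits inside the tenant bucket. The examples assume the server listens on port 3000 and that an auth middleware has set req.auth for a tenant on the starter plan; without req.auth the limiter responds with 401 missing_tenant.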

curl -i -X POST http://localhost:3000/v1/chat -H 'Content-Type: application/json' -d '{"model":"support-bot","prompt_tokens_estimate":1200,"max_output_tokens":400}'

Expected response:

HTTP/1.1 200 OK
X-RateLimit-Limit-Tokens: 20000
X-RateLimit-Remaining-Tokens: 18400

{"ok":true,"accepted":true}

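Then send a request larger than the bucket's capacity (100,000 estimated tokens against the starter capacity of 20,000). Such a request can never be admitted, however long the client waits, so a production limiter should reject it with a non-retryable error; this sample still answers 429, with a Retry-After computed as (100000 - 20000) / 200 = 400 seconds: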

curl -i -X POST http://localhost:3000/v1/chat -H 'Content-Type: application/json' -d '{"model":"support-bot","prompt_tokens_estimate":50000,"max_output_tokens":50000}'

Expected backpressure response, assuming the bucket has refilled to its 20,000-token capacity:

HTTP/1.1 429 Too Many Requests
Retry-After: 400
X-RateLimit-Limit-Tokens: 20000
X-RateLimit-Remaining-Tokens: 20000

{"error":"rate_limited","retry_after_seconds":400,"estimated_token_cost":100000}
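
A request whose estimated cost exceeds the bucket capacity cannot succeed by waiting, because the bucket never holds more than its capacity. One way to handle that, sketched below, is a pre-check that runs before the Redis script. `classifyCost` is a hypothetical helper, and the 413 status and error names are assumptions; 400 or 422 would be equally defensible.

```javascript
// Hypothetical pre-check that runs before the Redis script.
function classifyCost(cost, capacity) {
  if (!Number.isFinite(cost) || cost <= 0) {
    return { ok: false, status: 400, error: 'invalid_token_estimate' };
  }
  if (cost > capacity) {
    // Waiting cannot help: the bucket never holds more than `capacity`.
    return {
      ok: false,
      status: 413,
      error: 'request_exceeds_bucket_capacity',
      max_tokens: capacity
    };
  }
  // Within capacity: let the token bucket decide between 200 and 429.
  return { ok: true };
}

console.log(classifyCost(100_000, 20_000).status); // 413
console.log(classifyCost(1_600, 20_000).ok);       // true
```

Returning a non-retryable status here keeps well-behaved clients from sleeping for `Retry-After` seconds and then failing again with the same request.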

In production, verify these metrics before raising limits:

  • 429 rate by tenant: identifies tenants that need plan changes or client fixes.
  • Retry-After distribution: shows whether limits are merely smoothing bursts or blocking normal work.
  • LLM provider latency: should stabilize when overload is shifted out of the hot path.
  • Redis eval latency: should remain small compared with LLM latency.
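
As a starting point for the first two metrics, the sketch below keeps per-tenant counters in memory and derives a 429 ratio and a mean Retry-After. `createLimiterStats` is a hypothetical helper; a real service would export the same counts through whatever metrics library it already uses rather than holding them in process memory.

```javascript
// In-memory sketch; a real service would export these through its metrics library.
function createLimiterStats() {
  const byTenant = new Map();
  return {
    // Call once per limiter decision.
    record(tenantId, { allowed, retryAfterSeconds = 0 }) {
      const s = byTenant.get(tenantId) ?? { total: 0, limited: 0, retryAfterSum: 0 };
      s.total += 1;
      if (!allowed) {
        s.limited += 1;
        s.retryAfterSum += retryAfterSeconds;
      }
      byTenant.set(tenantId, s);
    },
    // Summarize one tenant; unknown tenants report zeros.
    snapshot(tenantId) {
      const s = byTenant.get(tenantId);
      if (!s) return { total: 0, limitedRatio: 0, meanRetryAfterSeconds: 0 };
      return {
        total: s.total,
        limitedRatio: s.limited / s.total,
        meanRetryAfterSeconds: s.limited ? s.retryAfterSum / s.limited : 0
      };
    }
  };
}

const stats = createLimiterStats();
stats.record('acme', { allowed: true });
stats.record('acme', { allowed: true });
stats.record('acme', { allowed: true });
stats.record('acme', { allowed: false, retryAfterSeconds: 8 });
console.log(stats.snapshot('acme'));
// total: 4, limitedRatio: 0.25, meanRetryAfterSeconds: 8
```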

Troubleshooting Top 3

1. Every request is rate limited

  • Check that refillPerSecond is greater than zero.
  • Confirm token estimates are realistic and not accidentally measured in characters.
  • Inspect the Redis key for stale values from a previous test policy.

2. Tenants bypass limits during traffic spikes

  • Make sure all API instances use the same Redis deployment.
  • Keep the script atomic; do not split refill and debit into separate client calls.
  • Use tenant IDs from verified auth claims, not user-supplied request fields.

3. Clients retry too aggressively

  • Return Retry-After on every 429.
  • Document exponential backoff with jitter for SDK users.
  • For batch jobs, queue deferred work instead of retrying immediately inside request handlers.
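
On the client side, exponential backoff with jitter can be sketched as a small pure function. The helper name, the 500 ms base, and the 60-second cap are assumptions for illustration; the jitter is added on top of the server's Retry-After so that clients that were told to wait the same number of seconds do not all return at the same instant.

```javascript
// Delay before the next retry, in milliseconds.
// `random` is injectable so the function can be tested deterministically.
function nextRetryDelayMs({
  attempt,
  retryAfterSeconds,
  baseMs = 500,
  capMs = 60_000,
  random = Math.random
}) {
  // Exponential ceiling: base * 2^attempt, clamped to capMs.
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  // Jitter: pick uniformly between 0 and the ceiling.
  const jitterMs = Math.floor(random() * ceiling);
  // Never retry earlier than the server asked.
  const serverFloorMs = Number.isFinite(retryAfterSeconds)
    ? Math.max(0, retryAfterSeconds) * 1000
    : 0;
  return serverFloorMs + jitterMs;
}

// With a fixed "random" of 0.5: ceiling 2000 ms, jitter 1000 ms, server floor 3000 ms.
console.log(nextRetryDelayMs({ attempt: 2, retryAfterSeconds: 3, random: () => 0.5 })); // 4000
```

An SDK would call this after each 429, passing the parsed `Retry-After` header and the number of attempts made so far.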

What's Next

This limiter is enough for a production first pass, but mature LLM platforms usually add policy layers around it:

  • Post-call reconciliation: debit or refund the difference between estimated and actual tokens.
  • Priority lanes: reserve capacity for interactive requests ahead of batch summarization.
  • Model-specific pricing: charge expensive models with a higher token cost multiplier.
  • Tenant dashboards: expose usage, remaining budget, and backoff guidance without support tickets.
  • Circuit breakers: combine tenant limits with provider-health checks when an upstream model slows down.
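
Model-specific pricing, for example, can be layered on without touching the Lua script by scaling the estimate before it is debited. The multipliers and the `large-reasoner` model name below are invented for illustration; real values would come from your provider's price sheet.

```javascript
// Hypothetical multipliers; a Map avoids accidental matches on object prototype keys.
const MODEL_COST_MULTIPLIER = new Map([
  ['default', 1],
  ['support-bot', 1],
  ['large-reasoner', 4]
]);

// Scale an estimated token count by the model's relative price.
function weightedCost(estimatedTokens, model) {
  const multiplier = MODEL_COST_MULTIPLIER.get(model) ?? MODEL_COST_MULTIPLIER.get('default');
  return Math.max(1, Math.ceil(estimatedTokens * multiplier));
}

console.log(weightedCost(1_600, 'support-bot'));    // 1600
console.log(weightedCost(1_600, 'large-reasoner')); // 6400
```

The weighted value would replace the plain estimate as the `cost` argument passed to the bucket, so an expensive model drains the same budget faster.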

The design rule is simple: apply fairness before the LLM call, communicate delay explicitly, and make the client part of the control loop.

Frequently Asked Questions

Should LLM rate limits count requests or tokens?
Count both when you can, but token limits are the better primary guardrail for LLM cost and latency. A short classification and a long generation should not consume the same quota. Use request limits for orchestration pressure and token limits for model spend.
Why use Redis Lua for a token bucket?
The bucket must refill, check capacity, debit tokens, and persist state atomically. Redis Lua scripts execute as one operation, so concurrent API instances cannot spend the same remaining tokens twice.
What should an API return when a tenant exceeds an LLM limit?
Return 429 and include Retry-After in seconds. The response body should also include a machine-readable error and the estimated token cost so SDKs can log or adapt behavior.
How do you handle actual token usage after the LLM call?
Start by debiting estimated tokens before the call, then reconcile after the provider returns usage. If actual usage is lower, refund the difference; if it is higher, debit the extra amount or carry it into the next request.
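
The reconciliation arithmetic can be sketched as two pure functions. The names and the `promptTokens` / `completionTokens` usage shape are assumptions, since provider responses differ; applying the delta to the Redis bucket would need its own atomic script, which is not shown here.

```javascript
// Positive result: debit this many extra tokens. Negative: refund. Zero: nothing to do.
function reconciliationDelta(estimatedTokens, actualUsage) {
  const actual = actualUsage.promptTokens + actualUsage.completionTokens;
  return actual - estimatedTokens;
}

// Apply a delta to a bucket balance.
function applyDelta(tokens, delta, capacity) {
  // Refunds are capped at capacity; extra debits may push the balance below zero,
  // which carries the overage into the tenant's next requests.
  return Math.min(capacity, tokens - delta);
}

// Estimated 1,600 tokens, provider reported 1,350: refund 250.
const delta = reconciliationDelta(1_600, { promptTokens: 1_200, completionTokens: 150 });
console.log(delta);                             // -250
console.log(applyDelta(18_400, delta, 20_000)); // 18650
```

Letting the balance go negative is one way to "carry it into the next request": the refill has to pay off the overage before new work is admitted.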
