Tenant-Aware Rate Limiting for LLM Apps [2026]

Dillip Chowdary
Tech Entrepreneur & Innovator · June 18, 2026 · 7 min read

Bottom Line

A tenant-aware token bucket lets every customer spend from a fair, isolated budget. Backpressure turns overload into predictable waits or fast failures instead of runaway queues and surprise LLM bills.

Key Takeaways

  • Limit by tenant, not only by API key or global process
  • Use tokens for both request count and estimated LLM cost
  • Return 429 for exhausted budgets and 503 for overloaded queues
  • Keep queue depth, wait time, and token rejections in metrics

LLM applications fail differently from ordinary APIs: one noisy tenant can consume shared model capacity, increase latency for everyone, and create a real invoice before dashboards catch up. A tenant-aware limiter fixes that by charging each customer against an isolated budget before work enters the expensive path. In this tutorial, you will build a token-bucket limiter and add backpressure so overload becomes observable, bounded, and recoverable.

Prerequisites

Bottom Line

Use one token bucket per tenant and model tier, then apply queue backpressure after admission. Rate limiting protects customer fairness; backpressure protects your workers.

You need a basic Node.js HTTP service, a Redis-compatible datastore for shared counters, and a clear tenant identifier on every authenticated request. The examples use plain JavaScript so the mechanics are visible, but the same pattern works in TypeScript, Go, Python, or any stack that can execute an atomic Redis script.

Pro tip: Before logging prompts, redact personal or customer data with a workflow like the Data Masking Tool. Rate-limit logs often contain request metadata that should not become a privacy incident.

The tutorial uses these terms:

  • Tenant ID: a stable customer, workspace, or organization identifier.
  • Capacity: maximum burst budget a tenant can spend immediately.
  • Refill rate: how many cost tokens return per second.
  • Cost estimate: predicted prompt plus completion token usage.
  • Queue limit: maximum in-flight or waiting LLM jobs allowed globally.

Build the Limiter

1. Model cost as limiter tokens

A request-count limiter treats a 100-token chat and a 100,000-token summarization job as equal. LLM systems need a cost-aware unit. Start with an estimate that is intentionally conservative: prompt tokens you already know, plus a configured maximum completion size.

function estimateCostTokens({ promptTokens, maxOutputTokens, modelMultiplier = 1 }) {
  return Math.ceil((promptTokens + maxOutputTokens) * modelMultiplier);
}

const cost = estimateCostTokens({
  promptTokens: 1800,
  maxOutputTokens: 600,
  modelMultiplier: 1
});

Keep the model multiplier in configuration. Expensive model tiers can charge more limiter tokens without changing the rest of the system.
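One way to keep the multiplier in configuration is a lookup table keyed by model class. The sketch below is illustrative only: the `MODEL_MULTIPLIERS` table, its values, and the rule that unknown classes pay the highest rate are assumptions, not figures from a real pricing sheet.

```javascript
// Hypothetical multipliers per model class; load these from configuration in practice.
const MODEL_MULTIPLIERS = Object.freeze({ standard: 1, premium: 4 });

function estimateCostTokens({ promptTokens, maxOutputTokens, modelMultiplier = 1 }) {
  return Math.ceil((promptTokens + maxOutputTokens) * modelMultiplier);
}

// Unknown classes are charged at the highest configured rate so a typo
// in the model class can never make a request cheaper.
function multiplierFor(modelClass) {
  if (Object.prototype.hasOwnProperty.call(MODEL_MULTIPLIERS, modelClass)) {
    return MODEL_MULTIPLIERS[modelClass];
  }
  return Math.max(...Object.values(MODEL_MULTIPLIERS));
}

const premiumCost = estimateCostTokens({
  promptTokens: 1800,
  maxOutputTokens: 600,
  modelMultiplier: multiplierFor('premium')
});
```

Changing a tier's price then means editing one table entry; the limiter, the Lua script, and the middleware stay untouched.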

2. Store one bucket per tenant

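The bucket key should include the tenant and, when useful, the model class. Keying by tenant stops one customer's batch job from draining another customer's allowance, and adding the model class stops heavy use of one model tier from draining the same tenant's budget for another tier.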

function bucketKey({ tenantId, modelClass }) {
  return `rl:${tenantId}:${modelClass}`;
}
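A missing tenant ID would silently produce a shared key such as `rl:undefined:standard`, and an ID containing `:` could collide with another tenant's key. A minimal guard, assuming IDs are restricted to the character set and length shown (both are illustrative choices, and `safeBucketKey` is a hypothetical name):

```javascript
// Allow only characters that cannot collide with the ':' separator.
// The character set and the 64-character cap are assumptions for this sketch.
const SAFE_PART = /^[A-Za-z0-9_-]{1,64}$/;

function safeBucketKey({ tenantId, modelClass }) {
  for (const [name, value] of Object.entries({ tenantId, modelClass })) {
    if (typeof value !== 'string' || !SAFE_PART.test(value)) {
      throw new TypeError(`invalid ${name} for rate-limit key`);
    }
  }
  return `rl:${tenantId}:${modelClass}`;
}
```

Failing loudly here is deliberate: a request without a valid tenant should be rejected before it can spend from a bucket that other callers share.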

3. Use an atomic token-bucket script

The limiter must be atomic because multiple app instances may admit requests at the same time. This Redis Lua script refills the bucket based on elapsed time, charges the request cost, and returns whether the request is allowed.

const TOKEN_BUCKET_LUA = `
-- KEYS[1]: bucket key. ARGV: now (ms), capacity, refill per ms, request cost.
local key = KEYS[1]
local now = tonumber(ARGV[1])
local capacity = tonumber(ARGV[2])
local refill_per_ms = tonumber(ARGV[3])
local cost = tonumber(ARGV[4])

-- A bucket that does not exist yet starts full.
local bucket = redis.call('HMGET', key, 'tokens', 'updated_at')
local tokens = tonumber(bucket[1]) or capacity
local updated_at = tonumber(bucket[2]) or now

-- Refill for the elapsed time, capped at capacity.
local elapsed = math.max(0, now - updated_at)
tokens = math.min(capacity, tokens + elapsed * refill_per_ms)

local allowed = 0
local retry_ms = 0
if tokens >= cost then
  allowed = 1
  tokens = tokens - cost
else
  -- Time until enough tokens have refilled to cover this request.
  retry_ms = math.ceil((cost - tokens) / refill_per_ms)
end

-- Persist the state; an idle bucket expires after one full refill period,
-- which is equivalent to it being full again.
redis.call('HMSET', key, 'tokens', tokens, 'updated_at', now)
redis.call('PEXPIRE', key, math.ceil(capacity / refill_per_ms))
return {allowed, math.floor(tokens), retry_ms}
`;
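To check the arithmetic without a Redis instance, the same refill-and-charge logic can be mirrored in plain JavaScript. This is a test aid under stated assumptions, not a replacement for the script: `createMemoryBucket` is a hypothetical helper, it keeps state in one process only, and it takes the clock as an argument so the example is deterministic.

```javascript
// In-memory mirror of the token-bucket arithmetic, for unit tests only.
// It is not shared across processes, so it cannot replace the Redis script.
function createMemoryBucket({ capacity, refillPerMs }) {
  let tokens = capacity; // a new bucket starts full
  let updatedAt = null;

  return function take(cost, now) {
    const elapsed = updatedAt === null ? 0 : Math.max(0, now - updatedAt);
    tokens = Math.min(capacity, tokens + elapsed * refillPerMs);
    updatedAt = now;

    if (tokens < cost) {
      const retryMs = Math.ceil((cost - tokens) / refillPerMs);
      return { allowed: false, remaining: Math.floor(tokens), retryMs };
    }

    tokens -= cost;
    return { allowed: true, remaining: Math.floor(tokens), retryMs: 0 };
  };
}

const take = createMemoryBucket({ capacity: 1000, refillPerMs: 1 });
const first = take(600, 0);    // spends from the full bucket, 400 left
const second = take(600, 100); // refilled to 500, still short by 100 tokens
const third = take(600, 200);  // refilled to exactly 600, so admitted
```

With a refill of one token per millisecond, the rejected call reports a 100 ms retry delay, and the retry at that point succeeds with nothing left over.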

4. Wrap admission in middleware

The middleware estimates cost, checks the tenant bucket, and rejects exhausted tenants with 429. Include Retry-After so well-behaved clients can slow down.

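async function rateLimitTenant(req, res, next) {
  const tenantId = req.auth.tenantId;
  const modelClass = req.body.modelClass || 'standard';
  const { promptTokens, maxOutputTokens } = req.body;

  // Reject missing or malformed counts so NaN never reaches the bucket.
  // Client-reported counts can be understated; count server-side where possible.
  const validCount = (n) => Number.isInteger(n) && n >= 0;
  if (!validCount(promptTokens) || !validCount(maxOutputTokens)) {
    return res.status(400).json({ error: 'invalid_token_counts' });
  }

  const cost = estimateCostTokens({
    promptTokens,
    maxOutputTokens,
    modelMultiplier: modelClass === 'premium' ? 4 : 1
  });

  const capacity = 120000; // burst budget in limiter tokens
  const refillPerMinute = 60000;
  const refillPerMs = refillPerMinute / 60000; // the Lua script expects tokens per millisecond

  let allowed, remaining, retryMs;
  try {
    // Argument order follows ioredis-style clients (script, key count, keys, args);
    // adjust the call for other Redis clients.
    [allowed, remaining, retryMs] = await redis.eval(
      TOKEN_BUCKET_LUA,
      1,
      bucketKey({ tenantId, modelClass }),
      Date.now(),
      capacity,
      refillPerMs,
      cost
    );
  } catch (err) {
    // Pass Redis failures to the error handler instead of leaving the request hanging.
    return next(err);
  }

  res.setHeader('X-RateLimit-Remaining-Tokens', String(remaining));

  if (!allowed) {
    // Retry-After is expressed in whole seconds, so round up.
    res.setHeader('Retry-After', String(Math.ceil(retryMs / 1000)));
    return res.status(429).json({
      error: 'tenant_rate_limit_exceeded',
      retryMs
    });
  }

  req.limiterCost = cost;
  next();
}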
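One edge case follows from the bucket arithmetic: because refill stops at capacity, a request whose cost exceeds the capacity can never be admitted, and a 429 with Retry-After would invite endless retries. A possible pre-check is sketched below. The `admissionPrecheck` helper, the error names, and the choice of 413 are assumptions for illustration.

```javascript
// Runs before the bucket check. A request larger than the whole bucket
// cannot succeed by waiting, so it gets a non-retryable status.
function admissionPrecheck({ cost, capacity }) {
  if (!Number.isFinite(cost) || cost <= 0) {
    return { ok: false, status: 400, error: 'invalid_cost_estimate' };
  }
  if (cost > capacity) {
    return { ok: false, status: 413, error: 'request_exceeds_tenant_capacity' };
  }
  return { ok: true };
}
```

With the capacity of 120000 used above, a premium request estimated at 50000 raw tokens would cost 200000 limiter tokens and be turned away here rather than retried forever.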

Add Backpressure

5. Separate tenant fairness from system saturation

Rate limiting answers, “Can this tenant spend more?” Backpressure answers, “Can the platform accept more work right now?” You need both, because a fair tenant can still arrive during a regional outage, a model provider slowdown, or a worker deployment.

  • Use 429 when the tenant has exhausted its configured budget.
  • Use 503 when the shared queue or worker pool is saturated.
  • Track queue wait time separately from model latency.
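  • Prefer bounded queues over unbounded promises in memory.

const MAX_QUEUE_DEPTH = 500;
const MAX_WAIT_MS = 30_000;
const llmQueue = [];

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

function enqueueWithBackpressure(job, res) {
  // Fail fast when the shared queue is full instead of buffering without bound.
  if (llmQueue.length >= MAX_QUEUE_DEPTH) {
    return res.status(503).json({
      error: 'llm_queue_saturated',
      retryMs: 5000
    });
  }

  const queuedAt = Date.now();
  llmQueue.push({ job, res, queuedAt });
}

async function workerLoop() {
  while (true) {
    const item = llmQueue.shift();
    if (!item) {
      await sleep(25);
      continue;
    }

    // Drop jobs that waited too long rather than spending model capacity on them.
    if (Date.now() - item.queuedAt > MAX_WAIT_MS) {
      item.res.status(503).json({ error: 'llm_queue_timeout' });
      continue;
    }

    try {
      const result = await callModelProvider(item.job);
      item.res.json(result);
    } catch (err) {
      // One failed provider call must not stop the loop or leave the response open.
      // 502 is one reasonable status for an upstream failure.
      item.res.status(502).json({ error: 'llm_provider_error' });
    }
  }
}
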
Watch out: Do not refund limiter tokens automatically on every provider failure. A tenant that repeatedly submits oversized or invalid jobs can otherwise bypass fairness by forcing retries.
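A selective refund policy can be expressed as a small lookup. The failure categories and the `shouldRefund` and `refundAmount` helpers below are hypothetical labels chosen for illustration; map them to whatever error taxonomy your provider client exposes.

```javascript
// Hypothetical failure categories. Only provider-side transient faults are
// refunded; validation errors, cancellations, and oversized jobs stay charged.
const REFUNDABLE = new Set(['provider_timeout', 'provider_5xx']);

function shouldRefund(failureKind) {
  return REFUNDABLE.has(failureKind);
}

function refundAmount({ failureKind, chargedTokens }) {
  return shouldRefund(failureKind) ? chargedTokens : 0;
}
```

Keeping the allow-list explicit means a new failure type is charged by default until someone decides it deserves a refund.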

Verify the Behavior

6. Test one tenant at the edge

Send repeated requests for the same tenant until the bucket is empty. You should see successful responses first, then a structured 429 with a retry delay.

HTTP/1.1 429 Too Many Requests
Retry-After: 12
X-RateLimit-Remaining-Tokens: 318
Content-Type: application/json

{
  "error": "tenant_rate_limit_exceeded",
  "retryMs": 11842
}
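On the client side, a well-behaved caller can turn that response into a wait time. The sketch below assumes a hypothetical `retryDelayMs` helper, a headers object with lowercase keys, and a Retry-After value given in seconds (the header may also carry an HTTP date, which this sketch does not parse). The one-second fallback is an arbitrary choice.

```javascript
// Returns how long a client should wait before retrying,
// or null when the response is not a retryable overload signal.
function retryDelayMs({ status, headers = {}, body = {} }) {
  if (status !== 429 && status !== 503) return null;

  // Prefer the millisecond hint in the JSON body when present.
  if (Number.isFinite(body.retryMs) && body.retryMs >= 0) {
    return body.retryMs;
  }

  // Fall back to Retry-After, read here as delay-seconds only.
  const seconds = Number(headers['retry-after']);
  if (Number.isFinite(seconds) && seconds >= 0) return seconds * 1000;

  return 1000; // assumed default when the server gives no hint
}
```

For the response above, the body hint gives 11842 ms; a client that only reads the header would wait the rounded-up 12 seconds instead.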

7. Test noisy-neighbor isolation

Run the same burst for tenant_a and then immediately send a small request for tenant_b. The second tenant should still pass if its own bucket has capacity.

tenant_a large request: 429 tenant_rate_limit_exceeded
tenant_b small request: 200 ok

8. Test queue saturation

Pause or slow the worker loop and send enough admitted jobs to fill the queue. The expected result is 503, not rising memory usage and not a process crash.

HTTP/1.1 503 Service Unavailable
Content-Type: application/json

{
  "error": "llm_queue_saturated",
  "retryMs": 5000
}

For production dashboards, graph these metrics together:

  • tenant_limiter_rejections_total: count of 429 responses by tenant tier.
  • llm_queue_depth: current pending work.
  • llm_queue_wait_ms: time from admission to worker start.
  • estimated_cost_tokens: admitted cost before provider call.
  • actual_usage_tokens: provider-reported usage after completion.

Troubleshooting

Top 3 failure modes

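  1. Every tenant is throttled too early: check units. The script expects the refill rate in tokens per millisecond. A per-second rate divided by 60000 instead of 1000 makes buckets refill 60 times too slowly, while a per-minute rate passed without conversion refills far too fast and effectively disables the limit.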
  2. One tenant still affects everyone: look for shared keys such as rl:global or missing tenant IDs in background jobs.
  3. Latency climbs without 503s: your queue is probably unbounded. Add a hard depth limit and a maximum wait time.
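Unit mistakes are easier to avoid when the conversion lives in one place. A small sketch, assuming hypothetical helper names `toRefillPerMs` and `fullRefillMs` and a configuration that states the rate as a token count per named unit:

```javascript
const MS_PER_UNIT = new Map([
  ['second', 1000],
  ['minute', 60_000],
  ['hour', 3_600_000]
]);

// Converts "tokens per unit" into the tokens-per-millisecond value the script expects.
function toRefillPerMs({ tokens, per }) {
  const ms = MS_PER_UNIT.get(per);
  if (ms === undefined) throw new RangeError(`unknown refill unit: ${per}`);
  if (!(tokens > 0)) throw new RangeError('refill tokens must be positive');
  return tokens / ms;
}

// Time for an empty bucket to refill completely, useful as a sanity check.
function fullRefillMs({ capacity, refillPerMs }) {
  return Math.ceil(capacity / refillPerMs);
}
```

For the values used in the middleware, 60000 tokens per minute converts to 1 token per millisecond, and an empty 120000-token bucket refills in 120000 ms, or two minutes. A result of a few milliseconds or several hours is a strong hint that a unit is wrong.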

Also compare estimated and actual model usage. If estimates are consistently too low, tenants will be admitted cheaply and the provider bill will drift above plan. If estimates are too high, legitimate users will see unnecessary 429 responses. Tune with percentile-based production data, not a single average prompt.
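One way to make that comparison concrete is to track the ratio of actual to estimated tokens and look at its percentiles. The sketch below uses a nearest-rank percentile and made-up sample values; the field names `estimatedTokens` and `actualTokens` are assumptions, not a provider schema.

```javascript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values, p) {
  if (values.length === 0) throw new RangeError('no samples');
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Ratio above 1 means the estimate undercharged; below 1 means it overcharged.
function usageRatios(samples) {
  return samples
    .filter((s) => s.estimatedTokens > 0)
    .map((s) => s.actualTokens / s.estimatedTokens);
}

// Made-up samples for illustration only.
const samples = [
  { estimatedTokens: 2000, actualTokens: 1000 },
  { estimatedTokens: 2000, actualTokens: 1200 },
  { estimatedTokens: 2000, actualTokens: 1500 },
  { estimatedTokens: 2000, actualTokens: 1900 }
];
const ratios = usageRatios(samples);
const p50 = percentile(ratios, 50); // 0.6
const p95 = percentile(ratios, 95); // 0.95
```

In this made-up data the median request uses 60% of what it was charged while the 95th percentile uses 95%, which suggests the estimate is safe at the tail but generous for typical traffic.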

What's Next

Once the basic limiter is stable, make it policy-driven. Store tenant plans in a configuration service, expose remaining budget in your customer dashboard, and add separate buckets for interactive, batch, and admin workflows.

  • Add soft warnings when tenants reach 80% of their rolling budget.
  • Charge premium model classes with a higher multiplier.
  • Use idempotency keys so client retries do not double-charge admitted work.
  • Feed actual provider usage back into billing and capacity planning.
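As an illustration of the idempotency item, the sketch below remembers admitted charges by key so a client retry replays the earlier result instead of spending again. It is a single-process sketch under stated assumptions: `createIdempotentCharger` and the `charge` callback are hypothetical, the in-memory `Map` stands in for a shared store, expired entries are never evicted, and two concurrent calls with the same key could still both charge. A production version would need an atomic set-if-absent in the shared store, and keys should be scoped per tenant.

```javascript
// Wraps a charge function so repeated calls with the same key charge only once.
function createIdempotentCharger({ charge, ttlMs = 10 * 60_000 }) {
  const seen = new Map(); // idempotency key -> { result, expiresAt }

  return async function chargeOnce(idempotencyKey, cost, now = Date.now()) {
    const cached = seen.get(idempotencyKey);
    if (cached && cached.expiresAt > now) {
      return { ...cached.result, replayed: true };
    }

    const result = await charge(cost);
    // Remember only admitted work; a rejected attempt is re-evaluated on retry.
    if (result.allowed) {
      seen.set(idempotencyKey, { result, expiresAt: now + ttlMs });
    }
    return { ...result, replayed: false };
  };
}
```

The `now` parameter exists so tests can control the clock; callers would normally omit it.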

The end state is not just fewer outages. It is a system where product plans, customer fairness, queue health, and LLM cost all share the same control loop.

Frequently Asked Questions

Should LLM rate limits count requests or tokens?
For LLM applications, token-aware limits are usually better because requests can vary by orders of magnitude in cost. A practical design charges estimated prompt plus maximum output tokens before admission, then records actual provider usage afterward.

What HTTP status should I return when a tenant exceeds its LLM quota?
Return 429 Too Many Requests when the tenant bucket is empty. Include Retry-After so clients know when to try again, and keep that separate from 503 responses caused by platform saturation.

Why do I need backpressure if I already have rate limiting?
Rate limiting protects fairness between tenants, but it does not guarantee your worker pool is healthy. Backpressure caps shared queue depth and wait time so provider slowdowns or deployment issues fail predictably instead of consuming memory.

Should I refund rate-limit tokens when an LLM request fails?
Refund selectively, not automatically. Provider-side transient failures may deserve a refund, but validation errors, client cancellations, and repeated oversized jobs should usually remain charged to prevent abuse.
