Adaptive Rate Limiting for AI APIs [Deep Dive 2026]

Dillip Chowdary
Tech Entrepreneur & Innovator · May 09, 2026 · 9 min read

Bottom Line

For variable-cost inference, request-per-minute limits are not enough. The stable pattern is to reserve capacity on estimated token cost, then reconcile the reservation against actual usage as soon as the provider responds.

Key Takeaways

  • Charge requests by estimated token cost, not request count, when model cost varies widely.
  • Reserve against input + expected output, then refund or debit the delta from actual usage.
  • Use Redis for atomic bucket updates so concurrent workers cannot oversubscribe shared capacity.
  • Read provider headers like x-ratelimit-remaining-tokens as feedback, not your only control loop.

Variable-cost AI endpoints break naive rate limiting because one request might consume a few hundred tokens while the next burns through a large prompt plus a long completion. If you only cap requests per minute, you still get surprise 429s, uneven tenant fairness, and batch jobs starving interactive traffic. The fix is to treat each call as a weighted event: reserve budget up front, execute, then reconcile the reservation against real usage from the provider response.

Why adaptive rate limiting matters

Most AI providers enforce more than one quota. OpenAI, for example, documents limits for both requests and tokens, and returns headers such as x-ratelimit-remaining-tokens and x-ratelimit-reset-tokens. That means your application has to protect three things at once:

  • Upstream provider quotas, so you do not stampede into provider-side throttling.
  • Internal fairness, so one tenant or batch job cannot consume the whole token budget.
  • User experience, so interactive requests get predictable latency even when workloads spike.
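
When you read those headers back, treat them as telemetry for tuning refill and burst, not as the admission decision itself. A minimal logging sketch, assuming a fetch-style Headers object; the header names come from OpenAI's documentation, the log shape is our own:

// Sketch: surface provider rate-limit headers as tuning telemetry.
export function logProviderQuota(tenantId: string, headers: Headers): void {
  const remainingTokens = headers.get("x-ratelimit-remaining-tokens");
  const resetTokens = headers.get("x-ratelimit-reset-tokens");

  console.log(
    `[provider-quota] tenant=${tenantId} ` +
      `remaining_tokens=${remainingTokens} reset_tokens=${resetTokens}`
  );
}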

The pattern in this tutorial is deliberately simple:

  • Estimate cost from input size, model class, and max_output_tokens.
  • Atomically reserve capacity from a shared Redis bucket.
  • Send the inference call only if the reservation succeeds.
  • Read the provider response usage object and adjust the bucket by the delta.

Prerequisites

  • A Node.js service in front of your inference provider.
  • Redis available to all application instances.
  • A rough token estimate before dispatch. If you already count tokens, use that. If not, use a conservative heuristic (a sketch follows this list).
  • Provider responses that expose actual usage. OpenAI's Responses API returns usage.input_tokens, usage.output_tokens, and usage.total_tokens.
  • Basic observability: request logs, success rate, 429 counts, and queue latency.
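
If no tokenizer is wired in yet, a character-count heuristic is a workable conservative fallback. A minimal sketch, assuming roughly four characters per token for English text; the safety factor is our own padding:

// Sketch: rough fallback estimator. ~4 chars per token is a common rule of
// thumb for English; the 1.2 factor is an assumed conservative pad.
export function roughInputTokens(text: string): number {
  const approxTokens = Math.ceil(text.length / 4);
  return Math.ceil(approxTokens * 1.2);
}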

If you need to share example payloads or tenant logs during rollout, scrub them first with TechBytes' Data Masking Tool. For the code blocks below, the Code Formatter is useful when you adapt the snippets to your own middleware stack.

Step 1: Estimate request cost

The estimate does not need to be perfect. It needs to be conservative enough that you rarely overspend the bucket before reconciliation runs.

  1. Count or estimate input tokens.
  2. Reserve some portion of expected output instead of the full ceiling; otherwise you underutilize capacity.
  3. Multiply by a model weight if some models are operationally more expensive or sit behind tighter provider quotas.
type InferenceRequest = {
  tenantId: string;
  model: string;
  inputTokens: number;
  maxOutputTokens: number;
  priority: "interactive" | "batch";
};

const MODEL_MULTIPLIER: Record<string, number> = {
  "gpt-4.1": 1.0,
  "gpt-4o": 0.8,
  "o3": 1.3
};

export function estimateCostUnits(req: InferenceRequest): number {
  // Reserve ~80% of the output ceiling; most completions stop short of it.
  const outputReserve = Math.ceil(req.maxOutputTokens * 0.8);
  const baseTokens = req.inputTokens + outputReserve;
  const modelWeight = MODEL_MULTIPLIER[req.model] ?? 1.0;
  // Charge batch a premium so it needs more free headroom to get admitted
  // and backs off before interactive traffic under pressure.
  const priorityWeight = req.priority === "batch" ? 1.3 : 1.0;

  return Math.max(1, Math.ceil(baseTokens * modelWeight * priorityWeight));
}

Two implementation notes matter here:

  • OpenAI documents max_output_tokens as an upper bound on generated tokens, including visible output and reasoning tokens. That makes it the right knob for reservation logic.
  • Do not reserve the full ceiling unless your completions regularly hit it. Reserving around 70-90% of the ceiling is usually a better starting point than either 0% or 100%.
Pro tip: Keep estimation policy in one function and version it. Most rollout pain comes from multiple services each inventing their own cost math.
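
One way to make that concrete is to pin the constants behind a versioned policy object, so a change to the cost math ships as an explicit new version rather than a silent edit. A minimal sketch; the field names and version string are assumptions:

// Sketch: one versioned source of truth for cost math. Log policy.version
// with every admission decision so estimator changes can be correlated
// with reject rates and reconciliation deltas.
type EstimationPolicy = {
  version: string;
  outputReservePct: number; // fraction of max_output_tokens to reserve
  modelMultiplier: Record<string, number>;
  defaultMultiplier: number;
};

export const ESTIMATION_POLICY: EstimationPolicy = {
  version: "2026-05-01",
  outputReservePct: 0.8,
  modelMultiplier: { "gpt-4.1": 1.0, "gpt-4o": 0.8, "o3": 1.3 },
  defaultMultiplier: 1.0
};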

Step 2: Enforce a weighted bucket

Use a token bucket, but charge it with weighted cost units instead of one token per request. Redis is a good fit because atomic updates are straightforward and every worker sees the same budget.

  1. Create one bucket per tenant, model family, or traffic class.
  2. Refill continuously based on your allowed tokens per second.
  3. Subtract the estimated cost only if enough capacity remains.
-- KEYS[1] = bucket key
-- ARGV[1] = nowSeconds
-- ARGV[2] = refillPerSecond
-- ARGV[3] = burstCapacity
-- ARGV[4] = estimatedCost
-- ARGV[5] = ttlSeconds
local key = KEYS[1]
local now = tonumber(ARGV[1])
local refill = tonumber(ARGV[2])
local burst = tonumber(ARGV[3])
local cost = tonumber(ARGV[4])
local ttl = tonumber(ARGV[5])

local data = redis.call("HMGET", key, "tokens", "ts")
local tokens = tonumber(data[1])
local ts = tonumber(data[2])

if tokens == nil then tokens = burst end
if ts == nil then ts = now end

local elapsed = math.max(0, now - ts)
tokens = math.min(burst, tokens + (elapsed * refill))

local allowed = 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
end

redis.call("HMSET", key, "tokens", tokens, "ts", now)
redis.call("EXPIRE", key, ttl)
return { allowed, math.floor(tokens) }
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL!);
const reserveScript = `...lua above...`;

export async function reserveBudget(params: {
  tenantId: string;
  model: string;
  estimatedCost: number;
}) {
  const now = Math.floor(Date.now() / 1000);
  const key = `rl:${params.tenantId}:${params.model}`;
  const refillPerSecond = 2000;
  const burstCapacity = 12000;
  const ttlSeconds = 120;

  const [allowed, remaining] = (await redis.eval(
    reserveScript,
    1,
    key,
    now,
    refillPerSecond,
    burstCapacity,
    params.estimatedCost,
    ttlSeconds
  )) as [number, number];

  return { allowed: allowed === 1, remaining };
}

This is the critical property you want: concurrent workers cannot admit more traffic than the shared budget allows.
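
You can check that property locally by firing reservations in parallel and counting admissions. A quick sketch, assuming the reserveBudget function above with its burstCapacity of 12000:

// Sketch: 50 concurrent reservations against one shared bucket.
// With burstCapacity = 12000 and cost 1000, roughly 12 should be admitted
// (slightly more if refill ticks during the test), never all 50.
async function demoSharedBudget(): Promise<void> {
  const results = await Promise.all(
    Array.from({ length: 50 }, () =>
      reserveBudget({ tenantId: "demo", model: "gpt-4.1", estimatedCost: 1000 })
    )
  );
  const admitted = results.filter((r) => r.allowed).length;
  console.log(`admitted=${admitted} of 50`);
}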

Step 3: Reconcile and verify

Once the request completes, compare your reservation to actual usage. OpenAI's response object exposes usage.input_tokens, usage.output_tokens, and usage.total_tokens, so you can settle the exact delta.

  1. Reserve before dispatch.
  2. Call the provider.
  3. Read usage.total_tokens from the response.
  4. Debit extra usage or refund unused capacity.
const adjustScript = `
local key = KEYS[1]
local delta = tonumber(ARGV[1])
local burst = tonumber(ARGV[2])
local ttl = tonumber(ARGV[3])
local tokens = tonumber(redis.call("HGET", key, "tokens"))
if tokens == nil then tokens = burst end

-- Adjust only the balance. Leaving "ts" untouched means the next
-- reservation still credits the refill accrued since the last update.
tokens = math.max(0, math.min(burst, tokens - delta))
redis.call("HSET", key, "tokens", tokens)
redis.call("EXPIRE", key, ttl)
return math.floor(tokens)
`;

async function runInference(req: InferenceRequest, client: any) {
  const estimatedCost = estimateCostUnits(req);
  const bucket = await reserveBudget({
    tenantId: req.tenantId,
    model: req.model,
    estimatedCost
  });

  if (!bucket.allowed) {
    return { status: 429, body: { error: "Local rate limit exceeded" } };
  }

  const response = await client.responses.create({
    model: req.model,
    input: "...",
    max_output_tokens: req.maxOutputTokens
  });

  const actualCost = response.usage?.total_tokens ?? estimatedCost;
  // estimatedCost includes the model and priority weights, so settling
  // against raw total_tokens refunds those premiums after the call; the
  // weights shape admission headroom rather than steady-state spend.
  const delta = actualCost - estimatedCost;

  await redis.eval(
    adjustScript,
    1,
    `rl:${req.tenantId}:${req.model}`,
    delta,
    12000,
    120
  );

  return { status: 200, body: response };
}

Verification and expected output

Run a burst test with mixed request sizes. Your logs should show small and large calls consuming different amounts from the same bucket.

[rate-limit] tenant=acme model=gpt-4.1 estimated=1800 allowed=true remaining=9200
[inference]  tenant=acme usage.total_tokens=1462 delta=-338 refunded=true
[rate-limit] tenant=acme model=gpt-4.1 estimated=6400 allowed=true remaining=3138
[inference]  tenant=acme usage.total_tokens=7025 delta=625 debited=true
[rate-limit] tenant=acme model=gpt-4.1 estimated=5100 allowed=false remaining=480
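
The burst itself can come from a throwaway harness. A sketch, assuming the runInference function from Step 3; the token mix is arbitrary:

// Sketch: mixed-size burst against one tenant. Large and small calls
// should draw visibly different amounts from the same bucket.
async function burstTest(client: any): Promise<void> {
  const inputSizes = [300, 4000, 800, 6000, 1200];
  await Promise.all(
    inputSizes.map((inputTokens) =>
      runInference(
        {
          tenantId: "acme",
          model: "gpt-4.1",
          inputTokens,
          maxOutputTokens: 2000,
          priority: "interactive"
        },
        client
      )
    )
  );
}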

If the limiter is working, you should observe:

  • Fewer provider-side 429s during burst traffic.
  • Stable admission behavior across multiple app instances.
  • Higher utilization than a blunt request-count limiter because small requests no longer block behind large ones.

Troubleshooting: top 3 issues

  • Too many false rejects: Your estimate is too conservative or your burst capacity is too low. Lower the output reserve percentage before raising the burst blindly.
  • Provider still returns 429: Your local budget and upstream budget are drifting. Feed provider headers like x-ratelimit-remaining-tokens and x-ratelimit-reset-tokens into logs so you can tune refill rate and burst.
  • Batch traffic starves interactive traffic: Split buckets by traffic class or charge batch requests a higher cost weight so they back off first, instead of sharing one undifferentiated pool.
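
For the bucket split, the cheapest version is to fold the traffic class into the Redis key so each class draws from its own budget with its own refill and burst. A sketch of the key scheme, extending the key format used earlier:

// Sketch: per-class buckets instead of one undifferentiated pool.
function bucketKey(
  tenantId: string,
  model: string,
  priority: "interactive" | "batch"
): string {
  return `rl:${tenantId}:${model}:${priority}`; // e.g. rl:acme:gpt-4.1:batch
}
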
Watch out: Do not wait minutes to reconcile usage. Delayed settlement makes the limiter look correct in code while quietly overspending your real upstream token budget.
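
A related failure mode: if the provider call throws, the reservation is never settled and the budget leaks until the bucket TTL expires. A minimal sketch of settling on every path, reusing redis and adjustScript from Step 3; refunding the full reservation on error is an assumed policy, not the only option:

// Sketch: guarantee settlement whether the call succeeds or fails.
async function settle(key: string, delta: number): Promise<void> {
  await redis.eval(adjustScript, 1, key, delta, 12000, 120);
}

async function callWithSettlement(
  key: string,
  estimatedCost: number,
  call: () => Promise<{ totalTokens: number }>
): Promise<{ totalTokens: number }> {
  try {
    const result = await call();
    await settle(key, result.totalTokens - estimatedCost); // debit or refund delta
    return result;
  } catch (err) {
    await settle(key, -estimatedCost); // refund the whole reservation
    throw err;
  }
}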

What's next

  • Add per-tenant dashboards for admitted cost, refunded cost, reject rate, and upstream 429s.
  • Introduce separate buckets for interactive, batch, and tool-calling workloads.
  • Replace heuristic token estimates with your tokenizer or the provider's input token counting endpoint where available.
  • Layer circuit breakers on top so repeated upstream throttling automatically lowers local refill rates.
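
The circuit-breaker bullet can start very small: lower the local refill rate multiplicatively whenever the provider throttles you, and recover slowly on success. A per-process sketch; the decay and recovery constants are assumptions to tune, and in production this state belongs in Redis next to the bucket:

// Sketch: AIMD-style refill controller. Feed refillPerSecond() into the
// reserve script instead of the hardcoded 2000 used earlier.
const BASE_REFILL_PER_SECOND = 2000;
let currentRefill = BASE_REFILL_PER_SECOND;

export function onUpstream429(): void {
  currentRefill = Math.max(100, currentRefill * 0.5); // back off hard
}

export function onUpstreamSuccess(): void {
  currentRefill = Math.min(BASE_REFILL_PER_SECOND, currentRefill + 10); // recover slowly
}

export function refillPerSecond(): number {
  return currentRefill;
}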

The main lesson is simple: adaptive rate limiting is not a different gate, it is a tighter feedback loop. Once you reserve on estimated cost and settle on actual usage, the rest of the system becomes much easier to reason about.

Frequently Asked Questions

How do you rate limit AI requests when each call uses a different number of tokens?
Use a weighted limiter instead of a request counter. Reserve budget from a shared bucket using an estimate based on input_tokens, model weight, and max_output_tokens, then reconcile that estimate against actual usage.total_tokens after the response returns.
Should I rate limit on requests per minute or tokens per minute for inference APIs?
For variable-cost inference, tokens per minute is the better primary control signal because it tracks actual provider pressure. Keep request limits as a secondary guardrail for abuse and connection churn, but do not rely on them alone.
Why use Redis for adaptive rate limiting?
Redis gives you one shared budget across all application instances and supports atomic updates, which prevents concurrent workers from oversubscribing capacity. A local in-memory limiter can work for a single process, but it breaks down once traffic is distributed horizontally.
What should I do when my estimate is wrong?
Being wrong is expected, which is why reconciliation is part of the design. If actual usage is lower, refund capacity; if it is higher, debit the delta immediately and tune your estimator over time using observed error bands.
