API Rate Limiting Algorithms [2026 Dev Cheat Sheet]

Dillip Chowdary
Tech Entrepreneur & Innovator · April 29, 2026 · 11 min read

Bottom Line

Use Token Bucket for burst-friendly APIs, Leaky Bucket for steady downstream protection, and Fixed Window only when simplicity outweighs edge-case fairness.

Key Takeaways

  • Token Bucket allows short bursts while keeping a hard long-term refill rate.
  • Leaky Bucket smooths traffic best, but it can add queueing latency under load.
  • Fixed Window is easiest to ship, but boundary spikes can double effective traffic.
  • Distributed rate limits fail on identity, clock, and storage design long before the math does.

API rate limiting looks simple until real traffic shows up: bursts, retries, uneven tenants, and horizontally scaled gateways all stress the model. The three algorithms most teams compare are Token Bucket, Leaky Bucket, and Fixed Window. This cheat sheet is built for fast engineering decisions, with copy-ready commands, practical configuration patterns, and the tradeoffs that matter when you are protecting APIs instead of explaining them on a whiteboard.

  • Token Bucket is the usual default for public APIs because it tolerates bursts without losing long-term control.
  • Leaky Bucket wins when you need smooth output rates for workers, queues, or fragile backends.
  • Fixed Window is operationally simple, but window boundaries can create unfair spikes.
  • Most production incidents come from bad key design, unsynced counters, or weak observability, not from the algorithm name.
| Dimension | Token Bucket | Leaky Bucket | Fixed Window | Edge |
| --- | --- | --- | --- | --- |
| Burst tolerance | High, up to bucket capacity | Low to medium | High near reset boundaries | Token Bucket |
| Output smoothness | Moderate | High, steady drain | Low | Leaky Bucket |
| Implementation simplicity | Moderate | Moderate | High | Fixed Window |
| Fairness over time | Good | Good | Weak at window edges | Token Bucket / Leaky Bucket |
| Latency behavior | Reject or delay on empty bucket | Often queues first, then rejects | Hard allow or reject | Depends on product goal |
| Best fit | Public APIs, tiered quotas | Traffic shaping, worker protection | Simple per-minute caps | No single winner |

At a Glance

Token Bucket is the default choice for most API gateways. Use Leaky Bucket when backend stability matters more than burst tolerance, and keep Fixed Window for simpler limits where boundary unfairness is acceptable.

Mental model

  • Token Bucket: tokens refill at a steady rate; each request spends one or more tokens.
  • Leaky Bucket: incoming requests enter a bucket that drains at a fixed rate.
  • Fixed Window: count requests in a time bucket such as 1 minute, then reset the counter.
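The token-bucket model above fits in a few lines of code. This is a minimal sketch, not any particular gateway's implementation; the class name, parameters, and the explicit `now` argument (which makes refill math testable) are illustrative choices.

```javascript
// Minimal token-bucket sketch: `capacity` caps the burst, `refillPerSec`
// sets the long-term rate. Illustrative names, not a real library API.
class TokenBucket {
  constructor(capacity, refillPerSec) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.tokens = capacity;          // start full so an initial burst is allowed
    this.lastRefill = Date.now();
  }

  refill(now = Date.now()) {
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastRefill = now;
  }

  tryRemove(cost = 1, now = Date.now()) {
    this.refill(now);
    if (this.tokens >= cost) {
      this.tokens -= cost;
      return true;                   // request allowed
    }
    return false;                    // over budget: reject or delay
  }
}

// A bucket of capacity 3 with a 1 token/s refill absorbs a 3-request
// burst immediately, then rejects until time passes and tokens refill.
const bucket = new TokenBucket(3, 1);
console.log([1, 2, 3, 4].map(() => bucket.tryRemove()));
```

Note that capacity and refill rate are independent knobs: the first bounds the spike a client can send at once, the second bounds what they can sustain.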

What teams usually get wrong

  • They rate limit by IP when the real fairness unit is tenant, API key, user, or route.
  • They use a global limit but forget expensive endpoints need tighter per-route controls.
  • They ship counters without exposing Retry-After and remaining-budget headers.
  • They pick a simple algorithm, then quietly rebuild it later to handle distributed state.

When to Choose Each Algorithm

Choose Token Bucket when:

  • Your API should absorb short spikes such as page loads, retries, or batch fan-out.
  • You sell usage tiers and want a clear refill rate plus a clear burst allowance.
  • You want better fairness than a simple window without building a more complex sliding model.

Choose Leaky Bucket when:

  • Your backend, queue consumer, or third-party dependency needs a steady request drain.
  • You prefer smoothing traffic over maximizing user-visible burst capacity.
  • You are comfortable with queueing behavior and possible added latency under load.

Choose Fixed Window when:

  • You need the fastest implementation for a low-risk internal or admin API.
  • Your traffic pattern is predictable enough that boundary spikes are acceptable.
  • You plan to graduate later to Token Bucket or a sliding approach once usage grows.
Watch out: Fixed Window can effectively allow nearly 2x the intended rate across a window boundary, such as 100 requests in the last second of one minute and 100 more in the first second of the next.
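The boundary effect is easy to reproduce with a toy fixed-window counter. This sketch identifies the window by flooring the timestamp; the function name and single-key scope are illustrative simplifications.

```javascript
// Toy fixed-window counter: allows `limit` requests per window, where the
// window is identified by flooring the timestamp. Illustrative sketch only.
function makeFixedWindow(limit, windowMs) {
  let windowId = null;
  let count = 0;
  return (nowMs) => {
    const id = Math.floor(nowMs / windowMs);
    if (id !== windowId) {        // new window: counter resets to zero
      windowId = id;
      count = 0;
    }
    if (count < limit) {
      count += 1;
      return true;
    }
    return false;
  };
}

// 100/minute limit: 100 requests at t=59.9s and 100 more at t=60.1s all
// pass, so ~200 requests land inside a 0.2-second span.
const allow = makeFixedWindow(100, 60000);
let passed = 0;
for (let i = 0; i < 100; i++) if (allow(59900)) passed += 1;
for (let i = 0; i < 100; i++) if (allow(60100)) passed += 1;
console.log(passed); // 200
```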

Live Search and Shortcuts

For a cheat-sheet style page, a lightweight client-side filter is enough. Tag each block with data-cheat-item, let users press / to focus search, and keep the results purely local for instant filtering.

Live search filter

const input = document.querySelector('[data-filter-input]');
const items = [...document.querySelectorAll('[data-cheat-item]')];

function applyFilter(query) {
  const q = query.trim().toLowerCase();

  items.forEach((item) => {
    const haystack = item.textContent.toLowerCase();
    const match = q === '' || haystack.includes(q);
    item.hidden = !match;
  });

  const url = new URL(window.location.href);
  if (q) url.searchParams.set('q', q);
  else url.searchParams.delete('q');
  history.replaceState(null, '', url);
}

input.addEventListener('input', (event) => {
  applyFilter(event.target.value);
});

window.addEventListener('keydown', (event) => {
  // Ignore "/" typed inside the search field itself, so users can
  // still search for paths and regex patterns containing slashes.
  if (event.key === '/' && document.activeElement !== input) {
    event.preventDefault();
    input.focus();
  }
});

const params = new URL(window.location.href).searchParams;
const initialQuery = params.get('q') || '';
input.value = initialQuery;
applyFilter(initialQuery);

Keyboard shortcuts

| Shortcut | Action | Why it matters |
| --- | --- | --- |
| / | Focus search | Fastest way to jump to a command, header, or algorithm note. |
| Esc | Clear focus | Lets users return to reading without touching the mouse. |
| j / k | Next or previous section | Useful for dense reference content with a sticky table of contents. |
| c | Copy active code block | Pairs well with automatic copy buttons on every <pre>. |

Commands by Purpose

These commands are for verification, not synthetic benchmarks. Use curl's -i to print response headers inline, -D - to dump headers to stdout, and xargs -P to run burst requests in parallel.

Inspect rate-limit headers

curl -i https://api.example.com/v1/search

curl -s -D - -o /dev/null https://api.example.com/v1/search \
  | grep -iE 'x-ratelimit|retry-after'

Simulate a burst

seq 1 20 | xargs -P 20 -I{} curl -s -o /dev/null -w '%{http_code}\n' \
  https://api.example.com/v1/search | sort | uniq -c

Watch refill behavior over time

for n in 1 2 3 4 5 6 7 8 9 10; do
  date -u '+%H:%M:%S'
  curl -s -D - -o /dev/null https://api.example.com/v1/search \
    | grep -iE 'x-ratelimit-remaining|retry-after'
  sleep 1
done

Compare multiple identities

curl -s -H 'Authorization: Bearer tenant-a-key' -D - -o /dev/null \
  https://api.example.com/v1/search

curl -s -H 'Authorization: Bearer tenant-b-key' -D - -o /dev/null \
  https://api.example.com/v1/search

Configuration Patterns

Keep configs explicit about four things: identity key, refill or drain behavior, burst allowance, and response headers. If those are vague, operators will reverse-engineer policy from production logs.

Token Bucket baseline

{
  "algorithm": "token_bucket",
  "key": "tenant_id:route",
  "capacity": 120,
  "refill_tokens": 2,
  "refill_interval": "1s",
  "cost_by_method": {
    "GET": 1,
    "POST": 2
  },
  "headers": true
}
  • Map capacity to tolerated burst size.
  • Map refill values to the steady-state rate you want to sell or enforce.
  • Use request cost when some routes are materially more expensive than others.
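Request cost is applied at admission time: the limiter deducts the route's or method's cost instead of a flat 1. A self-contained sketch of that idea, using the cost_by_method table from the config above (the function name and default cost are illustrative assumptions):

```javascript
// Sketch of cost-weighted admission mirroring cost_by_method above.
// `state.tokens` is the bucket's current balance; names are illustrative.
const costByMethod = { GET: 1, POST: 2 };

function admit(state, method) {
  const cost = costByMethod[method] ?? 1;   // assumed default for unlisted methods
  if (state.tokens >= cost) {
    state.tokens -= cost;
    return true;
  }
  return false;
}

// With 3 tokens, one POST (cost 2) plus one GET (cost 1) fits; a second GET does not.
const state = { tokens: 3 };
console.log(admit(state, 'POST'), admit(state, 'GET'), admit(state, 'GET'));
// → true true false
```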

Leaky Bucket baseline

{
  "algorithm": "leaky_bucket",
  "key": "api_key",
  "queue_capacity": 200,
  "drain_rate": 25,
  "drain_interval": "1s",
  "overflow_policy": "reject"
}
  • Use queue capacity to cap backlog growth before latency becomes product debt.
  • Pick reject or delay based on whether clients can tolerate waiting.
  • Drain rate should reflect backend safety, not just gateway throughput.
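The queue-plus-drain behavior in the config above can be sketched as follows. This is a minimal in-memory model assuming overflow_policy "reject" and an external timer driving the drain step; the class and method names are illustrative.

```javascript
// Leaky-bucket sketch: arrivals join a queue capped at `queueCapacity`;
// a periodic drain step releases up to `drainRate` requests per interval.
class LeakyBucket {
  constructor(queueCapacity, drainRate) {
    this.queueCapacity = queueCapacity;
    this.drainRate = drainRate;
    this.queue = [];
  }

  offer(request) {
    // overflow_policy "reject": drop new arrivals once the backlog is full
    if (this.queue.length >= this.queueCapacity) return false;
    this.queue.push(request);
    return true;
  }

  drain() {
    // Called once per drain_interval: steady output regardless of input spikes.
    return this.queue.splice(0, this.drainRate);
  }
}

// Capacity 3, drain 2: a burst of 5 keeps 3, rejects 2, then drains 2 per tick.
const lb = new LeakyBucket(3, 2);
console.log([1, 2, 3, 4, 5].filter((r) => lb.offer(r))); // [1, 2, 3]
console.log(lb.drain()); // [1, 2]
console.log(lb.drain()); // [3]
```

The latency tradeoff is visible here: request 3 is accepted immediately but not released until the second drain tick.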

Fixed Window baseline

{
  "algorithm": "fixed_window",
  "key": "user_id",
  "limit": 100,
  "window": "1m",
  "headers": true,
  "reset_header": true
}
  • Use per-route overrides for login, search, export, and webhook endpoints.
  • Keep window sizes human-readable for support teams and customer-facing docs.
  • If traffic is spiky, plan a migration path before customers discover edge bursts.

Advanced Usage and Gotchas

Layer limits instead of forcing one global rule

  • Apply a global tenant limit, then narrower route limits for expensive endpoints.
  • Separate write-heavy limits from read-heavy limits when backend cost differs sharply.
  • Use concurrency limits alongside rate limits for long-running operations.
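A concurrency cap is a different primitive from a rate limit: it bounds in-flight work rather than arrival rate, which is what protects backends from slow, long-running calls. A minimal sketch, with illustrative names:

```javascript
// Concurrency limiter: at most `max` operations in flight at once.
// Complements (does not replace) a rate limit. Illustrative sketch.
function makeConcurrencyLimiter(max) {
  let inFlight = 0;
  return {
    tryAcquire() {
      if (inFlight >= max) return false;  // at capacity: shed or queue the call
      inFlight += 1;
      return true;
    },
    release() {
      inFlight = Math.max(0, inFlight - 1);
    },
  };
}

// Two slots: the third acquire fails until one caller releases.
const limiter = makeConcurrencyLimiter(2);
console.log(limiter.tryAcquire()); // true
console.log(limiter.tryAcquire()); // true
console.log(limiter.tryAcquire()); // false
limiter.release();
console.log(limiter.tryAcquire()); // true
```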

Design identities before counters

  • Prefer stable business identities such as tenant ID or API key over raw IP address.
  • Store enough metadata to debug collisions, but redact sensitive values in logs and dashboards.
  • If you need to sanitize exported traces or support samples, the Data Masking Tool is a practical fit for scrubbing tokens, emails, and customer identifiers.
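Composite keys like the tenant_id:route pattern from the earlier config are best built in one place, so every limiter layer agrees on the fairness unit. A sketch of that idea; the function name, field names, and fallback order are illustrative assumptions:

```javascript
// Build a stable limiter key from business identity, not connection details.
// Field names and the fallback chain here are illustrative assumptions.
function rateLimitKey({ tenantId, apiKey, ip }, route) {
  // Prefer stable business identity; fall back to IP only as a coarse last resort.
  const identity = tenantId ?? apiKey ?? ip;
  return `${identity}:${route}`;
}

console.log(rateLimitKey({ tenantId: 'acme' }, 'GET /v1/search'));
// → "acme:GET /v1/search"
console.log(rateLimitKey({ ip: '203.0.113.7' }, 'GET /v1/search'));
// → "203.0.113.7:GET /v1/search"
```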

Distributed systems change the failure mode

  • Cross-node counters introduce consistency tradeoffs that do not exist in single-process demos.
  • Clock drift matters when resets or refill schedules are time-based.
  • Hot keys appear fast on shared infrastructure, especially when one tenant dominates traffic.

Expose policy clearly to clients

  • Return common headers such as current limit, remaining budget, reset time, and Retry-After.
  • Document whether rejected requests are safe to retry and how backoff should work.
  • Keep examples synchronized with production behavior or customers will code around the docs.
Pro tip: If your product has both interactive users and batch clients, ship different limits for each class instead of pretending one number is fair to both.
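Turning limiter state into those headers can be centralized in one small function. The X-RateLimit-* names below follow a common convention rather than a finalized standard, and exact names vary by gateway, so treat this as an illustrative shape:

```javascript
// Translate limiter state into response headers clients can act on.
// Header names follow the common X-RateLimit-* convention; adjust to your gateway.
function rateLimitHeaders({ limit, remaining, resetEpochSec }, nowEpochSec) {
  const headers = {
    'X-RateLimit-Limit': String(limit),
    'X-RateLimit-Remaining': String(Math.max(0, remaining)),
    'X-RateLimit-Reset': String(resetEpochSec),
  };
  if (remaining <= 0) {
    // Tell rejected clients exactly how long to back off before retrying.
    headers['Retry-After'] = String(Math.max(0, resetEpochSec - nowEpochSec));
  }
  return headers;
}

// A rejected request 60 seconds before reset gets Retry-After: "60".
console.log(rateLimitHeaders({ limit: 100, remaining: 0, resetEpochSec: 1760000060 }, 1760000000));
```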

Frequently Asked Questions

What is the best rate limiting algorithm for a public API?
For most public APIs, Token Bucket is the safest default because it supports short bursts while still enforcing a predictable long-term rate. It is easier to explain to customers than queue-based shaping, and it avoids the worst fairness problems of Fixed Window.
Why is Fixed Window considered unfair at scale?
Fixed Window resets counters on hard boundaries, so clients can send a full limit at the end of one window and another full limit at the start of the next. That edge case can create a short spike far above the intended steady rate, which is why boundary-sensitive systems often move away from it.
When should I use Leaky Bucket instead of Token Bucket?
Use Leaky Bucket when downstream services need smooth, predictable throughput more than clients need burst flexibility. It is especially useful when your system would rather queue or steadily drain work than let request spikes hit a fragile dependency.
Should API rate limits be keyed by IP address or API key?
Key by the identity that matches product fairness, usually tenant_id, user_id, or API key rather than raw IP address. IP-based limiting is still useful as a coarse abuse control, but it is a weak primary key for shared networks, NAT, proxies, and enterprise traffic.
