Every API you depend on — OpenAI, Anthropic, Stripe, GitHub — enforces a rate limit. The hard part isn't the algorithm (fixed window, token bucket, sliding window all work); it's coordination. Per-worker limits multiply by worker count and turn throttling into self-inflicted retry storms. The pattern that scales: one shared budget in front of each upstream, Retry-After respected as a floor with jitter on top, and a redelivery model that rides out a sustained limit without burning attempts.
If your backend calls an external API, you will eventually be rate limited. It rarely arrives as a clean failure — it shows up as a slow trickle of 429s that becomes a flood the moment you scale out. The instinct is to add a retry. The problem is that naive retries, spread across a worker fleet, make the throttling worse, not better. This post is about the strategies that actually hold up: how the common algorithms behave, why coordination beats cleverness, and how to keep a whole fleet under a ceiling regardless of how many workers you run.
Rate limits are not an LLM problem
LLM APIs made rate limiting a daily concern for a lot of teams, but they didn't invent the problem. OpenAI enforces requests-per-minute and tokens-per-minute per organization. Anthropic does the same on its Messages API, with separate input and output token budgets. But the moment your backend fans work out to any third party, you inherit that party's limits:
- OpenAI / Anthropic — RPM and TPM ceilings per org, per model. A bulk summarization job hits these fast.
- Stripe — request-rate limits per account; bulk reconciliation or migration scripts trip them routinely.
- GitHub, Shopify, Salesforce — per-token or per-app quotas, often with hourly windows.
- Internal services — your own downstream APIs have finite capacity too, even if no one wrote down a number.
The shape is always the same: a finite budget, a window, and a 429 (or 503/529) when you exceed it. So the strategy is the same too, whether the upstream is gpt-4o-mini, claude-sonnet-4-6, or a payments API.
Why per-worker limits multiply into storms
Here is the single most common mistake. You read that the upstream allows 100 requests per minute, so you configure your worker to allow 100 requests per minute. Then you scale to 8 workers for throughput. Each worker is individually compliant — and your fleet is now sending up to 800 requests per minute at an upstream that allows 100.
The throttling that follows is bad enough, but the recovery is worse. When the window resets, all 8 workers — which backed off independently — wake up around the same time, resume at full rate, and immediately re-saturate the limit. You haven't smoothed your traffic; you've turned a steady stream into a sawtooth of bursts and 429s. The math is unforgiving:
| Workers | Per-worker limit | Effective fleet rate | Upstream ceiling | Result |
|---|---|---|---|---|
| 1 | 100/min | 100/min | 100/min | Compliant |
| 4 | 100/min | 400/min | 100/min | 4x over — constant 429s |
| 8 | 100/min | 800/min | 100/min | 8x over — retry storm |
| 8 | 12/min (100/8) | 96/min | 100/min | Compliant, but breaks on rescale |
Dividing the limit by worker count (the last row) technically works until the moment an autoscaler adds a worker, a deploy doubles your pods during a rollout, or one worker restarts and briefly overlaps with its replacement. The limit has to be expressed once, for the whole fleet — not once per worker.
Any per-worker limit assumes you know the worker count at config time. Autoscalers, blue-green deploys, and crash-restarts all change that count without telling your rate limiter. A budget that's correct at 4 workers silently becomes 2x over the ceiling the instant you scale to 8.
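To make the rescale failure concrete, the arithmetic behind the table can be run for a few fleet sizes. This is only an illustrative sketch: the function name is made up, and the 10-worker and 16-worker scenarios are assumed examples of an autoscale event and a rollout that doubles the pods.

```python
def fleet_rate(workers: int, per_worker_limit: int) -> int:
    """Worst-case requests per minute when every worker enforces its own limit."""
    return workers * per_worker_limit


UPSTREAM_CEILING = 100  # requests per minute, as in the table above

# Static split computed for the 8 workers that existed at config time.
per_worker = UPSTREAM_CEILING // 8  # 12

# Steady state, after an autoscale event, and mid-rollout with pods doubled.
for workers in (8, 10, 16):
    rate = fleet_rate(workers, per_worker)
    status = "ok" if rate <= UPSTREAM_CEILING else "over"
    print(f"{workers:>2} workers x {per_worker}/min = {rate}/min ({status})")
```

The split that was safe at 8 workers (96/min) is over the ceiling at 10 (120/min) and nearly double it at 16 (192/min), with no config change in between.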
Fixed window vs token bucket vs sliding window
Once you've decided the budget is shared, you pick how to enforce it. Three algorithms dominate, and each has a distinct failure mode:
| Strategy | How it works | Strength | Weakness |
|---|---|---|---|
| Fixed window | Count requests in a discrete interval (e.g. per 60s); reset at the boundary. | Trivial to implement; cheap state (one counter). | Allows up to 2x burst across a window boundary. |
| Token bucket | Refill tokens at a steady rate up to a max; each request spends one. | Smooths sustained throughput; allows controlled bursts. | Two parameters (rate + bucket size) to tune correctly. |
| Sliding window | Track timestamps over a rolling interval; count what's still inside it. | No boundary-burst problem; most accurate. | More state and computation per request. |
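Of the three rows, the token bucket is the one whose tuning is least obvious from a one-line description, so here is a minimal in-process sketch. The class and parameter names are illustrative, and the injectable `clock` exists only to make the behavior easy to test.

```python
import time


class TokenBucket:
    """Refills at `rate` tokens per second up to `capacity`; each request spends one."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity  # start full, so an initial burst is allowed
        self._last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Credit tokens earned since the last call, capped at the bucket size.
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


# About 100 requests per minute sustained, with bursts of up to 10.
bucket = TokenBucket(rate=100 / 60, capacity=10)
```

The two parameters interact: a bucket that starts full can admit `capacity` requests at once and then another `rate * 60` over the following minute, so with the values above the worst case over any 60 seconds is roughly 110, not 100. Size both against the upstream ceiling.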
The boundary-burst problem in fixed window is worth understanding because it surprises people. If your window is one minute and your limit is 100, a client can send 100 requests at 11:59:59 and another 100 at 12:00:01 — 200 requests in two seconds, both windows technically compliant. Token bucket and sliding window don't have this gap.
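The boundary burst is easy to reproduce. The sketch below uses two minimal in-memory limiters (illustrative classes, not any particular library) and replays the scenario from the paragraph above, with timestamps passed in as plain seconds.

```python
from collections import deque


class FixedWindow:
    """One counter per discrete window; the count resets at each boundary."""

    def __init__(self, limit: int, window: float):
        self.limit, self.window = limit, window
        self._window_id = None
        self._count = 0

    def allow(self, now: float) -> bool:
        window_id = int(now // self.window)
        if window_id != self._window_id:  # crossed a boundary: start a fresh count
            self._window_id, self._count = window_id, 0
        if self._count < self.limit:
            self._count += 1
            return True
        return False


class SlidingWindow:
    """Keeps timestamps of admitted requests from the last `window` seconds."""

    def __init__(self, limit: int, window: float):
        self.limit, self.window = limit, window
        self._stamps = deque()

    def allow(self, now: float) -> bool:
        while self._stamps and self._stamps[0] <= now - self.window:
            self._stamps.popleft()  # drop requests that have aged out
        if len(self._stamps) < self.limit:
            self._stamps.append(now)
            return True
        return False


def admitted(limiter, times) -> int:
    """Count how many of the requests at the given times are let through."""
    return sum(limiter.allow(t) for t in times)


# 100 requests one second before a minute boundary, 100 more one second after it.
burst = [59.0] * 100 + [61.0] * 100
print(admitted(FixedWindow(100, 60), burst))    # 200
print(admitted(SlidingWindow(100, 60), burst))  # 100
```

The fixed window admits all 200 because the two halves land in different windows. The sliding window rejects the second half because the first 100 are still inside the rolling interval.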
In practice, the algorithm matters less than people expect. A fixed-window limiter can admit up to twice its configured limit across a boundary, so it needs headroom: set it conservatively below the upstream ceiling (half the ceiling covers the worst case) and even the boundary burst stays under the real limit. What actually determines whether you get throttled is whether the budget is shared across the fleet, not which of these three algorithms counts the requests.
Respect Retry-After, then add jitter
When an upstream throttles you, it usually tells you exactly when to come back. OpenAI returns retry-after-ms; Anthropic and most others return a Retry-After header in seconds; Stripe returns 429s you should back off on. This header is not a suggestion — it's the upstream's own scheduler telling you when capacity returns.
1. Read Retry-After and treat it as a floor. Never retry sooner. Retrying early keeps you throttled longer and wastes an attempt.
2. Add jitter on top. If 50 requests were all throttled in the same window, they all received the same Retry-After. Without jitter, they resume on the same millisecond and re-trigger the storm.
3. Fall back to exponential backoff only when there's no header. Some upstreams 503 without a Retry-After; that's when your own backoff-with-jitter takes over.
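The three steps above can be sketched as a single delay function. Treat this as a sketch under assumptions: the function names, the 5-second jitter range, and the backoff base and cap are illustrative defaults, and the fallback uses "equal jitter" (half fixed, half random), which is one common variant rather than the only one. The HTTP-date form of Retry-After is not parsed here.

```python
import random
from typing import Mapping, Optional


def parse_retry_after(headers: Mapping[str, str]) -> Optional[float]:
    """Return the wait the upstream asked for, in seconds, or None if unavailable."""
    lowered = {name.lower(): value for name, value in headers.items()}
    # Prefer the millisecond header when present, then the standard one in seconds.
    for name, scale in (("retry-after-ms", 1000.0), ("retry-after", 1.0)):
        try:
            return max(0.0, float(lowered[name]) / scale)
        except (KeyError, ValueError):
            continue  # header missing, or an HTTP-date, which this sketch skips
    return None


def retry_delay(headers, attempt, base=1.0, cap=60.0, max_jitter=5.0, rng=random.random):
    """Seconds to wait before the next try. `attempt` counts from 0."""
    floor = parse_retry_after(headers)
    if floor is not None:
        # Header present: never go below it. Jitter is only ever added on top.
        return floor + rng() * max_jitter
    # No header: exponential backoff, capped, with half of it randomized.
    backoff = min(cap, base * 2 ** attempt)
    return backoff / 2 + rng() * backoff / 2


print(retry_delay({"Retry-After": "30"}, attempt=0))  # somewhere in [30, 35)
print(retry_delay({}, attempt=3))                     # somewhere in [4, 8)
```

The jitter range is worth sizing against how many requests tend to get throttled together: the more that share one Retry-After, the wider the spread needed to keep them from resuming at once.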
There's a deeper point here about classification. A 429 or 503 means 'come back later' — the work is still valid. A 400 or 401 means 'this will never succeed' — retrying is a bug. Treat backpressure (429/503/529) as a reason to wait, not a reason to fail, and treat client errors as permanent. Mixing them up is how you get either infinite retry loops or jobs that vanish on the first hiccup.
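That classification fits in a small function. The sketch below follows the split described above for 429/503/529 and for client errors. Two parts are assumptions beyond the text: other 5xx responses are treated as transient failures worth a retry, and every other 4xx is treated as permanent (some APIs document retryable 4xx codes such as 408, so check your upstream).

```python
from enum import Enum


class Action(Enum):
    OK = "ok"        # the call succeeded
    WAIT = "wait"    # backpressure: the work is still valid, come back later
    RETRY = "retry"  # transient failure: retry with backoff
    FAIL = "fail"    # permanent: retrying cannot succeed


BACKPRESSURE = frozenset({429, 503, 529})


def classify(status: int) -> Action:
    """Map an upstream HTTP status to what the worker should do next."""
    if 200 <= status < 300:
        return Action.OK
    if status in BACKPRESSURE:
        return Action.WAIT
    if 400 <= status < 500:
        return Action.FAIL  # 400, 401 and friends: fix the request, don't resend it
    return Action.RETRY     # assumption: remaining 5xx are transient


for code in (200, 429, 529, 400, 401, 500):
    print(code, classify(code).value)
```

Keeping this in one place means the retry loop never has to guess: only `WAIT` and `RETRY` ever schedule another call, and only `RETRY` should count against an attempt cap.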
Coordinating across a fleet
Sharing a budget across many workers means putting the limiter somewhere all of them can see. There are two classic ways to do it, and one managed shortcut:
- Shared bucket in a central store. All workers consume from one Redis-backed counter or bucket keyed by upstream (e.g. anthropic-prod). Correct, but you own the store, the atomic decrement logic, the Retry-After pausing, and the failure modes when Redis itself is slow.
- Single dispatcher. One process owns the budget and hands work to a worker pool that never calls the upstream directly. Simpler to reason about, but the dispatcher is now a bottleneck and a single point of failure.
- A queue with a built-in per-queue limit. Hand the budget to the transport layer and stop coordinating in application code entirely.
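For the first option, a minimal sketch of a shared fixed-window counter looks like the following. It is written against any store with atomic `incr` and `expire` calls; redis-py's client exposes methods with those names, but the `InMemoryStore` here is only a stand-in so the example runs on its own. The key format and function name are illustrative, and the sketch assumes worker clocks are roughly in sync.

```python
import time


class InMemoryStore:
    """Stand-in for a shared store; a real deployment would pass a Redis client."""

    def __init__(self):
        self._data = {}

    def incr(self, key: str) -> int:
        self._data[key] = self._data.get(key, 0) + 1
        return self._data[key]

    def expire(self, key: str, seconds: int) -> None:
        pass  # a real store would drop the key after `seconds`


def try_acquire(store, upstream: str, limit: int, window: int, now=None) -> bool:
    """Spend one unit of the fleet-wide budget; True means the call may proceed."""
    now = time.time() if now is None else now
    # The window number is part of the key, so every worker agrees on which
    # counter is current without coordinating beyond the shared store.
    key = f"ratelimit:{upstream}:{int(now // window)}"
    count = store.incr(key)  # a single atomic increment in Redis
    if count == 1:
        store.expire(key, window * 2)  # cleanup only; correctness comes from the key
    return count <= limit


store = InMemoryStore()
# Eight workers each attempt 100 calls in the same minute against one counter.
granted = sum(
    try_acquire(store, "anthropic-prod", limit=100, window=60, now=30.0)
    for _worker in range(8)
    for _call in range(100)
)
print(granted)  # 100
```

However many workers call `try_acquire`, the fleet as a whole is granted 100 calls per window. What this sketch leaves out is exactly the ownership cost listed above: pausing on Retry-After, and deciding what workers do when the store is slow or unreachable.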
The third option is what SimpleQ does. You create a queue with a fixed-window rate limit (rateLimitMax over rateLimitWindow), and every job in that queue counts against the same window — regardless of how many workers consume from it. Scaling from 2 workers to 20 doesn't change the budget, because the budget lives with the queue, not the worker.
```bash
# One shared budget for an upstream API, enforced regardless of worker count.
curl -X POST https://api.simpleq.io/v1/queues \
  -H "Authorization: Bearer sq_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "anthropic-jobs",
    "rateLimitMax": 50,
    "rateLimitWindow": 60,
    "maxAttempts": 8,
    "backoffType": "exponential",
    "backoffDelay": 2
  }'

# Enqueue work. Workers receive it via your webhook, all under the same 50/60s budget.
curl -X POST https://api.simpleq.io/v1/queues/anthropic-jobs/jobs \
  -H "Authorization: Bearer sq_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "payload": {
      "model": "claude-sonnet-4-6",
      "input": "Summarize this support ticket..."
    },
    "idempotencyKey": "ticket-9281-summary"
  }'
```

SimpleQ is push-based: it durably stores the job and POSTs it to your own worker endpoint, where you run the actual API call. The official TypeScript SDK is @simpleq/sdk, and because the API is HTTP-first, any language works. There are queue templates for anthropic and openai that come pre-tuned for those providers' limits.
Riding out a sustained limit without burning attempts
Even a perfectly-sized shared budget can hit a wall when an upstream tightens capacity or you're processing a large backlog. The naive response is to count every throttle as a failed attempt — which means a sustained rate limit slowly eats through your maxAttempts cap and dead-letters jobs that would have succeeded if they'd just waited.
The better model separates failure from backpressure. When your worker sees a 429/503/529 from the upstream, it shouldn't report a failure — it should signal backpressure and let the transport redeliver later. In SimpleQ's three-signal ack protocol, that's a defer with a retryAfter: the job is rescheduled, and no attempt is burned. A job can ride out a sustained rate limit through many defers and still complete on the first real attempt once capacity returns.
| Signal | Meaning | Attempt burned? |
|---|---|---|
| ack | Work succeeded. | — |
| nack (retryable) | Real failure; retry with backoff. | Yes |
| nack (non-retryable) | Permanent failure; send to DLQ. | Yes (terminal) |
| defer (retryAfter) | Backpressure / rate limited; reschedule. | No |
When the upstream returns Retry-After: 30, defer the job with retryAfter: 30. You're forwarding the upstream's own scheduling decision into the transport layer — the job comes back exactly when capacity does, and the attempt counter is untouched.
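The difference between the two accounting models shows up clearly in a small simulation. This is a toy model of attempt counting, not SimpleQ's implementation: `run_job` is a hypothetical helper that replays a list of upstream status codes against one job, and for simplicity it treats every non-2xx response that isn't deferred as a retryable failure.

```python
BACKPRESSURE = (429, 503, 529)


def run_job(responses, max_attempts, defer_on_backpressure):
    """Replay upstream status codes for one job; return (outcome, attempts_used)."""
    attempts = 0
    for status in responses:
        if defer_on_backpressure and status in BACKPRESSURE:
            continue  # defer: the job is rescheduled, the counter is untouched
        attempts += 1
        if 200 <= status < 300:
            return "completed", attempts
        if attempts >= max_attempts:
            return "dead-lettered", attempts
    return "pending", attempts


# A sustained limit: twenty throttled deliveries before capacity returns.
throttled = [429] * 20 + [200]

print(run_job(throttled, max_attempts=8, defer_on_backpressure=False))
# ('dead-lettered', 8)
print(run_job(throttled, max_attempts=8, defer_on_backpressure=True))
# ('completed', 1)
```

With throttles counted as failures, the job is dead-lettered on the eighth 429 and never sees the 200. With throttles deferred, the same job waits through all twenty and completes on its first counted attempt.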
Putting it together
A rate-limiting strategy that survives production isn't one algorithm — it's a small stack of decisions:
1. One shared budget per upstream, sized below the real ceiling, expressed once for the whole fleet.
2. Retry-After respected as a floor, with jitter layered on so a throttled fleet doesn't resume in lockstep.
3. Clean classification: backpressure (429/503/529) waits; client errors (400/401) fail fast and don't retry.
4. Backpressure that doesn't burn attempts, so a sustained limit delays work instead of dead-lettering it.
5. Idempotency at the publish boundary, so retries and redeliveries don't double-charge or double-write.
Build it yourself with Redis and a careful classifier, or hand the budget to the transport. If you want the latter, SimpleQ gives you a per-queue shared rate limit, configurable backoff, a defer signal for backpressure that doesn't cost an attempt, and a dead-letter queue with replay — so your workers just make the call and report the outcome. See the bulk API sync use case for an end-to-end example of keeping a fleet under an upstream ceiling. For provider-specific depth, the companion posts on handling OpenAI rate limits in production and backpressure with 429/503/529 responses go deeper on the mechanics.