How do I handle 429s across multiple workers?

Don't let each worker make its own retry decision against its own local limiter — that's what multiplies into storms. Put one shared budget in front of the upstream, keyed by the API (e.g. openai-prod or stripe-prod), and have every worker draw from it. When the upstream returns a 429 or 503 with Retry-After, pause the shared budget for that duration rather than retrying per worker. SimpleQ does this with a per-queue rate limit: every job in the queue counts against one fixed window no matter how many workers consume it.

Should I respect the Retry-After header or use my own backoff?

Respect Retry-After as a floor, always. It's the upstream telling you exactly when capacity returns; retrying sooner keeps you throttled longer and wastes attempts. Layer jitter on top of it so a fleet that was throttled together doesn't all resume on the same millisecond. Use your own exponential backoff only when there's no Retry-After header to read.

Why do per-worker rate limits cause retry storms?

If you configure each worker to allow, say, 50 requests per minute and you run 10 workers, your effective limit against the upstream is 500 per minute — 10x what you intended. Each worker is individually compliant and the fleet collectively blows the ceiling. When the upstream throttles, every worker backs off independently, then they all resume around the same window reset and re-saturate the limit. The limit has to be expressed once for the whole fleet, not once per worker.

Does rate limiting only matter for LLM APIs like OpenAI and Anthropic?

No. LLM APIs make it visible because their limits (requests-per-minute and tokens-per-minute) are tight relative to demand, but Stripe, Shopify, GitHub, search APIs, and internal services all enforce limits too. Any backend that fans work out to a third-party API needs a shared budget and Retry-After handling. The strategy is identical regardless of whether the upstream is an LLM or a payments API.

Rate limiting strategies for API-dependent backends

TL;DR

Every API you depend on — OpenAI, Anthropic, Stripe, GitHub — enforces a rate limit. The hard part isn't the algorithm (fixed window, token bucket, sliding window all work); it's coordination. Per-worker limits multiply by worker count and turn throttling into self-inflicted retry storms. The pattern that scales: one shared budget in front of each upstream, Retry-After respected as a floor with jitter on top, and a redelivery model that rides out a sustained limit without burning attempts.

If your backend calls an external API, you will eventually be rate limited. It rarely arrives as a clean failure — it shows up as a slow trickle of 429s that becomes a flood the moment you scale out. The instinct is to add a retry. The problem is that naive retries, spread across a worker fleet, make the throttling worse, not better. This post is about the strategies that actually hold up: how the common algorithms behave, why coordination beats cleverness, and how to keep a whole fleet under a ceiling regardless of how many workers you run.

Rate limits are not an LLM problem

LLM APIs made rate limiting a daily concern for a lot of teams, but they didn't invent the problem. OpenAI enforces requests-per-minute and tokens-per-minute per organization. Anthropic does the same on its Messages API, with separate input and output token budgets. But the moment your backend fans work out to any third party, you inherit that party's limits:

OpenAI / Anthropic — RPM and TPM ceilings per org, per model. A bulk summarization job hits these fast.
Stripe — request-rate limits per account; bulk reconciliation or migration scripts trip them routinely.
GitHub, Shopify, Salesforce — per-token or per-app quotas, often with hourly windows.
Internal services — your own downstream APIs have finite capacity too, even if no one wrote down a number.

The shape is always the same: a finite budget, a window, and a 429 (or 503/529) when you exceed it. So the strategy is the same too, whether the upstream is gpt-4o-mini, claude-sonnet-4-6, or a payments API.

Why per-worker limits multiply into storms

Here is the single most common mistake. You read that the upstream allows 100 requests per minute, so you configure your worker to allow 100 requests per minute. Then you scale to 8 workers for throughput. Each worker is individually compliant — and your fleet is now sending up to 800 requests per minute at an upstream that allows 100.

The throttling that follows is bad enough, but the recovery is worse. When the window resets, all 8 workers — which backed off independently — wake up around the same time, resume at full rate, and immediately re-saturate the limit. You haven't smoothed your traffic; you've turned a steady stream into a sawtooth of bursts and 429s. The math is unforgiving:

Workers	Per-worker limit	Effective fleet rate	Upstream ceiling	Result
1	100/min	100/min	100/min	Compliant
4	100/min	400/min	100/min	4x over — constant 429s
8	100/min	800/min	100/min	8x over — retry storm
8	12/min (100/8)	96/min	100/min	Compliant, but breaks on rescale

Dividing the limit by worker count (the last row) technically works until the moment an autoscaler adds a worker, a deploy doubles your pods during a rollout, or one worker restarts and briefly overlaps with its replacement. The limit has to be expressed once, for the whole fleet — not once per worker.
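The arithmetic in the table is small enough to check directly. The sketch below is illustrative only; the function names are hypothetical, and the 16-worker case stands in for a rollout that briefly doubles an 8-worker fleet.

```python
def effective_fleet_rate(workers: int, per_worker_limit: int) -> int:
    """Worst-case requests per window the whole fleet can send."""
    return workers * per_worker_limit


def overage_factor(workers: int, per_worker_limit: int, upstream_ceiling: int) -> float:
    """How many times the upstream ceiling the fleet can reach (1.0 = exactly at it)."""
    return effective_fleet_rate(workers, per_worker_limit) / upstream_ceiling


ceiling = 100
divided = ceiling // 8  # the "divide by worker count" workaround, sized for 8 workers

for workers, limit in [(1, 100), (4, 100), (8, 100), (8, divided), (16, divided)]:
    rate = effective_fleet_rate(workers, limit)
    factor = overage_factor(workers, limit, ceiling)
    print(f"{workers} workers x {limit}/min = {rate}/min ({factor:.2f}x the ceiling)")
```

The last case is the one that bites: the divided limit was only correct for the worker count it was computed from.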

Autoscaling makes per-worker limits a time bomb

Any per-worker limit assumes you know the worker count at config time. Autoscalers, blue-green deploys, and crash-restarts all change that count without telling your rate limiter. A budget that's correct at 4 workers silently becomes 2x over the ceiling the instant you scale to 8.

Fixed window vs token bucket vs sliding window

Once you've decided the budget is shared, you pick how to enforce it. Three algorithms dominate, and each has a distinct failure mode:

Strategy	How it works	Strength	Weakness
Fixed window	Count requests in a discrete interval (e.g. per 60s); reset at the boundary.	Trivial to implement; cheap state (one counter).	Allows up to 2x burst across a window boundary.
Token bucket	Refill tokens at a steady rate up to a max; each request spends one.	Smooths sustained throughput; allows controlled bursts.	Two parameters (rate + bucket size) to tune correctly.
Sliding window	Track timestamps over a rolling interval; count what's still inside it.	No boundary-burst problem; most accurate.	More state and computation per request.
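Of the three, token bucket is the one with two parameters to get right, so a minimal sketch may help. This is an in-memory, single-process illustration, not a fleet-wide limiter; the class name, the `try_acquire` method, and the injectable `clock` parameter are all assumptions made for the example.

```python
import time
from typing import Callable


class TokenBucket:
    """Refill at `rate` tokens per second up to `capacity`; each request spends one."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity      # start full, so an initial burst is allowed
        self.updated = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Credit the tokens earned since the last call, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

For a 100-per-minute ceiling, `rate` would be `100 / 60` and `capacity` is the largest burst you are willing to send at once. A small capacity approximates steady pacing; a capacity equal to the full window limit behaves much like a fixed window.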

The boundary-burst problem in fixed window is worth understanding because it surprises people. If your window is one minute and your limit is 100, a client can send 100 requests at 11:59:59 and another 100 at 12:00:01 — 200 requests in two seconds, both windows technically compliant. Token bucket and sliding window don't have this gap.
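The boundary burst is easy to reproduce. The sketch below replays the scenario above (100 requests one second before a minute boundary, 100 one second after) against a minimal fixed-window counter and a minimal sliding-window log. Both classes are simplified, in-memory illustrations with hypothetical names, and timestamps are plain seconds.

```python
from collections import deque


class FixedWindow:
    def __init__(self, limit: int, window: float):
        self.limit, self.window = limit, window
        self.window_id = None
        self.count = 0

    def allow(self, now: float) -> bool:
        window_id = int(now // self.window)
        if window_id != self.window_id:   # crossed a boundary: reset the counter
            self.window_id, self.count = window_id, 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


class SlidingWindow:
    def __init__(self, limit: int, window: float):
        self.limit, self.window = limit, window
        self.stamps = deque()

    def allow(self, now: float) -> bool:
        # Drop timestamps that have aged out of the rolling interval.
        while self.stamps and self.stamps[0] <= now - self.window:
            self.stamps.popleft()
        if len(self.stamps) < self.limit:
            self.stamps.append(now)
            return True
        return False


def burst_across_boundary(limiter) -> int:
    """Count how many of 200 requests straddling a minute boundary get through."""
    times = [59.0] * 100 + [61.0] * 100
    return sum(limiter.allow(t) for t in times)


print(burst_across_boundary(FixedWindow(100, 60.0)))    # 200
print(burst_across_boundary(SlidingWindow(100, 60.0)))  # 100
```

The sliding window pays for its accuracy with one stored timestamp per admitted request, which is the extra state the table refers to.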

In practice, the algorithm matters less than people expect. A fixed-window limiter can pass up to twice its limit across a boundary, so one set at or below half the upstream ceiling absorbs the boundary burst without ever crossing the real limit. What actually determines whether you get throttled is whether the budget is shared across the fleet, not which of these three counts the requests.

Respect Retry-After, then add jitter

When an upstream throttles you, it usually tells you exactly when to come back. OpenAI returns retry-after-ms; Anthropic and most others return a Retry-After header in seconds; Stripe returns 429s you should back off on. This header is not a suggestion — it's the upstream's own scheduler telling you when capacity returns.

1. Read Retry-After and treat it as a floor. Never retry sooner. Retrying early keeps you throttled longer and wastes an attempt.
2. Add jitter on top. If 50 requests were all throttled in the same window, they all received the same Retry-After. Without jitter, they resume on the same millisecond and re-trigger the storm.
3. Fall back to exponential backoff only when there's no header. Some upstreams 503 without a Retry-After; that's when your own backoff-with-jitter takes over.
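The three steps above can be sketched as one delay function. This is a sketch under stated assumptions: header names are assumed to be lower-cased already, only the numeric forms of the header are parsed (Retry-After can also carry an HTTP date, which this sketch ignores), and the `base`, `cap`, and `max_jitter` defaults are placeholder values, not recommendations from any provider.

```python
import random
from typing import Callable, Mapping, Optional


def parse_retry_after(headers: Mapping[str, str]) -> Optional[float]:
    """Return the upstream's requested wait in seconds, or None if it sent none."""
    for name, divisor in (("retry-after-ms", 1000.0), ("retry-after", 1.0)):
        raw = headers.get(name)
        if raw is None:
            continue
        try:
            return max(0.0, float(raw) / divisor)
        except ValueError:
            continue  # e.g. an HTTP-date value; not handled in this sketch
    return None


def retry_delay(headers: Mapping[str, str], attempt: int,
                base: float = 1.0, cap: float = 60.0, max_jitter: float = 5.0,
                rng: Callable[[], float] = random.random) -> float:
    """Seconds to wait before the next try."""
    floor = parse_retry_after(headers)
    if floor is not None:
        # Header present: never go below it, only add jitter on top.
        return floor + rng() * max_jitter
    # No header: capped exponential backoff with full jitter.
    return rng() * min(cap, base * 2 ** attempt)
```

With a header present, the result is never below the upstream's value. Without one, the delay is drawn from a range that doubles per attempt until it hits `cap`.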

There's a deeper point here about classification. A 429 or 503 means 'come back later' — the work is still valid. A 400 or 401 means 'this will never succeed' — retrying is a bug. Treat backpressure (429/503/529) as a reason to wait, not a reason to fail, and treat client errors as permanent. Mixing them up is how you get either infinite retry loops or jobs that vanish on the first hiccup.
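A small classifier makes that split explicit. The sketch below is one possible mapping, not a standard: treating every 4xx other than 429 as permanent and every other 5xx as retryable is an assumption that will not fit every upstream, and the `Outcome` names are invented for the example.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    BACKPRESSURE = "backpressure"  # come back later; the work is still valid
    RETRYABLE = "retryable"        # transient failure; retry with backoff
    PERMANENT = "permanent"        # retrying cannot succeed


BACKPRESSURE_STATUSES = {429, 503, 529}


def classify(status: int) -> Outcome:
    if 200 <= status < 300:
        return Outcome.SUCCESS
    if status in BACKPRESSURE_STATUSES:
        return Outcome.BACKPRESSURE
    if status >= 500:
        return Outcome.RETRYABLE
    # Everything else (400, 401, and by assumption the other 4xx codes).
    return Outcome.PERMANENT
```

Checking the backpressure set before the generic 5xx branch matters: 503 and 529 would otherwise be counted as ordinary failures.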

Coordinating across a fleet

Sharing a budget across many workers means putting the limiter somewhere all of them can see. There are two classic ways to do it, and one managed shortcut:

Shared bucket in a central store. All workers consume from one Redis-backed counter or bucket keyed by upstream (e.g. anthropic-prod). Correct, but you own the store, the atomic decrement logic, the Retry-After pausing, and the failure modes when Redis itself is slow.
Single dispatcher. One process owns the budget and hands work to a worker pool that never calls the upstream directly. Simpler to reason about, but the dispatcher is now a bottleneck and a single point of failure.
A queue with a built-in per-queue limit. Hand the budget to the transport layer and stop coordinating in application code entirely.
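To make the first option concrete, here is a minimal in-process sketch of a shared budget keyed by upstream, with a pause hook for Retry-After. It is a stand-in, not a Redis implementation: a `threading.Lock` plays the role of the store's atomic operations, threads play the role of workers, and the `SharedBudget`, `try_acquire`, and `pause` names are hypothetical.

```python
import threading
import time
from typing import Callable, Dict, List


class SharedBudget:
    """One fixed-window budget per upstream key, shared by every caller."""

    def __init__(self, limit: int, window: float,
                 clock: Callable[[], float] = time.monotonic):
        self.limit, self.window, self.clock = limit, window, clock
        self.lock = threading.Lock()
        self.counters: Dict[str, List[int]] = {}   # key -> [window_id, count]
        self.paused_until: Dict[str, float] = {}   # key -> timestamp

    def try_acquire(self, key: str) -> bool:
        with self.lock:
            now = self.clock()
            if now < self.paused_until.get(key, float("-inf")):
                return False                       # upstream asked everyone to wait
            window_id = int(now // self.window)
            entry = self.counters.get(key)
            if entry is None or entry[0] != window_id:
                entry = self.counters[key] = [window_id, 0]
            if entry[1] >= self.limit:
                return False
            entry[1] += 1
            return True

    def pause(self, key: str, retry_after: float) -> None:
        """Called when the upstream throttles: hold the whole budget, not one worker."""
        with self.lock:
            until = self.clock() + retry_after
            self.paused_until[key] = max(until, self.paused_until.get(key, float("-inf")))
```

Twenty threads calling `try_acquire("anthropic-prod")` against a limit of 50 get 50 grants in total, the same as two threads would. A multi-host fleet needs the same state in a store every host can reach, which is where the operational cost described above comes from.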

The third option is what SimpleQ does. You create a queue with a fixed-window rate limit (rateLimitMax over rateLimitWindow), and every job in that queue counts against the same window — regardless of how many workers consume from it. Scaling from 2 workers to 20 doesn't change the budget, because the budget lives with the queue, not the worker.

create-rate-limited-queue.sh

bash

# One shared budget for an upstream API, enforced regardless of worker count.
curl -X POST https://api.simpleq.io/v1/queues \
  -H "Authorization: Bearer sq_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "anthropic-jobs",
    "rateLimitMax": 50,
    "rateLimitWindow": 60,
    "maxAttempts": 8,
    "backoffType": "exponential",
    "backoffDelay": 2
  }'

# Enqueue work. Workers receive it via your webhook, all under the same 50/60s budget.
curl -X POST https://api.simpleq.io/v1/queues/anthropic-jobs/jobs \
  -H "Authorization: Bearer sq_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "payload": {
      "model": "claude-sonnet-4-6",
      "input": "Summarize this support ticket..."
    },
    "idempotencyKey": "ticket-9281-summary"
  }'

SimpleQ is push-based: it durably stores the job and POSTs it to your own worker endpoint, where you run the actual API call. The official TypeScript SDK is @simpleq/sdk, and because the API is HTTP-first, any language works. There are queue templates for anthropic and openai that come pre-tuned for those providers' limits.

Riding out a sustained limit without burning attempts

Even a perfectly-sized shared budget can hit a wall when an upstream tightens capacity or you're processing a large backlog. The naive response is to count every throttle as a failed attempt — which means a sustained rate limit slowly eats through your maxAttempts cap and dead-letters jobs that would have succeeded if they'd just waited.

The better model separates failure from backpressure. When your worker sees a 429/503/529 from the upstream, it shouldn't report a failure — it should signal backpressure and let the transport redeliver later. In SimpleQ's three-signal ack protocol, that's a defer with a retryAfter: the job is rescheduled, and no attempt is burned. A job can ride out a sustained rate limit through many defers and still complete on the first real attempt once capacity returns.

Signal	Meaning	Attempt burned?
ack	Work succeeded.	—
nack (retryable)	Real failure; retry with backoff.	Yes
nack (non-retryable)	Permanent failure; send to DLQ.	Yes (terminal)
defer (retryAfter)	Backpressure / rate limited; reschedule.	No

Map the upstream's Retry-After straight into defer

When the upstream returns Retry-After: 30, defer the job with retryAfter: 30. You're forwarding the upstream's own scheduling decision into the transport layer — the job comes back exactly when capacity does, and the attempt counter is untouched.
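The mapping from upstream response to signal, and its effect on the attempt counter, can be sketched as below. The `Signal` shape is a hypothetical in-memory representation for illustration; it is not SimpleQ's wire format, and how a worker actually reports the signal is not shown here. The 30-second `default_wait` for a throttle that arrives without a header is also an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

BACKPRESSURE_STATUSES = {429, 503, 529}


@dataclass
class Signal:
    kind: str                            # "ack", "nack", or "defer"
    retryable: bool = False              # only meaningful for "nack"
    retry_after: Optional[float] = None  # only meaningful for "defer"


def signal_for(status: int, retry_after: Optional[float],
               default_wait: float = 30.0) -> Signal:
    """Turn an upstream status (and its Retry-After, if any) into a signal."""
    if 200 <= status < 300:
        return Signal("ack")
    if status in BACKPRESSURE_STATUSES:
        # Forward the upstream's own wait; fall back to a default if it sent none.
        wait = retry_after if retry_after is not None else default_wait
        return Signal("defer", retry_after=wait)
    return Signal("nack", retryable=status >= 500)


def attempts_used(signals: Iterable[Signal]) -> int:
    """Deliveries that count against maxAttempts: everything except defers."""
    return sum(1 for s in signals if s.kind != "defer")


# A job throttled three times before the upstream recovers.
history = [signal_for(status, 30.0) for status in (429, 429, 529, 200)]
print(attempts_used(history))  # 1
```

Counting every throttle as a failure would put the same job at four attempts, halfway to the `maxAttempts` of 8 in the queue example, without the job ever having really been tried.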

Putting it together

A rate-limiting strategy that survives production isn't one algorithm — it's a small stack of decisions:

1. One shared budget per upstream, sized below the real ceiling, expressed once for the whole fleet.
2. Retry-After respected as a floor, with jitter layered on so a throttled fleet doesn't resume in lockstep.
3. Clean classification: backpressure (429/503/529) waits; client errors (400/401) fail fast and don't retry.
4. Backpressure that doesn't burn attempts, so a sustained limit delays work instead of dead-lettering it.
5. Idempotency at the publish boundary, so retries and redeliveries don't double-charge or double-write.

Build it yourself with Redis and a careful classifier, or hand the budget to the transport. If you want the latter, SimpleQ gives you a per-queue shared rate limit, configurable backoff, a defer signal for backpressure that doesn't cost an attempt, and a dead-letter queue with replay — so your workers just make the call and report the outcome. See the bulk API sync use case for an end-to-end example of keeping a fleet under an upstream ceiling. For provider-specific depth, the companion posts on handling OpenAI rate limits in production and backpressure with 429/503/529 responses go deeper on the mechanics.

Frequently asked questions

What's the difference between fixed window, token bucket, and sliding window rate limiting?

Fixed window counts requests in discrete intervals (e.g. per minute) and resets at the boundary: simple, but it allows bursts at the edges. Token bucket refills tokens at a steady rate and lets you spend a burst up to the bucket size, which smooths traffic and naturally caps sustained throughput. Sliding window tracks requests over a rolling interval, so it avoids the boundary-burst problem of fixed window at the cost of more state. For most API-dependent backends, the question that matters more than the algorithm is whether the limit is shared across all your workers.
