Without retries, your background jobs are about 92% 'done' in practice. Workers crash, AI APIs 429, networks hiccup, deploys happen mid-flight. Adding retries with exponential backoff, jitter, idempotency keys, and a dead-letter queue takes you from 92% to 99.99% — and gives you a record of the jobs you couldn't complete instead of silently dropping them. This post covers the four primitives, the retry math behind them, and the bugs they catch.
Running a background job looks trivial until you ship it at volume. You pull a job off the queue, call an API, write a row, mark it done. Except the API returned a 429. Or your worker OOM-killed mid-write. Or the call succeeded and your process crashed before recording it — so you'll run it again and bill OpenAI twice. Or you gave up after one try and have no record of what didn't finish.
Reliable job delivery is the core problem behind every AI-heavy or API-dependent backend: you have an unreliable external dependency (an LLM, a payment API, your own worker), you need to keep trying without making things worse, and you need a record of what failed. This post is about the four primitives that get it right — and they apply equally whether the job calls OpenAI's gpt-4o-mini, Anthropic's claude-sonnet-4-6, or your own internal service.
"Done" vs actually done
If you graph the outcomes of a busy job worker that has no retry logic, the picture looks something like this:
- ~92% — succeeds on the first attempt
- ~5% — transient failure (upstream 5xx, 429, timeout) that would succeed on retry
- ~2% — slow recovery (a dependency's deploy or maintenance window), eventually succeeds
- ~1% — never recoverable (bad input, deleted record, permanent auth failure)
If you only run each job once, your real completion rate is 92%. That sounds high until you realize a service doing 10,000 jobs a day is dropping roughly 800 of them, and about 700 of those failures are recoverable noise, not real errors. Retries push completion of the recoverable work toward 99.99% without changing a line of your business logic. The remaining ~1% of all jobs is the work that actually needs a human, and the whole point of the primitives below is to make sure you can see exactly which jobs those are.
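As a quick check of the arithmetic, the outcome mix listed above can be turned into daily job counts. This is only a worked example; the helper name `jobsPerDay` is hypothetical and the percentages are the illustrative ones from the list.

```typescript
// Outcome mix from the list above, in whole percent of all jobs.
const mix = { firstTry: 92, transient: 5, slowRecovery: 2, permanent: 1 };

// Hypothetical helper: turn a percentage of daily volume into a job count.
function jobsPerDay(percent: number, dailyJobs: number): number {
  return (dailyJobs * percent) / 100;
}

const dailyJobs = 10_000;
const notDoneFirstTry = jobsPerDay(100 - mix.firstTry, dailyJobs);
const recoverable = jobsPerDay(mix.transient + mix.slowRecovery, dailyJobs);
const needsHuman = jobsPerDay(mix.permanent, dailyJobs);

console.log({ notDoneFirstTry, recoverable, needsHuman });
// { notDoneFirstTry: 800, recoverable: 700, needsHuman: 100 }
```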
Primitive 1: Exponential backoff with jitter
The retry curve has to be aggressive at first (catch transient blips quickly) and patient later (don't hammer a dependency that's down). A typical schedule for jobs that hit an external API:
| Attempt | Delay before retry |
|---|---|
| 1 | 10 seconds |
| 2 | 1 minute |
| 3 | 5 minutes |
| 4 | 30 minutes |
| 5 | 2 hours |
| 6 | 6 hours |
| 7 | 24 hours |
| 8 | Dead-letter |
The escalating delays are the easy part. Jitter is the part most homegrown retry loops skip, and it's the part that matters most under load. Add a random offset at each step (full jitter, or uniform within ±25% of the scheduled delay) so that when thousands of jobs fail during the same OpenAI or Anthropic outage, they don't all wake up at the same instant and re-saturate the API the moment it recovers.
```typescript
function nextDelayMs(attempt: number, opts: {
  initialMs: number;
  maxMs: number;
}): number {
  // Exponential growth, capped at maxMs
  const exp = Math.min(opts.maxMs, opts.initialMs * 2 ** attempt);
  // Full jitter: uniformly random in [0, exp)
  return Math.floor(Math.random() * exp);
}

// e.g. with initialMs = 1000, maxMs = 32000:
// attempt 0 -> 0-1000ms, attempt 1 -> 0-2000ms, attempt 5 -> 0-32000ms (capped)
```

In the simulations published on the AWS Architecture Blog, full jitter (uniform in [0, exp]) did as well as or better than partial jitter at spreading out a recovering herd; see its widely cited post on the subject. SimpleQ configures this per queue: choose exponential or fixed backoff and a maxAttempts cap (up to 20), and every job inherits the schedule.
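For comparison, the table-driven schedule with the ±25% jitter variant might look like the sketch below. The function name `scheduledDelayMs` and the injectable `random` parameter are assumptions made for illustration, not part of SimpleQ.

```typescript
// Delays from the schedule table above, in seconds, indexed by attempt - 1.
const SCHEDULE_SECONDS = [10, 60, 300, 1800, 7200, 21600, 86400];

// Returns the delay in ms before retrying after failed attempt `attempt`
// (1-based), or null once the schedule is exhausted and the job should be
// dead-lettered. `random` is injectable so the jitter can be tested.
function scheduledDelayMs(
  attempt: number,
  random: () => number = Math.random,
): number | null {
  if (!Number.isInteger(attempt) || attempt < 1) {
    throw new RangeError("attempt must be a positive integer");
  }
  const base = SCHEDULE_SECONDS[attempt - 1];
  if (base === undefined) return null; // attempt 8 and beyond: dead-letter
  const factor = 0.75 + random() * 0.5; // uniform in [0.75, 1.25)
  return Math.round(base * 1000 * factor);
}
```

Compared with full jitter, this keeps every retry close to its scheduled time, at the cost of spreading a recovering herd over a narrower window.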
A 429/503/529 with a Retry-After header isn't a failure — it's the dependency telling you to wait. Burning a retry attempt on it is wasteful, and burning all of them turns a temporary rate limit into a dead-lettered job. With SimpleQ, your worker replies with /defer and a retryAfter, and the job is redelivered later without consuming an attempt. A single job can ride out a sustained OpenAI rate limit across many defers and still complete on its original budget.
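A worker needs a number of seconds to pass along when it defers. HTTP allows Retry-After to be either a delay in seconds or an HTTP-date, so a small parser is useful. The sketch below is one way to do it; `retryAfterSeconds`, `deferBody`, the 30-second fallback, and the exact body shape are assumptions, with only the `retryAfter` field name taken from the text.

```typescript
// Parse an HTTP Retry-After value, which is either a number of seconds
// ("120") or an HTTP-date ("Wed, 21 Oct 2015 07:28:00 GMT").
// Returns whole seconds to wait, or null if the value is missing or unparseable.
function retryAfterSeconds(
  header: string | null,
  nowMs: number = Date.now(),
): number | null {
  if (header === null) return null;
  const value = header.trim();
  if (/^\d+$/.test(value)) return Number(value);
  const dateMs = Date.parse(value);
  if (Number.isNaN(dateMs)) return null;
  // A date in the past means "retry now", so clamp at zero.
  return Math.max(0, Math.ceil((dateMs - nowMs) / 1000));
}

// Hypothetical body for the defer call; falls back to a fixed wait when the
// dependency did not say how long to back off.
function deferBody(header: string | null, fallbackSeconds = 30): { retryAfter: number } {
  return { retryAfter: retryAfterSeconds(header) ?? fallbackSeconds };
}
```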
Primitive 2: Idempotency
Any retry-capable system will eventually deliver the same job twice. The classic case: your worker processed the job — called the LLM, wrote the row — but crashed (or a proxy timed out the response) before reporting success. The queue can't tell "done but unacknowledged" from "never ran," so it retries. Now you've made two paid Anthropic calls, or sent two emails, for one job. The fix lives on both sides of the boundary:
- Publish boundary. Dedupe the enqueue itself so a retried POST doesn't create two jobs. SimpleQ does this with an idempotencyKey on publish: the same key within the window collapses to one job.
- Handler boundary. Make the work itself idempotent. Key your side effects (the DB write, the charge, the outbound API call) on the stable job ID so a second delivery is a safe no-op rather than a duplicate.
The alternatives to a stable key — deduping on payload hash, timestamp, or user-id+action — are all subtly wrong and break in exactly the cases that matter. Pick one stable ID per unit of work (the order ID, the message ID, a UUID you generate at enqueue), thread it through both boundaries, and at-least-once delivery becomes effectively-once. We go deeper in idempotency keys for jobs.
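On the handler side, the idea can be sketched with an in-memory map standing in for a durable store. Everything here is illustrative: `handleOnce`, the `completed` map, and the `summarize` stub are hypothetical, and a real handler would use a database table with a unique constraint on the job ID.

```typescript
// In-memory stand-in for a durable store with a unique constraint on job ID.
// In production this would be a database table, not a Map.
const completed = new Map<string, string>();

// Runs `work` at most once per jobId; a repeat delivery returns the stored result.
function handleOnce(jobId: string, work: () => string): string {
  const previous = completed.get(jobId);
  if (previous !== undefined) return previous; // duplicate delivery: safe no-op
  const result = work();
  completed.set(jobId, result); // record the result before acknowledging
  return result;
}

let paidCalls = 0;
const summarize = (): string => {
  paidCalls += 1; // stands in for a billed LLM call
  return "summary text";
};

handleOnce("summary_doc-123", summarize);
handleOnce("summary_doc-123", summarize); // redelivery after a lost ack
console.log(paidCalls); // 1
```

A crash between the work and the record can still cause a repeat, which is why it helps to pass the same stable ID to the downstream API as well, where that API supports idempotency keys.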
Primitive 3: Dead-letter queue
After max attempts, the job has to go somewhere visible. A dead-letter queue (DLQ) is exactly that: a holding pen for jobs that exhausted their retries. The two non-negotiables:
- Inspect. You should be able to filter the DLQ by queue and date and see the full attempt history per job — every status code, every error body, every timestamp. That history is usually enough to tell a deleted-record permanent failure from an upstream-was-down transient one.
- Replay. You should be able to re-run jobs from the DLQ — single or in bulk. Most DLQ entries become completable once you fix the underlying cause (a bug, an expired key, a schema change), and replay means re-running the affected jobs instead of reconstructing them from logs.
Without a DLQ, exhausted retries simply vanish, and you discover the gap weeks later when a customer asks where their result is. With one, the worst case is "these 40 jobs are waiting for a fix" — which is a Tuesday, not an incident. SimpleQ provides a per-queue DLQ with single and bulk replay; the full pattern is in dead-letter queues explained.
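To make the inspect and replay requirements concrete, here is a minimal in-memory model of a DLQ. It is not SimpleQ's API: the `DeadJob` and `Attempt` shapes and the `inspect` and `replay` functions are hypothetical, and the sample data is invented for illustration.

```typescript
interface Attempt { at: string; status: number; error: string }
interface DeadJob { id: string; queue: string; deadAt: string; attempts: Attempt[] }

// Inspect: filter by queue and dead-letter date. Timestamps are assumed to be
// ISO 8601 strings in UTC, which compare chronologically as plain strings.
function inspect(dlq: DeadJob[], queue: string, since: string): DeadJob[] {
  return dlq.filter((job) => job.queue === queue && job.deadAt >= since);
}

// Replay: hand the selected jobs back to an enqueue function and return
// the jobs that remain in the DLQ.
function replay(dlq: DeadJob[], ids: string[], enqueue: (job: DeadJob) => void): DeadJob[] {
  const wanted = new Set(ids);
  const remaining: DeadJob[] = [];
  for (const job of dlq) {
    if (wanted.has(job.id)) enqueue(job);
    else remaining.push(job);
  }
  return remaining;
}

const dlq: DeadJob[] = [
  { id: "a", queue: "ai-jobs", deadAt: "2024-05-02T10:00:00Z",
    attempts: [{ at: "2024-05-01T10:00:00Z", status: 503, error: "upstream unavailable" }] },
  { id: "b", queue: "ai-jobs", deadAt: "2024-04-01T10:00:00Z", attempts: [] },
  { id: "c", queue: "emails", deadAt: "2024-05-03T10:00:00Z", attempts: [] },
];

const recent = inspect(dlq, "ai-jobs", "2024-05-01T00:00:00Z");
const requeued: string[] = [];
const left = replay(dlq, recent.map((job) => job.id), (job) => { requeued.push(job.id); });
console.log(requeued, left.map((job) => job.id)); // [ 'a' ] [ 'b', 'c' ]
```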
Primitive 4: Signed, verified delivery
A push-based queue delivers jobs by POSTing to your worker's URL — which means your worker is an endpoint on the internet, and it needs to know that an incoming job is really from the queue and not a forged request. The standard answer is HMAC-SHA256 over the raw request body:
```typescript
import crypto from "node:crypto";

// Verify the x-simpleq-signature header before doing any work.
export function verifyJob(rawBody: string, signature: string, secret: string) {
  const expected = crypto
    .createHmac("sha256", secret)
    .update(rawBody) // sign the RAW body, before JSON.parse
    .digest("hex");
  const expectedBuf = Buffer.from(expected);
  const signatureBuf = Buffer.from(signature);
  // timingSafeEqual throws if the lengths differ, so compare lengths first.
  const ok =
    expectedBuf.length === signatureBuf.length &&
    crypto.timingSafeEqual(expectedBuf, signatureBuf);
  if (!ok) throw new Error("bad signature");
  return JSON.parse(rawBody);
}
```

SimpleQ signs every delivery with a per-queue secret and sends the signature in the x-simpleq-signature header, computed over the raw body. Verify it against the unparsed bytes: if you recompute it after parsing and re-serializing the JSON, the signature won't match. Use per-queue secrets, not one global secret: if one leaks, you rotate one queue, not your whole fleet.
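The raw-bytes requirement is easy to see without any crypto at all. In this small illustration (the body is made up), parsing and re-serializing changes the whitespace, so an HMAC computed over the result would be over different bytes than the ones that were signed.

```typescript
// A body as it might arrive on the wire, with spaces after the colons.
const rawBody = '{"model": "gpt-4o-mini",  "max_tokens": 512}';

// JSON.stringify emits no optional whitespace, so the round trip differs.
const reserialized = JSON.stringify(JSON.parse(rawBody));

console.log(reserialized);             // {"model":"gpt-4o-mini","max_tokens":512}
console.log(reserialized === rawBody); // false
```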
Common gotchas
A 200 from a proxy isn't a 200 from your handler
If a load balancer or reverse proxy in front of your worker buffers the request and returns 200 before your application has run, the queue thinks the job succeeded while your code never saw it. The result is silent data loss that retries can't catch. Acknowledge success from inside your handler after the work is durable — never from infrastructure that sits in front of it.
Slow jobs and webhook timeouts
An LLM completion can take 30-90 seconds — far longer than a standard synchronous webhook timeout (SimpleQ's is a hard 15 seconds). If you try to finish the work before returning, the delivery times out, counts as a failure, and gets retried even though it would have succeeded. The fix is to decouple the ack from the work: return 200 fast to accept the job, then report the outcome when it's actually done.
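One way to sketch that decoupling is shown below. The names `acceptJob`, `work`, and `report` are hypothetical; `report` stands in for whatever call your worker makes to tell the queue the outcome, and the sketch assumes the process stays alive after the response is sent.

```typescript
type Outcome = "ack" | "nack";

// Accept the delivery immediately, then run the slow work and report the result.
// `work` and `report` are hypothetical callbacks supplied by the caller.
function acceptJob(
  jobId: string,
  work: () => Promise<void>,
  report: (jobId: string, outcome: Outcome) => void,
): { status: number; settled: Promise<void> } {
  const settled = Promise.resolve()
    .then(work)
    .then(
      () => report(jobId, "ack"),  // work finished: report success
      () => report(jobId, "nack"), // work threw or rejected: report failure
    );
  // The 200 goes back right away; `settled` resolves once the outcome is reported.
  return { status: 200, settled };
}

const reports: string[] = [];
const response = acceptJob(
  "job_1",
  async () => { /* call the LLM, write the row */ },
  (id, outcome) => { reports.push(`${id}:${outcome}`); },
);
console.log(response.status, reports.length); // 200 0 (nothing reported yet)
```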
Retrying things that will never succeed
Not every failure deserves a retry. A 400, a 401, a content-policy 403, or a missing-record error will fail identically on every attempt — retrying just wastes the budget and delays the trip to the DLQ. Classify failures and tell the queue which are retryable so permanent errors short-circuit straight to dead-letter.
| Outcome | What to report |
|---|---|
| Success | ack — job complete |
| 429 / 503 / 529 + Retry-After | defer with retryAfter — no attempt burned |
| 500, 502, 504, timeout | nack retryable=true — backoff and retry |
| 400, 401, 403, 404 | nack retryable=false — straight to DLQ |
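The table translates fairly directly into a classifier. The sketch below is one possible reading of it; the `Signal` type and `classify` function are hypothetical, and two choices are assumptions rather than anything the table states: a 429/503/529 without Retry-After is treated as a retryable failure, and any unlisted status is treated as transient.

```typescript
type Signal =
  | { kind: "ack" }
  | { kind: "defer"; retryAfter: number }
  | { kind: "nack"; retryable: boolean };

// `status` is the upstream HTTP status, or "timeout" when no response arrived.
// `retryAfter` is the parsed Retry-After value in seconds, if the header was present.
function classify(status: number | "timeout", retryAfter?: number): Signal {
  if (status === "timeout") return { kind: "nack", retryable: true };
  if (status >= 200 && status < 300) return { kind: "ack" };
  switch (status) {
    case 429:
    case 503:
    case 529:
      // Backpressure: defer only when the dependency said how long to wait.
      return retryAfter !== undefined
        ? { kind: "defer", retryAfter }
        : { kind: "nack", retryable: true }; // assumption: no header, so use backoff
    case 400:
    case 401:
    case 403:
    case 404:
      return { kind: "nack", retryable: false }; // permanent: straight to DLQ
    default:
      // 500, 502, 504 and anything unlisted: assumed transient.
      return { kind: "nack", retryable: true };
  }
}
```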
The three-signal ack protocol
All four primitives above need one thing to work: an honest signal from your worker about what actually happened. A single "did it return 200" boolean can't distinguish a real failure from backpressure, or a retryable error from a permanent one. SimpleQ uses three explicit signals your handler reports back per job:
1. ack: POST /v1/jobs/:id/ack. The work is durably done. Stop retrying.
2. nack: POST /v1/jobs/:id/nack with a retryable flag. It failed; retry it on the backoff schedule (retryable=true) or send it straight to the DLQ (retryable=false).
3. defer: POST /v1/jobs/:id/defer with retryAfter seconds. The dependency pushed back; redeliver later without burning an attempt.
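The three signals could be mapped to requests along these lines. The paths and the `retryable` and `retryAfter` names come from the list above; the `Report` type, the `buildReport` helper, and the assumption that the fields travel in a JSON body are illustrative guesses, so check the API reference before relying on the body shape.

```typescript
type Report =
  | { signal: "ack" }
  | { signal: "nack"; retryable: boolean }
  | { signal: "defer"; retryAfter: number };

// Build the path and (assumed) JSON body for one of the three signals.
function buildReport(
  jobId: string,
  report: Report,
): { method: "POST"; path: string; body: object } {
  const base = `/v1/jobs/${encodeURIComponent(jobId)}`;
  switch (report.signal) {
    case "ack":
      return { method: "POST", path: `${base}/ack`, body: {} };
    case "nack":
      return { method: "POST", path: `${base}/nack`, body: { retryable: report.retryable } };
    case "defer":
      return { method: "POST", path: `${base}/defer`, body: { retryAfter: report.retryAfter } };
  }
}

console.log(buildReport("job_1", { signal: "defer", retryAfter: 60 }).path);
// /v1/jobs/job_1/defer
```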
Pair that with ack mode for long-running work — return 200 within the webhook window to accept the job, then report the real outcome up to 300s later (the openai template) or 600s (the anthropic template). That's how a single LLM job survives a slow model, a rate limit, and a transient 503 without ever being falsely retried or silently lost.
What this looks like with SimpleQ
You don't assemble these four primitives by hand. You create a queue with the policy, then enqueue jobs against it — backoff, jitter, the retry cap, signing, and the DLQ all come from the queue config:
```shell
curl -X POST https://api.simpleq.io/v1/queues/ai-jobs/jobs \
  -H "Authorization: Bearer sq_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "payload": {
      "model": "gpt-4o-mini",
      "messages": [{ "role": "user", "content": "Summarize this document..." }],
      "max_tokens": 512
    },
    "idempotencyKey": "summary_doc-123"
  }'
```

The official TypeScript SDK, @simpleq/sdk on npm, is available today, with more language SDKs coming. The API is HTTP-first underneath, so any language that can make a request works. SimpleQ durably stores the job, POSTs it to your own worker with a signed body, and your handler replies with ack, nack, or defer. Backoff and jitter, the maxAttempts cap, and the dead-letter queue with single and bulk replay are all part of the queue you created.
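For callers using plain HTTP rather than the SDK, the same publish request could be assembled in TypeScript roughly as follows. The URL, headers, and field names mirror the curl example; the `buildPublish` helper and the `summary_` key prefix derived from a document ID are illustrative assumptions, and the API key is left as the same placeholder.

```typescript
interface PublishRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Build the publish request from the curl example. The idempotency key is
// derived from a stable document ID so a retried publish collapses to one job.
function buildPublish(
  queue: string,
  apiKey: string,
  docId: string,
  payload: object,
): PublishRequest {
  return {
    url: `https://api.simpleq.io/v1/queues/${encodeURIComponent(queue)}/jobs`,
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ payload, idempotencyKey: `summary_${docId}` }),
  };
}

const req = buildPublish("ai-jobs", "sq_live_...", "doc-123", {
  model: "gpt-4o-mini",
  max_tokens: 512,
});
// To send it: await fetch(req.url, { method: req.method, headers: req.headers, body: req.body })
```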
That's the whole shape: idempotency key on publish, signed delivery to your endpoint, retries with jittered backoff on the queue, backpressure handled by defer instead of burned attempts, and a visible DLQ for the jobs that genuinely need you. If you want the conceptual tour of how this fits together, read what is a managed job queue.
If you'd rather not build backoff, jitter, idempotency, and a dead-letter queue yourself, SimpleQ gives you all four as a managed transport that delivers to your own worker. See the AI job processing use case for an end-to-end example.
Ship reliable async work in minutes.
Free tier covers 10,000 job executions a month. No credit card.