A dead-letter queue (DLQ) is where jobs go after they exhaust retries or are rejected as non-retryable. The point is that exhausted jobs land somewhere inspectable instead of vanishing. Capture status, error, HTTP code, attempt number, and timestamp per attempt; use that trail to find root cause; then replay — one job to test a fix, or in bulk once the fix is shipped. A DLQ plus a per-attempt audit trail turns debugging from grep-and-prayer into a repeatable loop.
Every backend that runs work in the background has a quiet failure mode: a job retries, retries again, runs out of attempts, and then disappears. No error surfaces to the user, nothing breaks loudly, and the only evidence is a line buried in a log stream that rotated out a week ago. The customer's invoice never sent. The embedding never got written. You find out when someone files a support ticket. A dead-letter queue is the fix for that failure mode, and this post is about how to use one well.
What a dead-letter queue actually is
A dead-letter queue is a separate holding area for jobs that could not be processed successfully. When a job exhausts its retry budget — or your worker explicitly says "this will never succeed, stop trying" — it doesn't get dropped. It moves into the DLQ, where it sits, intact, with its original payload and its full failure history attached.
The name comes from messaging systems, but the concept is simple: dead means "we've stopped automatically retrying this," and the letter is the job itself. Two questions decide whether a DLQ is worth anything:
- Is the payload preserved? If the DLQ only holds an error string, you can read about the failure but you can't re-run the work. A useful DLQ keeps the full original job so it can be replayed verbatim.
- Is the failure history attached? A job that landed in the DLQ after one timeout is a different problem from one that failed the same way on all five attempts. Without the per-attempt trail, every DLQ entry looks identical, and you're back to guessing.
SimpleQ provisions a DLQ automatically for every queue you create. You don't configure a separate destination or wire up a routing rule: when a job in a queue such as ai-jobs exhausts its retries, it lands in that queue's DLQ with its payload and attempt history intact, ready to inspect or replay.
Why a vanished job is worse than a loud crash
A crash gets attention. An exception bubbles up, a pager goes off, someone looks. A job that quietly exhausts its retries gets none of that — it's the difference between a fire alarm and a slow leak. The slow leak is more dangerous precisely because it's silent.
Consider a worker that calls OpenAI or Anthropic to generate a summary, write it to your database, and notify the user. The downstream API has a bad ten minutes and returns 500s. Your retries fire — three, four, five attempts — and every one lands during the outage. The job exhausts maxAttempts and, without a DLQ, it's gone. The work was completely recoverable: the API came back two minutes later. But you no longer have the payload to retry it with.
Retries handle transient failures that resolve within the retry window. They do nothing for failures that outlast the window, or for bugs that fail deterministically every attempt. A DLQ is the safety net under your retry policy — it catches everything retries couldn't, so the worst case is "replay it later" instead of "it's gone." If you're still tuning your retry policy itself, see Why job retries matter.
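To make "outlasts the window" concrete, here is a rough sketch of how much wall-clock time a retry schedule covers. It assumes plain exponential backoff with no jitter; the 30-second base delay and the factor of 2 are illustrative values rather than SimpleQ defaults, and `retryWindowSeconds` is a hypothetical helper.

```typescript
// Seconds between the first attempt and the last one, assuming
// exponential backoff without jitter (an assumption for illustration).
function retryWindowSeconds(
  maxAttempts: number,
  baseDelaySeconds: number,
  factor: number = 2,
): number {
  let total = 0;
  // maxAttempts attempts are separated by maxAttempts - 1 waits.
  for (let i = 0; i < maxAttempts - 1; i++) {
    total += baseDelaySeconds * Math.pow(factor, i);
  }
  return total;
}

// Five attempts, 30 s base delay: waits of 30 + 60 + 120 + 240 seconds.
const windowSeconds = retryWindowSeconds(5, 30);
console.log(windowSeconds); // 450
```

With these assumed numbers, five attempts span 450 seconds (7.5 minutes), so a job whose first attempt lands at the start of a ten-minute outage runs out of attempts before the dependency recovers.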
What lands in the DLQ — and what doesn't
Not every failure should dead-letter, and conflating the categories is how DLQs fill up with noise. SimpleQ uses a three-signal ack protocol (ack, nack, and defer). Because a nack is either retryable or non-retryable, those three signals map onto four outcomes a job can have:
| Worker signal | Meaning | Effect on the DLQ |
|---|---|---|
| ack | Success — work is done | Job completes, never touches the DLQ |
| nack (retryable: true) | Transient failure, try again | Retries with backoff; dead-letters only after maxAttempts |
| nack (retryable: false) | Permanent failure, stop | Dead-letters immediately — no point retrying |
| defer (retryAfter) | Backpressure, not failure | Redelivered later; no attempt burned, never dead-lettered for this |
That defer row is the one teams miss. When a downstream API returns a 429 or 503 with a Retry-After header, that isn't a failure — it's the API asking you to wait. Burning a retry attempt on it is how a job that was always going to succeed ends up dead-lettered during a rate-limit spike. SimpleQ treats a deferral as backpressure: the job is redelivered after the requested delay, no attempt is consumed, and it can ride out a sustained rate limit and still complete. The DLQ stays reserved for genuine exhaustion and genuine permanent errors.
So a job dead-letters in exactly two cases: it nacked as retryable enough times to hit maxAttempts (capped at 20), or your worker nacked it as non-retryable on a permanent error — a 400 from a downstream API, a malformed payload, a validation failure that no number of retries will fix.
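As a sketch of how a worker might choose among the three signals, the function below maps a downstream HTTP response onto ack, nack, or defer. The `Signal` shape and `classifyDownstream` are hypothetical names for illustration; the source names the signals but not their wire format, and the exact status-code policy is an assumption you would tune for your own dependencies.

```typescript
// Hypothetical representation of the three signals described above.
type Signal =
  | { kind: "ack" }
  | { kind: "nack"; retryable: boolean; error: string }
  | { kind: "defer"; retryAfterSeconds: number };

function classifyDownstream(status: number, retryAfterHeader?: string): Signal {
  if (status >= 200 && status < 300) {
    return { kind: "ack" };
  }

  // 429/503 with a Retry-After in delta-seconds form is backpressure, not failure.
  // (Retry-After may also be an HTTP date; this sketch does not parse that form.)
  if ((status === 429 || status === 503) && retryAfterHeader !== undefined) {
    const trimmed = retryAfterHeader.trim();
    if (/^\d+$/.test(trimmed)) {
      return { kind: "defer", retryAfterSeconds: Number(trimmed) };
    }
  }

  // Timeouts, rate limits without a usable delay, and 5xx are treated as transient.
  if (status === 408 || status === 429 || status >= 500) {
    return { kind: "nack", retryable: true, error: `downstream returned ${status}` };
  }

  // Everything else (for example a 400) will not improve on retry.
  return { kind: "nack", retryable: false, error: `downstream returned ${status}` };
}
```

Under this policy a 503 carrying `Retry-After: 30` defers without burning an attempt, a bare 503 is a retryable nack, and a 400 dead-letters immediately.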
The per-attempt audit trail
A DLQ entry that just says "failed" is barely better than a log line. The value is in the history of how it failed. For each delivery attempt, you want four things recorded:
1. HTTP status your worker returned (or the timeout, if it never responded).
2. Error body or message: the actual reason, not just a category.
3. Attempt number: was this the first try or the fifth?
4. Timestamp: when it ran, so you can correlate against a known outage window.
With those four fields per attempt, the failure tells its own story. Compare two DLQ entries from the same queue:
```jsonc
// Job A (transient): a real outage that outlasted the retry window
{
  "id": "job_8fa2",
  "attempts": [
    { "n": 1, "ts": "2026-06-05T14:02:11Z", "status": 503, "error": "upstream unavailable" },
    { "n": 2, "ts": "2026-06-05T14:02:41Z", "status": 503, "error": "upstream unavailable" },
    { "n": 3, "ts": "2026-06-05T14:03:51Z", "status": 503, "error": "upstream unavailable" }
  ]
}

// Job B (deterministic bug): same error every time, retries never had a chance
{
  "id": "job_8fb9",
  "attempts": [
    { "n": 1, "ts": "2026-06-05T14:02:12Z", "status": 500, "error": "TypeError: cannot read 'id' of undefined" },
    { "n": 2, "ts": "2026-06-05T14:02:42Z", "status": 500, "error": "TypeError: cannot read 'id' of undefined" },
    { "n": 3, "ts": "2026-06-05T14:03:52Z", "status": 500, "error": "TypeError: cannot read 'id' of undefined" }
  ]
}
```

Job A is a replay candidate: the outage is over, re-running it will probably succeed, and you don't need to touch any code. Job B is a bug report: the same TypeError on every attempt means there is no point replaying until you ship a fix. You can tell these apart in seconds and, crucially, without writing a single line of instrumentation, because SimpleQ records this attempt history for every job automatically.
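The same judgment can be roughed out in code. The sketch below applies a simple heuristic to an attempt trail shaped like the two entries above; `triage`, the `Triage` labels, and the set of status codes treated as transient are all assumptions for illustration, not part of SimpleQ.

```typescript
interface Attempt {
  n: number;
  ts: string;
  status: number;
  error: string;
}

interface DlqEntry {
  id: string;
  attempts: Attempt[];
}

type Triage = "replay-candidate" | "needs-fix" | "inspect-manually";

// Assumed policy: these statuses usually point at a dependency, not your code.
const TRANSIENT_STATUSES = new Set<number>([408, 429, 502, 503, 504]);

function triage(entry: DlqEntry): Triage {
  const first = entry.attempts[0];
  if (first === undefined) return "inspect-manually";

  // Every attempt hit a transient-looking status: likely safe to replay as-is.
  if (entry.attempts.every((a) => TRANSIENT_STATUSES.has(a.status))) {
    return "replay-candidate";
  }

  // The identical error on several attempts suggests a deterministic bug.
  const sameError = entry.attempts.every((a) => a.error === first.error);
  if (sameError && entry.attempts.length > 1) return "needs-fix";

  return "inspect-manually";
}
```

Applied to the entries above, Job A would come back as a replay candidate and Job B as needing a fix. A heuristic like this only sorts the pile; mixed or single-attempt trails still deserve a human look.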
The inspect → fix → replay loop
A DLQ isn't a graveyard you check once a quarter. It's the input to a loop you run whenever jobs accumulate. The loop has three steps:
- Inspect. Open the DLQ, group entries by error signature, and read the attempt history. Distinguish transient (will succeed on replay as-is) from deterministic (needs a code or data fix first).
- Fix the root cause. For transient failures, the fix may be "nothing — the dependency recovered." For deterministic ones, ship the code change, correct the bad data, or fix the validation that should have caught the payload upstream.
- Replay. Re-deliver the failed jobs to your worker. Replay one to confirm the fix works, then replay the rest in bulk.
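For the inspect step, grouping by error signature is mostly a bucketing exercise. Here is a minimal sketch, assuming each dead-lettered job has been reduced to its final status and error message; `DeadJob` and `groupBySignature` are hypothetical names, and a real signature might also strip variable parts such as IDs from the message.

```typescript
interface DeadJob {
  id: string;
  lastStatus: number;
  lastError: string;
}

// Buckets job IDs by "status:error" so one fix (or one replay decision)
// can be applied to every job that failed the same way.
function groupBySignature(jobs: DeadJob[]): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const job of jobs) {
    const signature = `${job.lastStatus}:${job.lastError}`;
    const ids = groups.get(signature) ?? [];
    ids.push(job.id);
    groups.set(signature, ids);
  }
  return groups;
}
```

Each resulting bucket is then a single decision: replay as-is, or hold until a fix ships.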
The discipline that makes this safe is idempotency. When you replay a batch, some of those jobs may have partially succeeded before failing — the database write landed but the notification didn't. If your worker isn't idempotent, replay creates duplicate side effects. A publish-boundary idempotency key on the original job, or an idempotent worker handler, means a replay is safe to run without double-charging anyone. We cover this in depth in Idempotency keys for jobs.
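One way to make a handler tolerate partial success is to record each completed step under the job's idempotency key and skip it on redelivery. The sketch below is illustrative only: `StepLedger` is a hypothetical helper, and its in-memory `Set` stands in for a durable store (a database table with a unique constraint, for example) that a real worker would need.

```typescript
class StepLedger {
  // In-memory stand-in for durable storage; lost on restart.
  private done = new Set<string>();

  // Runs `effect` only if this (job, step) pair has not completed before.
  // Returns true if the effect ran, false if it was skipped.
  runOnce(jobKey: string, step: string, effect: () => void): boolean {
    const marker = `${jobKey}:${step}`;
    if (this.done.has(marker)) return false;
    effect(); // if this throws, the step is not marked done
    this.done.add(marker);
    return true;
  }
}

// Usage: the first delivery writes to the database, then fails to notify.
const ledger = new StepLedger();
const key = "summary_doc-4821";
ledger.runOnce(key, "db-write", () => console.log("row written"));
try {
  ledger.runOnce(key, "notify", () => {
    throw new Error("mail provider down");
  });
} catch {
  // The worker would nack here; the job eventually dead-letters.
}

// On replay, the write is skipped and only the notification runs.
ledger.runOnce(key, "db-write", () => console.log("row written"));
ledger.runOnce(key, "notify", () => console.log("user notified"));
```

The step is marked done only after its effect returns, so a failed notification is retried on replay while the already-landed write is not repeated.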
After shipping a fix, replay a single representative job from the DLQ and confirm it acks cleanly. Only then trigger the bulk replay. This catches the embarrassing case where your fix was right in theory but the deploy didn't actually go out, and saves you from re-failing the entire batch.
Single vs bulk replay
The two replay modes serve different moments in the loop, and using the right one keeps the process tight:
| Mode | When to use it | Why |
|---|---|---|
| Single replay | Verifying a fix; recovering one specific job | Tight feedback loop — you watch one job go green before risking the batch |
| Bulk replay | After a verified fix, clearing all jobs that failed the same way | One action drains the backlog instead of clicking through hundreds of entries |
A typical incident runs single-then-bulk. A downstream model API had a bad window and 400 jobs dead-lettered with the same 503 signature. You inspect, confirm they're all transient, replay one to verify the API has recovered, watch it ack, then bulk-replay the remaining 399. The backlog clears in one pass, and because each replayed job re-enters the normal pipeline, the same retry policy and rate limit apply on the way back out — so a still-flaky dependency won't cause a fresh stampede.
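If you script this rather than click through it, the single-then-bulk pattern is a canary check in front of a loop. The sketch below takes the replay call as a parameter because the source does not show a replay API; `ReplayOne` and `replayCanaryThenBulk` are hypothetical names, and the injected function would wrap whatever replay mechanism you actually use.

```typescript
// Hypothetical: replays one dead-lettered job and reports whether it acked.
type ReplayOne = (jobId: string) => Promise<"acked" | "failed">;

interface ReplayResult {
  replayed: string[];
  aborted: boolean;
}

async function replayCanaryThenBulk(
  jobIds: string[],
  replayOne: ReplayOne,
): Promise<ReplayResult> {
  const [canary, ...rest] = jobIds;
  if (canary === undefined) return { replayed: [], aborted: false };

  // Replay a single job first; if it does not ack, leave the rest untouched.
  if ((await replayOne(canary)) !== "acked") {
    return { replayed: [], aborted: true };
  }

  // The canary went green, so drain the remainder of the batch.
  const replayed = [canary];
  for (const id of rest) {
    await replayOne(id);
    replayed.push(id);
  }
  return { replayed, aborted: false };
}
```

If the canary fails, the remaining jobs stay in the DLQ with their history unchanged, which is the cheap version of discovering that the fix has not actually deployed.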
SimpleQ supports both single and bulk replay from the DLQ directly. A replayed job is re-delivered to your webhook as a fresh job with the original payload, so nothing about your worker code needs to change to support recovery.
From grep-and-prayer to a workflow
Without a DLQ, recovering a failed job is archaeology. You grep the logs for the job ID, hope the relevant lines haven't rotated out, reconstruct the payload by hand from whatever you logged, and re-submit it through some one-off script — praying you reconstructed it correctly and that the worker is idempotent enough to survive a guess. Multiply that by 400 jobs and recovery simply doesn't happen; the work is written off.
With a DLQ plus a per-attempt audit trail, the same incident is a workflow: open the queue, read why it failed, fix the cause, replay. The payload is preserved exactly, the failure history tells you whether to fix code or just re-run, and the replay button does the resubmission for you. Here's the producer side, the code that publishes jobs to the queue, set up so that exhaustion has somewhere to land:
```ts
import { SimpleQ } from "@simpleq/sdk";

const sq = new SimpleQ({ apiKey: process.env.SIMPLEQ_API_KEY! });

// The "ai-jobs" queue is configured once (webhook URL, maxAttempts,
// backoffType, DLQ) via POST /v1/queues or the dashboard; exhausted
// jobs land in its DLQ automatically with the full attempt trail.

// Publish a job. SimpleQ delivers it to YOUR worker over HTTP and,
// if it nacks past maxAttempts, dead-letters it.
await sq.publish("ai-jobs", {
  payload: {
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "Summarize this document..." }],
    max_tokens: 512,
  },
  // Dedupes the publish; also makes a later DLQ replay safe to run.
  idempotencyKey: "summary_doc-4821",
});
```

Your worker receives the delivery, does the work, and signals the outcome with ack, nack, or defer. SimpleQ handles the retry schedule, the backpressure deferrals, the dead-lettering, and the attempt history. The SDK is TypeScript (@simpleq/sdk on npm) but the API is HTTP-first, so any language that can answer a webhook works.
SimpleQ is a managed, push-based job queue: you POST a job, it's durably stored, and it's delivered to your own worker with retries, per-queue rate limiting, backpressure handling, and a dead-letter queue with single and bulk replay built in. If you want exhausted jobs to land somewhere inspectable instead of vanishing — and a one-click path back into your pipeline once you've fixed the cause — see the use cases for end-to-end examples, and pair this with Why job retries matter and Idempotency keys for jobs for the full durability story.