Running jobs longer than a webhook timeout: ack mode for async job acknowledgment

A synchronous webhook has a hard ~15-second ceiling, but LLM generations and slow third-party calls run longer. Ack mode lets you return 200 fast and report the real outcome out of band.

TL;DR

A synchronous webhook has a hard ~15-second ceiling, but LLM generations, video work, and slow third-party calls run longer. Ack mode fixes this: return 200 to confirm receipt, then report the real outcome out of band via POST /ack (success), /nack (failure, with a retryable flag), or /defer (backpressure, with retryAfter). ackTimeout sets the reporting deadline and ackTimeoutAction decides what happens if you miss it. Because work can be redelivered, make handlers idempotent. Use standard mode when work fits inside the timeout; use ack mode when it can't.

Push-based queues deliver work by making an HTTP request to your endpoint. That's clean and simple — until the work takes longer than the request is allowed to live. SimpleQ's standard delivery mode enforces a hard 15-second webhook timeout. A Claude or OpenAI generation, a video transcode, or a slow partner API can blow through that before it's anywhere near done. This post is about the mechanism that solves it: ack mode.

The 15-second wall

In standard delivery mode, the contract is synchronous and simple. SimpleQ POSTs the job to your webhook, your handler does the work, and you return a status code before the 15-second timeout expires. A 2xx means success. A non-2xx (or a timeout) means failure, and the job retries according to your queue's backoff policy.
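As a rough model of that contract, the queue's reading of a standard-mode delivery can be sketched as below. The type and function names are illustrative assumptions, not part of SimpleQ's API, and "retry" here assumes the job still has attempts left under the queue's backoff policy.

```typescript
// Hypothetical model of how a standard-mode delivery is interpreted.
// A delivery either produced an HTTP status or hit the webhook timeout.
type Delivery = { kind: "response"; status: number } | { kind: "timeout" };
type Verdict = "success" | "retry";

function interpretStandardDelivery(delivery: Delivery): Verdict {
  // No answer inside the 15-second window counts as a failure.
  if (delivery.kind === "timeout") return "retry";
  const is2xx = delivery.status >= 200 && delivery.status < 300;
  return is2xx ? "success" : "retry";
}
```

The point of the sketch is that the status code is the only channel for the outcome: once the response is sent, or the timeout fires, there is nothing more the handler can say.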

That works beautifully for fast work. It falls apart for slow work. Consider what actually runs longer than 15 seconds:

  • LLM generations. A long completion from claude-sonnet-4-6 or gpt-4o-mini with a large output can take 30-120 seconds, and reasoning-heavy prompts go further.
  • Media work. Video transcoding, rendering, and thumbnail extraction routinely run for minutes.
  • Slow third-party calls. Document processing, payment settlement, KYC checks, and bulk imports against partner APIs that are themselves slow.
  • Chained external steps. A single job that calls two or three upstreams in sequence, each with its own latency.

The wrong fixes are tempting. You could hold the HTTP connection open and hope nothing times out — but proxies, load balancers, and the platform's own ceiling will cut you off. You could fire-and-forget from inside the handler and return 200 immediately — but then a crash mid-work silently loses the job, because the queue already saw your 200 and considers it delivered. Ack mode is the fix that keeps durability.

What ack mode actually does

Ack mode decouples receipt from outcome. Your endpoint does two things at two different times:

  1. Acknowledge receipt fast. When SimpleQ POSTs the job, your handler returns 200 quickly; it's just saying "I have this, I'm on it." Kick the actual work onto a background task and return.
  2. Report the outcome later. When the work finishes, seconds or minutes later, you make a second HTTP call back to SimpleQ telling it what happened.

That second call is one of three signals against the job id. This is the ack protocol, and it's the same vocabulary whether the work took 200 milliseconds or 9 minutes:

| Signal | Endpoint | Meaning | Body |
| --- | --- | --- | --- |
| ack | POST /v1/jobs/:id/ack | Job succeeded, it's done | (none required) |
| nack | POST /v1/jobs/:id/nack | Job failed | { retryable: <boolean> } |
| defer | POST /v1/jobs/:id/defer | Backpressure: redeliver later, no attempt burned | { retryAfter: <seconds> } |
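One way to keep the three signals straight in application code is a small tagged union plus a function that turns a signal into the request to send. This is a sketch under assumptions: the `Signal` and `SignalRequest` types and `buildSignalRequest` are hypothetical names, the host is the one used in this post's examples, and authentication headers are left out.

```typescript
// Hypothetical types for the three outcome signals.
type Signal =
  | { kind: "ack" }
  | { kind: "nack"; retryable: boolean }
  | { kind: "defer"; retryAfter: number }; // seconds

interface SignalRequest {
  method: "POST";
  url: string;
  body?: string; // JSON string; omitted for ack
}

const BASE_URL = "https://api.simpleq.io"; // host as used in this post's examples

function buildSignalRequest(jobId: string, signal: Signal): SignalRequest {
  const url = `${BASE_URL}/v1/jobs/${encodeURIComponent(jobId)}/${signal.kind}`;
  switch (signal.kind) {
    case "ack":
      return { method: "POST", url };
    case "nack":
      return { method: "POST", url, body: JSON.stringify({ retryable: signal.retryable }) };
    case "defer":
      return { method: "POST", url, body: JSON.stringify({ retryAfter: signal.retryAfter }) };
  }
}
```

Modeling the signal as a union means the compiler rejects a nack without a retryable flag or a defer without a retryAfter.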

The retryable flag on /nack is load-bearing. retryable: true sends the job back through your queue's backoff policy for another attempt. retryable: false is a permanent failure — the job goes straight to the dead-letter queue without wasting more attempts on something that won't succeed (a malformed payload, a 400 from upstream, a content-policy rejection). For more on getting retry classification right, see why job retries matter.
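A minimal sketch of that classification, assuming failures have already been sorted into three rough buckets. The `Failure` type and `isRetryable` function are hypothetical, and the rule (network errors and 5xx are transient, everything else is permanent) is one reasonable policy rather than SimpleQ's own.

```typescript
// Hypothetical failure buckets for deciding the retryable flag on /nack.
type Failure =
  | { kind: "network" }               // connection reset, DNS failure, socket timeout
  | { kind: "http"; status: number }  // upstream answered with an error status
  | { kind: "policy" };               // content-policy rejection

function isRetryable(failure: Failure): boolean {
  switch (failure.kind) {
    case "network":
      return true; // likely transient
    case "policy":
      return false; // retrying the same input will be rejected again
    case "http":
      // 5xx suggests a fault on the upstream side; 4xx means the request itself is bad.
      return failure.status >= 500;
  }
}
```

Backpressure statuses such as 429, 503, and 529 are better handled with /defer than with a retryable nack, so a caller would normally check for those before reaching this function.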

Defer is not failure

If your downstream returns a 429, 503, or 529 with a Retry-After, don't nack it — defer it. POST /defer with retryAfter set to the upstream's wait, and SimpleQ redelivers the job later without burning an attempt. A job can ride out a sustained rate limit and still complete on its maxAttempts budget. This is covered in depth in handling 429, 503, and 529 backpressure.
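Turning the upstream's Retry-After into a retryAfter value takes a little care, because the header can be either a number of seconds or an HTTP date. The sketch below handles both; the function names and the 30-second fallback are assumptions made for illustration.

```typescript
// Statuses this post treats as backpressure rather than failure.
const BACKPRESSURE_STATUSES = new Set([429, 503, 529]);

// Convert a Retry-After header (delta-seconds or HTTP date) into whole seconds.
function retryAfterSeconds(header: string | null, now: Date, fallback = 30): number {
  if (header === null || header.trim() === "") return fallback;

  const asNumber = Number(header);
  if (Number.isFinite(asNumber)) {
    return asNumber > 0 ? Math.ceil(asNumber) : fallback;
  }

  const asDate = Date.parse(header); // NaN if the value is not a parseable date
  if (Number.isNaN(asDate)) return fallback;
  const delta = Math.ceil((asDate - now.getTime()) / 1000);
  return delta > 0 ? delta : fallback;
}

// Returns the /defer body for a backpressure response, or null if the status is not backpressure.
function deferBodyFor(
  status: number,
  retryAfterHeader: string | null,
  now: Date,
): { retryAfter: number } | null {
  if (!BACKPRESSURE_STATUSES.has(status)) return null;
  return { retryAfter: retryAfterSeconds(retryAfterHeader, now) };
}
```

Passing `now` in as a parameter keeps the function deterministic, which makes the date branch easy to test.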

ackTimeout and ackTimeoutAction

Once you've returned 200, SimpleQ can no longer tell whether your worker is busy or dead. So ack mode adds a deadline: ackTimeout. It's the maximum time the queue will wait for one of the three signals after delivery. If you call /ack, /nack, or /defer before it elapses, everything proceeds normally. If you don't, the worker is presumed dead and ackTimeoutAction takes over.

| Setting | What it controls | Typical value / when to use |
| --- | --- | --- |
| ackTimeout | How long to wait for an outcome after delivery | 300-600s, sized to your worst-case run |
| ackTimeoutAction = retry | On timeout, redeliver the job (counts like a retry) | Default; survives worker crashes |
| ackTimeoutAction = dead | On timeout, send straight to the DLQ | When a stuck job should never auto-retry |

Size ackTimeout to your realistic worst case, not your average. If a generation usually takes 40 seconds but a long one can hit four minutes, a 300-second ackTimeout gives you headroom; setting it to 60 seconds would redeliver healthy long-running jobs and double your work. The two built-in templates encode sensible defaults: the anthropic template uses a 600-second ackTimeout (Claude generations can be long), and the openai template uses 300 seconds.
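The sizing arithmetic is simple enough to write down. In this sketch the 25% headroom factor is an assumption chosen so that the four-minute example lands on 300 seconds; it is not a SimpleQ recommendation, and both function names are hypothetical.

```typescript
// Suggest an ackTimeout from the realistic worst-case run time plus headroom.
function suggestAckTimeout(worstCaseSeconds: number, headroom = 1.25): number {
  if (!(worstCaseSeconds > 0) || headroom < 1) {
    throw new RangeError("worstCaseSeconds must be positive and headroom at least 1");
  }
  return Math.ceil(worstCaseSeconds * headroom);
}

// A healthy job is redelivered if it is still running when the deadline passes.
function wouldRedeliverHealthyJob(runSeconds: number, ackTimeoutSeconds: number): boolean {
  return runSeconds >= ackTimeoutSeconds;
}
```

With a 240-second worst case, `suggestAckTimeout(240)` gives 300, while a 60-second ackTimeout would make `wouldRedeliverHealthyJob(240, 60)` true, which is the doubled-work case described above.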

Templates are starting points

Creating a queue from the anthropic or openai template wires up ack mode with a sensible ackTimeout, retry backoff, and rate-limit defaults for that provider. Override any field at queue-creation time — the template just saves you from setting six knobs by hand.

Idempotency for redelivered work

Ack mode introduces a possibility you have to design for: the same job can run more than once. If your endpoint returns 200, starts a 90-second generation, and then the process crashes before calling /ack, the ackTimeout eventually fires and (with the default ackTimeoutAction of retry) the job is delivered again to a fresh worker. That's exactly the durability you want, but it means the expensive work could run twice unless your handler is idempotent.

Two layers protect you, and you want both:

  • Publish-boundary idempotency. Pass an idempotencyKey when you publish the job. SimpleQ dedupes publishes that share a key, so a double-publish (your own retry of the enqueue call) doesn't create two jobs.
  • Handler-side idempotency. Make the work itself safe to re-run. Before starting, check whether a result already exists keyed by the job id; if it does, skip the work and just /ack. The job id is stable across redeliveries, so it's a natural idempotency key for the result.
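The handler-side check can be sketched with an in-memory stand-in for the result store. Everything here is hypothetical: `ResultStore` and `runOnce` are illustrative names, and a real handler would use a durable database with async calls instead of a Map and a synchronous callback.

```typescript
// In-memory stand-in for a durable result store keyed by job id.
class ResultStore<T> {
  private readonly results = new Map<string, T>();

  has(jobId: string): boolean {
    return this.results.has(jobId);
  }

  get(jobId: string): T | undefined {
    return this.results.get(jobId);
  }

  save(jobId: string, result: T): void {
    this.results.set(jobId, result);
  }
}

// Run the work only if no result is stored for this job id yet.
function runOnce<T>(
  store: ResultStore<T>,
  jobId: string,
  work: () => T,
): { result: T; ranWork: boolean } {
  if (store.has(jobId)) {
    // Redelivery: reuse the stored result and skip the expensive work.
    return { result: store.get(jobId) as T, ranWork: false };
  }
  const result = work();
  store.save(jobId, result); // persist before acking so a later redelivery short-circuits
  return { result, ranWork: true };
}
```

Note that this check does not protect against two deliveries running concurrently; it only makes a redelivery after completed work cheap, which is the crash-before-ack case described above.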

Here's the shape of an ack-mode handler that's safe under redelivery. It acknowledges receipt immediately, runs the work in the background, and reports the real outcome — defer on backpressure, nack with retryable on failure, ack on success:

app/api/worker/route.ts

```ts
import { verifySignature } from "@simpleq/sdk";

// App-specific helpers, not shown here: resultExists, saveResult, callUpstream, isTransient.

const SIMPLEQ = "https://api.simpleq.io";
const headers = {
  Authorization: `Bearer ${process.env.SIMPLEQ_KEY}`,
  "Content-Type": "application/json",
};

export async function POST(req: Request) {
  const raw = await req.text();
  // Verify HMAC-SHA256 over the raw body before trusting anything.
  verifySignature(raw, req.headers.get("x-simpleq-signature"), process.env.SIMPLEQ_SECRET!);

  const { id, payload } = JSON.parse(raw);

  // Ack mode: return 200 immediately, do the slow work out of band.
  // Catch rejections so a failed outcome call is logged instead of crashing the process.
  // On serverless runtimes, background work may be cut off once the response is sent;
  // use the platform's background-task hook there.
  void runJob(id, payload).catch((err) => console.error("runJob failed", id, err));
  return new Response("ok", { status: 200 });
}

async function runJob(id: string, payload: unknown) {
  // Handler-side idempotency: never redo finished work after a redelivery.
  if (await resultExists(id)) {
    await fetch(`${SIMPLEQ}/v1/jobs/${id}/ack`, { method: "POST", headers });
    return;
  }

  try {
    const res = await callUpstream(payload); // may run for minutes

    if (res.status === 429 || res.status === 503 || res.status === 529) {
      // Backpressure: defer, don't fail. No attempt is burned.
      // Retry-After may be missing or an HTTP date; fall back to 30 seconds.
      const parsed = Number(res.headers.get("retry-after"));
      const retryAfter = Number.isFinite(parsed) && parsed > 0 ? parsed : 30;
      await fetch(`${SIMPLEQ}/v1/jobs/${id}/defer`, {
        method: "POST", headers, body: JSON.stringify({ retryAfter }),
      });
      return;
    }

    // Any other non-2xx is a failure, not a result to save.
    if (!res.ok) {
      throw Object.assign(new Error(`upstream responded ${res.status}`), { status: res.status });
    }

    await saveResult(id, await res.json());
    await fetch(`${SIMPLEQ}/v1/jobs/${id}/ack`, { method: "POST", headers });
  } catch (err) {
    const retryable = isTransient(err); // 5xx / network → true; 400 / policy → false
    await fetch(`${SIMPLEQ}/v1/jobs/${id}/nack`, {
      method: "POST", headers, body: JSON.stringify({ retryable }),
    });
  }
}
```

The resultExists check is what makes a redelivered job cheap instead of a duplicate generation. Persist the result under the job id the moment the work completes, and the second delivery short-circuits to a clean /ack.

Standard mode vs ack mode: choosing

Ack mode is more capable, but it also has more moving parts: a second HTTP call, an ackTimeout to size, idempotency to get right. Don't reach for it when standard mode would do. The deciding question is simple: can the work reliably finish inside the 15-second webhook timeout?

| Workload | Mode | Why |
| --- | --- | --- |
| Fast DB write, cache warm, fan-out enqueue | Standard | Completes well under 15s; the synchronous 200/non-2xx contract is enough |
| Small synchronous API call | Standard | Predictably fast; no need for a second call |
| LLM generation (Claude, OpenAI) | Ack | Often exceeds 15s; report outcome when the completion lands |
| Video transcode / render | Ack | Runs for minutes; must report out of band |
| Slow partner API / document pipeline | Ack | Upstream latency is unpredictable and can exceed the timeout |

A practical rule: start in standard mode. If you see jobs failing on timeout — not on logic errors, but on the 15-second ceiling — that's the signal to move that queue to ack mode. Both modes share the same retry, backoff, rate-limit, and dead-letter machinery underneath; only the delivery contract changes.
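If you already log job durations, the "can it reliably finish" question can be approximated from data. The sketch below compares a nearest-rank 99th percentile against the timeout with a safety margin. The 0.8 margin, the choice of p99, and the function names are all assumptions, not anything SimpleQ prescribes.

```typescript
const WEBHOOK_TIMEOUT_SECONDS = 15;

// Nearest-rank percentile over a non-empty list of durations.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1] as number;
}

// Pick a delivery mode from observed handler durations, in seconds.
function chooseMode(durationsSeconds: number[], safetyFactor = 0.8): "standard" | "ack" {
  // With no data yet, follow the rule above and start in standard mode.
  if (durationsSeconds.length === 0) return "standard";
  const p99 = percentile(durationsSeconds, 99);
  return p99 < WEBHOOK_TIMEOUT_SECONDS * safetyFactor ? "standard" : "ack";
}
```

A queue whose slowest observed jobs sit near or above the ceiling comes out as "ack" even if its median is fast, which matches the advice to size for the worst case rather than the average.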

Setting up an ack-mode queue

Create the queue from the anthropic template (which sets ack mode plus a 600-second ackTimeout for you) or specify the ack-mode fields explicitly. Then publish jobs to it exactly as you would any other queue:

create-and-publish.sh

```bash
# Create an ack-mode queue from the anthropic template (600s ackTimeout).
curl -X POST https://api.simpleq.io/v1/queues \
  -H "Authorization: Bearer sq_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "long-generations",
    "template": "anthropic",
    "webhookUrl": "https://your-app.com/api/worker",
    "ackTimeoutAction": "retry"
  }'

# Publish a job. idempotencyKey dedupes the publish; the work runs out of band.
curl -X POST https://api.simpleq.io/v1/queues/long-generations/jobs \
  -H "Authorization: Bearer sq_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "payload": {
      "model": "claude-sonnet-4-6",
      "messages": [{ "role": "user", "content": "Write a detailed report on..." }]
    },
    "idempotencyKey": "report_user-123_abc"
  }'
```

The official TypeScript SDK — @simpleq/sdk on npm — wraps all of this, including signature verification and the ack/nack/defer calls. The API is HTTP-first underneath, so any language that can make a request and POST back an outcome works the same way. You can always inspect a job's state with GET /v1/jobs/:id while you're wiring things up.
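For readers working without the SDK, the signature check can be approximated with Node's built-in crypto module. This is a sketch under assumptions: it supposes the signature is a hex-encoded HMAC-SHA256 of the raw body, which this post does not confirm, and `signBody` and `isValidSignature` are hypothetical names rather than SDK functions. Check the actual header format before relying on it.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Compute a hex HMAC-SHA256 over the raw request body (encoding is an assumption).
function signBody(rawBody: string, secret: string): string {
  return createHmac("sha256", secret).update(rawBody, "utf8").digest("hex");
}

// Compare the received signature with the expected one in constant time.
function isValidSignature(rawBody: string, signatureHex: string | null, secret: string): boolean {
  if (signatureHex === null) return false;
  const expected = Buffer.from(signBody(rawBody, secret), "hex");
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}
```

Verifying against the raw body string, before JSON.parse, matters because re-serializing parsed JSON can change whitespace or key order and break the comparison.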

Summary

The 15-second webhook timeout is a feature, not a bug — it keeps fast work honest. Ack mode is the escape hatch for the work that genuinely can't fit: confirm receipt with a fast 200, run the slow work out of band, then report the real outcome with /ack, /nack (with the retryable flag), or /defer (with retryAfter). Size ackTimeout to your worst case, pick ackTimeoutAction deliberately, and make handlers idempotent so a redelivered job is cheap rather than duplicated.

If you'd rather not build the receipt-then-report plumbing, the dead-worker detection, and the redelivery loop yourself, SimpleQ gives you ack mode out of the box — durable acceptance, three-signal acks, ackTimeout-based crash recovery, idempotent publishes, and a dead-letter queue with replay. See the ack-mode processing use case for an end-to-end example.

Frequently asked questions

How do I run a job that takes longer than the webhook timeout?

Use ack mode. Your endpoint returns 200 fast to confirm it received the job, then keeps working in the background. When the work finishes, which can be minutes later, you POST the real outcome to /v1/jobs/:id/ack, /nack, or /defer. The delivery no longer has to complete inside the synchronous request, so the ~15-second HTTP ceiling stops mattering.