Webhook HA Architecture with Dead Letter Queues [2026]
Bottom Line
High-availability webhook systems do not depend on the destination being healthy right now. They stay available by acknowledging quickly, queuing durably, retrying safely, and isolating poison messages in a dead letter queue.
Key Takeaways
- Return 2xx only after persistence; fast ACKs protect your ingress path
- Set SQS maxReceiveCount high enough for transient failures, not at 1
- Keep DLQ retention longer than the source queue retention for standard queues
- Use idempotency keys so retries and replays do not create duplicate side effects
Webhook delivery is a distributed systems problem disguised as an HTTP POST. If your architecture waits on downstream availability, a slow or failing subscriber can take out your ingestion path. The safer design is to accept the event quickly, persist it durably, hand it to a queue, and let delivery workers retry independently. In this tutorial, you will build that pattern with Node.js and Amazon SQS dead letter queues, plus the operational checks that keep it stable under failure.
- ACK only after the event is durably stored or queued.
- Use a source queue for normal retries and a DLQ for poison messages.
- Make delivery idempotent with a stable event ID per attempt group.
- Monitor backlog, age, and DLQ depth before customers report misses.
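One way to get the stable event ID from the third point is to derive it deterministically from fields that do not change when the sender retries. This sketch assumes the inbound payload carries a provider-side `id` and `type`; adjust the key fields to your actual contract:

```javascript
import crypto from 'node:crypto';

// Derive a deterministic eventId so sender retries of the same event
// map to the same delivery record. Assumes the payload carries a
// provider-side `id` and `type` (an assumption about your contract).
function stableEventId(tenantId, payload) {
  return crypto
    .createHash('sha256')
    .update(`${tenantId}:${payload.type}:${payload.id}`)
    .digest('hex');
}
```

With this scheme, a duplicate POST from the sender produces the same eventId, so downstream idempotency checks collapse it automatically.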
Step 1: ACK fast and persist the event
Bottom Line
High availability comes from separating webhook intake from webhook delivery. Your public endpoint should verify, persist, enqueue, and return quickly; retries and failures belong in asynchronous workers and the DLQ.
Prerequisites
- A Node.js service with Express.
- An AWS account with two SQS standard queues: one source queue and one DLQ.
- A subscriber table or config store that maps a tenant to a destination URL and signing secret.
- Basic familiarity with HMAC verification, retryable HTTP errors, and CloudWatch metrics.
- Official references: SQS dead-letter queues, DLQ retention guidance, and Lambda with SQS.
The first reliability rule is simple: do not do the outbound delivery work on the same request thread that accepts the webhook. Your ingress service should complete four actions and stop:
- Verify the sender signature.
- Assign a stable eventId for idempotency.
- Write the delivery job to durable storage or SQS.
- Return 202 immediately.
This keeps the public endpoint healthy even when subscriber systems are timing out. The code below uses Express and the AWS SDK to enqueue a delivery job. Before you paste JSON policy blobs into your infra repo, run them through TechBytes Code Formatter so escaping errors do not turn queue setup into a deploy-time surprise.
import express from 'express';
import crypto from 'node:crypto';
import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs';
const app = express();
app.use(express.json({ verify: (req, _res, buf) => { req.rawBody = buf; } }));
const sqs = new SQSClient({ region: process.env.AWS_REGION });
const queueUrl = process.env.WEBHOOK_QUEUE_URL;
function verifySignature(rawBody, signature, secret) {
const expected = crypto
.createHmac('sha256', secret)
.update(rawBody)
.digest('hex');
const provided = Buffer.from(signature || '', 'utf8');
const expectedBuf = Buffer.from(expected, 'utf8');
// timingSafeEqual throws on length mismatch, so compare lengths first
return provided.length === expectedBuf.length && crypto.timingSafeEqual(provided, expectedBuf);
}
app.post('/webhooks/:tenantId', async (req, res) => {
const tenantId = req.params.tenantId;
const signature = req.get('x-signature');
const secret = process.env.INGRESS_SIGNING_SECRET;
if (!verifySignature(req.rawBody, signature, secret)) {
return res.status(401).json({ error: 'invalid signature' });
}
const eventId = crypto.randomUUID();
const message = {
eventId,
tenantId,
eventType: req.body.type,
payload: req.body,
receivedAt: new Date().toISOString()
};
await sqs.send(new SendMessageCommand({
QueueUrl: queueUrl,
MessageBody: JSON.stringify(message)
}));
return res.status(202).json({ accepted: true, eventId });
});
app.listen(3000);

If you return 200 before durable persistence, you have acknowledged an event you may never deliver. That is not high availability; it is silent loss.

Step 2: Create the source queue and DLQ
Amazon SQS DLQs work by attaching a redrive policy to the source queue. AWS defines maxReceiveCount as the number of receives before SQS moves the message to the DLQ. AWS also recommends keeping the source queue and DLQ in the same account and Region for best performance. For standard queues, AWS documents that the message expiration is still based on the original enqueue timestamp, which is why the DLQ retention period should be longer than the source queue retention period.
Create the DLQ first, then the source queue. The source queue below uses long polling, a longer visibility timeout, and a redrive policy that moves a message after repeated failures.
aws sqs create-queue \
--queue-name webhook-delivery-dlq \
--attributes MessageRetentionPeriod=1209600
aws sqs create-queue \
--queue-name webhook-delivery \
--attributes file://queue-attributes.json

queue-attributes.json:

{
"VisibilityTimeout": "120",
"ReceiveMessageWaitTimeSeconds": "20",
"MessageRetentionPeriod": "345600",
"RedrivePolicy": "{\"deadLetterTargetArn\":\"arn:aws:sqs:us-east-1:123456789012:webhook-delivery-dlq\",\"maxReceiveCount\":\"5\"}"
}

Use these settings as a starting point, not dogma:
- VisibilityTimeout: longer than your normal delivery attempt, so the same message does not reappear while a worker is still processing it.
- ReceiveMessageWaitTimeSeconds: enables long polling to reduce empty receives.
- maxReceiveCount: high enough to absorb transient subscriber failures, low enough to quarantine poison messages promptly.
- DLQ retention: longer than the source queue retention for standard queues, per AWS guidance.
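These relationships can be sanity-checked in code before you deploy. The sketch below assumes you have already fetched the two queues' attribute maps (the shapes mirror what SQS GetQueueAttributes returns, where every value is a string):

```javascript
// Sanity-check a source queue + DLQ pairing. `source` and `dlq` are
// plain attribute maps as returned by SQS GetQueueAttributes.
// Throws with a reason if the pairing looks unsafe.
function checkRedriveConfig(source, dlq) {
  if (!source.RedrivePolicy) {
    throw new Error('source queue has no RedrivePolicy');
  }
  const policy = JSON.parse(source.RedrivePolicy);
  if (!(Number(policy.maxReceiveCount) > 1)) {
    throw new Error('maxReceiveCount of 1 dead-letters on the first transient failure');
  }
  // For standard queues, expiry is based on the original enqueue time,
  // so the DLQ must retain messages longer than the source queue.
  if (Number(dlq.MessageRetentionPeriod) <= Number(source.MessageRetentionPeriod)) {
    throw new Error('DLQ retention should exceed source queue retention');
  }
  return true;
}
```

Running a check like this in CI against your real queue attributes catches the classic misconfiguration where a DLQ silently expires messages before anyone inspects them.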
Step 3: Process deliveries with safe retries
Your worker is where availability is won or lost. It should treat delivery as an idempotent operation, include a stable event identifier, and throw on retryable failures so SQS can re-drive the message automatically after the visibility timeout. A typical policy is:
- Success on any 2xx response.
- Retry on timeouts, connection errors, and 5xx responses.
- Usually retry on 429 with backoff.
- Consider 4xx like 400 or 404 terminal unless your integration contract says otherwise.
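The policy above reduces to two small pure functions. The backoff numbers here are illustrative defaults, not AWS-mandated values; with SQS you would apply the delay by calling ChangeMessageVisibility on the message rather than sleeping in the worker:

```javascript
// Classify an HTTP status per the retry policy above.
function isRetryable(status) {
  if (status >= 200 && status < 300) return false; // success, nothing to retry
  if (status === 429) return true;                 // throttled: retry with backoff
  if (status >= 500) return true;                  // server-side failure
  return false;                                    // other 4xx: terminal
}

// Exponential backoff with full jitter, capped. `attempt` starts at 1.
function backoffDelayMs(attempt, baseMs = 1000, capMs = 60000) {
  const ceiling = Math.min(capMs, baseMs * 2 ** (attempt - 1));
  return Math.floor(Math.random() * ceiling);
}
```

Full jitter spreads retries out so a burst of failures from one subscriber does not hammer it again in lockstep when the messages become visible.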
import crypto from 'node:crypto';
import { SQSClient, DeleteMessageCommand, ReceiveMessageCommand } from '@aws-sdk/client-sqs';
const sqs = new SQSClient({ region: process.env.AWS_REGION });
const queueUrl = process.env.WEBHOOK_QUEUE_URL;
function signPayload(body, secret) {
return crypto.createHmac('sha256', secret).update(body).digest('hex');
}
// getSubscriberConfig, markDelivered, and recordAttemptFailure are
// app-specific persistence helpers, not shown here.
async function deliver(message) {
const subscriber = await getSubscriberConfig(message.tenantId);
const body = JSON.stringify(message.payload);
const response = await fetch(subscriber.url, {
method: 'POST',
headers: {
'content-type': 'application/json',
'x-event-id': message.eventId,
'x-webhook-signature': signPayload(body, subscriber.secret)
},
body,
signal: AbortSignal.timeout(8000)
});
if (response.status >= 200 && response.status < 300) return;
if (response.status === 429 || response.status >= 500) {
throw new Error(`retryable status ${response.status}`);
}
throw new Error(`terminal status ${response.status}`);
}
async function poll() {
const result = await sqs.send(new ReceiveMessageCommand({
QueueUrl: queueUrl,
MaxNumberOfMessages: 10,
WaitTimeSeconds: 20
}));
for (const item of result.Messages || []) {
const message = JSON.parse(item.Body);
try {
await deliver(message);
await markDelivered(message.eventId);
await sqs.send(new DeleteMessageCommand({
QueueUrl: queueUrl,
ReceiptHandle: item.ReceiptHandle
}));
} catch (err) {
// Record the failure and leave the message on the queue; SQS will
// re-deliver it after the visibility timeout and dead-letter it once
// maxReceiveCount is exceeded. Keep processing the rest of the batch.
await recordAttemptFailure(message.eventId, String(err));
}
}
}
setInterval(() => {
poll().catch(console.error);
}, 1000);

If you deliver from Lambda with an SQS event source instead of a polling worker, enable partial batch responses so one bad record does not fail the whole batch:

aws lambda update-event-source-mapping \
--uuid a1b2c3d4-5678-90ab-cdef-11111EXAMPLE \
--function-response-types ReportBatchItemFailures

Step 4: Verify expected behavior
Now test three paths: the happy path, a transient failure, and a poison message. You want evidence that ingestion stays fast even when delivery is not.
1. Send a test event
curl -X POST http://localhost:3000/webhooks/acme \
-H 'content-type: application/json' \
-H 'x-signature: REPLACE_WITH_VALID_SIGNATURE' \
-d '{"type":"invoice.paid","invoiceId":"inv_123"}'

Expected ingress response:
{
"accepted": true,
"eventId": "9d7c0d58-0b0a-4e2a-bb86-3cf77a0f0bde"
}

2. Point the subscriber at a temporary failure endpoint
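One way to simulate the temporary failure is a stub that rejects the first few attempts and then recovers. The sketch below is framework-agnostic: it produces the status code a test endpoint would respond with, so you can wrap it in any HTTP handler:

```javascript
// Returns a responder that yields 503 for the first `failures` calls,
// then 200 -- enough to exercise the retry path end to end.
function flakySubscriber(failures) {
  let calls = 0;
  return () => {
    calls += 1;
    return calls <= failures ? 503 : 200;
  };
}
```

Point the worker at an endpoint backed by flakySubscriber(3) and you should see three retryable failures followed by a successful delivery, well before maxReceiveCount is reached.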
Expected behavior:
- The ingress endpoint still returns quickly with 202.
- The source queue message remains in flight until the visibility timeout expires.
- The worker logs repeated retryable failures.
- After maxReceiveCount is exceeded, the message appears in the DLQ.
3. Inspect metrics
- Source queue backlog should rise during downstream outages, then drain when workers recover.
- DLQ message count should stay near zero during healthy operation.
- Message age should remain within your delivery SLO.
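These checks reduce to a few threshold comparisons you can wire into whatever alerting you use. The field names and thresholds below are illustrative, not CloudWatch metric names:

```javascript
// Evaluate a queue health snapshot against simple SLO thresholds.
// Returns the list of triggered alerts (empty means healthy).
function queueAlerts({ backlog, oldestAgeSeconds, dlqDepth }, slo) {
  const alerts = [];
  if (backlog > slo.maxBacklog) alerts.push('backlog');
  if (oldestAgeSeconds > slo.maxAgeSeconds) alerts.push('message-age');
  if (dlqDepth > 0) alerts.push('dlq-not-empty');
  return alerts;
}
```

Note that DLQ depth alerts at any value above zero: a dead-lettered message is an operational event to investigate, not a metric to trend.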
Troubleshooting: top 3 issues
1. Messages never reach the DLQ
- Check that the source queue actually has a RedrivePolicy.
- Confirm the worker is failing the message instead of swallowing exceptions.
- Verify maxReceiveCount is not set unrealistically high for your test.
2. The same event is delivered twice
- Assume at-least-once delivery and add idempotency checks in the subscriber.
- Persist delivery state by eventId before applying side effects.
- For Lambda + SQS, enable partial batch responses so one bad record does not replay a whole batch.
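A minimal subscriber-side idempotency guard can be sketched as a seen-set keyed by eventId. In production the Set would be a database table with a unique key or a Redis key with a TTL, not in-process memory:

```javascript
// Apply a side effect at most once per eventId. The Set stands in for
// durable storage; swap it for a unique-keyed insert in production.
function makeIdempotentHandler(applySideEffect) {
  const seen = new Set();
  return (eventId, payload) => {
    if (seen.has(eventId)) return false; // duplicate: skip
    seen.add(eventId);
    applySideEffect(payload);
    return true;
  };
}
```

With this guard in place, at-least-once delivery from SQS or an operator replay from the DLQ becomes harmless: the duplicate is recognized and dropped before any side effect runs.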
3. Queue depth grows but workers look healthy
- Compare worker concurrency to downstream rate limits and response latency.
- Increase visibility timeout if long-running requests are reappearing too early.
- Check for slow DNS, TLS, or network egress problems rather than application exceptions alone.
What's next
Once the base pattern is stable, harden it in layers:
- Add per-tenant rate limiting so one noisy subscriber does not consume all worker capacity.
- Store attempt history and expose a replay endpoint for operators.
- Emit structured logs with eventId, tenant, attempt number, latency, and terminal reason.
- Add circuit breaking for chronically failing endpoints and alert on DLQ depth.
- If you need stronger ordering semantics, evaluate FIFO carefully; AWS notes that DLQs can break exact order expectations.
The main architectural decision does not change: keep the public webhook edge thin, make delivery asynchronous, and treat the DLQ as an operational workflow rather than a trash can. That is the difference between a webhook system that survives downstream chaos and one that turns every customer outage into your outage.
Frequently Asked Questions
Why do webhook systems need a dead letter queue?
Because a subscriber that never recovers would otherwise keep a failing message cycling through retries forever. A DLQ quarantines poison messages so the source queue stays healthy and operators can inspect or replay them.
What should count as a retryable webhook failure?
Timeouts, connection errors, 429, and most 5xx responses should usually be retried. Treat terminal 4xx responses carefully, because retrying a bad URL or invalid payload often just burns capacity.
How many retries should I allow before sending a webhook to the DLQ?
Enough to absorb transient subscriber outages without delaying quarantine of genuinely poisoned messages. The maxReceiveCount of 5 used in this tutorial is a reasonable starting point; tune it against your visibility timeout and delivery SLO.
Can I guarantee exactly-once webhook delivery with SQS and a DLQ?
No. SQS standard queues deliver at least once, so design for duplicates: use a stable eventId and idempotent subscribers rather than chasing exactly-once semantics.