System Architecture

Webhook HA Architecture with Dead Letter Queues [2026]

Dillip Chowdary
Tech Entrepreneur & Innovator · May 07, 2026 · 8 min read

Bottom Line

High-availability webhook systems do not depend on the destination being healthy right now. They stay available by acknowledging quickly, queuing durably, retrying safely, and isolating poison messages in a dead letter queue.

Key Takeaways

  • Return 2xx only after persistence; fast ACKs protect your ingress path
  • Set SQS maxReceiveCount high enough for transient failures, not at 1
  • Keep DLQ retention longer than the source queue retention for standard queues
  • Use idempotency keys so retries and replays do not create duplicate side effects

Webhook delivery is a distributed systems problem disguised as an HTTP POST. If your architecture waits on downstream availability, a slow or failing subscriber can take out your ingestion path. The safer design is to accept the event quickly, persist it durably, hand it to a queue, and let delivery workers retry independently. In this tutorial, you will build that pattern with Node.js and Amazon SQS dead letter queues, plus the operational checks that keep it stable under failure.

  • ACK only after the event is durably stored or queued.
  • Use a source queue for normal retries and a DLQ for poison messages.
  • Make delivery idempotent with a stable event ID per attempt group.
  • Monitor backlog, age, and DLQ depth before customers report misses.

Step 1: ACK fast and persist the event

Bottom Line

High availability comes from separating webhook intake from webhook delivery. Your public endpoint should verify, persist, enqueue, and return quickly; retries and failures belong in asynchronous workers and the DLQ.

Prerequisites

  • A Node.js service with Express.
  • An AWS account with two SQS standard queues: one source queue and one DLQ.
  • A subscriber table or config store that maps a tenant to a destination URL and signing secret.
  • Basic familiarity with HMAC verification, retryable HTTP errors, and CloudWatch metrics.
  • Official references: SQS dead-letter queues, DLQ retention guidance, and Lambda with SQS.

The first reliability rule is simple: do not do the outbound delivery work on the same request thread that accepts the webhook. Your ingress service should complete four actions and stop:

  1. Verify the sender signature.
  2. Assign a stable eventId for idempotency.
  3. Write the delivery job to durable storage or SQS.
  4. Return 202 immediately.

This keeps the public endpoint healthy even when subscriber systems are timing out. The code below uses Express and the AWS SDK to enqueue a delivery job. Lint any JSON policy blobs, such as the redrive policy below, before committing them to your infra repo; an escaping error can turn queue setup into a deploy-time surprise.

import express from 'express';
import crypto from 'node:crypto';
import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs';

const app = express();
app.use(express.json({ verify: (req, _res, buf) => { req.rawBody = buf; } }));

const sqs = new SQSClient({ region: process.env.AWS_REGION });
const queueUrl = process.env.WEBHOOK_QUEUE_URL;

function verifySignature(rawBody, signature, secret) {
  const expected = crypto
    .createHmac('sha256', secret)
    .update(rawBody)
    .digest('hex');

  const provided = Buffer.from(signature || '', 'utf8');
  const expectedBuf = Buffer.from(expected, 'utf8');

  // timingSafeEqual throws if the buffers differ in length, so a missing
  // or truncated header must be rejected before the comparison.
  if (provided.length !== expectedBuf.length) return false;
  return crypto.timingSafeEqual(provided, expectedBuf);
}

app.post('/webhooks/:tenantId', async (req, res) => {
  const tenantId = req.params.tenantId;
  const signature = req.get('x-signature');
  const secret = process.env.INGRESS_SIGNING_SECRET;

  if (!verifySignature(req.rawBody, signature, secret)) {
    return res.status(401).json({ error: 'invalid signature' });
  }

  const eventId = crypto.randomUUID();
  const message = {
    eventId,
    tenantId,
    eventType: req.body.type,
    payload: req.body,
    receivedAt: new Date().toISOString()
  };

  await sqs.send(new SendMessageCommand({
    QueueUrl: queueUrl,
    MessageBody: JSON.stringify(message)
  }));

  return res.status(202).json({ accepted: true, eventId });
});

app.listen(3000);
Watch out: If you return 200 before durable persistence, you have acknowledged an event you may never deliver. That is not high availability; it is silent loss.
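One refinement worth considering: the handler above assigns a random UUID per receipt, so if the sender retries its own POST the same event is enqueued twice under different IDs. A hedged alternative is deriving the eventId from stable request content, so duplicate receipts collapse to one identity. This is a sketch; if your senders supply their own delivery ID, prefer that instead.

```javascript
import crypto from 'node:crypto';

// Derive a deterministic event ID from the tenant and the raw body, so
// sender-side retries of the identical POST map to the same eventId.
// Assumes the raw body is byte-for-byte stable across sender retries.
function deriveEventId(tenantId, rawBody) {
  return crypto
    .createHash('sha256')
    .update(tenantId)
    .update('\n')
    .update(rawBody)
    .digest('hex')
    .slice(0, 32);
}
```

Downstream, duplicate receipts then hit the same idempotency key, so the delivery worker and the subscriber both see them as one event.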

Step 2: Create the source queue and DLQ

Amazon SQS DLQs work by attaching a redrive policy to the source queue. AWS defines maxReceiveCount as the number of receives before SQS moves the message to the DLQ. AWS also recommends keeping the source queue and DLQ in the same account and Region for best performance. For standard queues, AWS documents that the message expiration is still based on the original enqueue timestamp, which is why the DLQ retention period should be longer than the source queue retention period.

Create the DLQ first, then the source queue. The source queue below uses long polling, a longer visibility timeout, and a redrive policy that moves a message after repeated failures.

aws sqs create-queue \
  --queue-name webhook-delivery-dlq \
  --attributes MessageRetentionPeriod=1209600

aws sqs create-queue \
  --queue-name webhook-delivery \
  --attributes file://queue-attributes.json

queue-attributes.json:
{
  "VisibilityTimeout": "120",
  "ReceiveMessageWaitTimeSeconds": "20",
  "MessageRetentionPeriod": "345600",
  "RedrivePolicy": "{\"deadLetterTargetArn\":\"arn:aws:sqs:us-east-1:123456789012:webhook-delivery-dlq\",\"maxReceiveCount\":\"5\"}"
}

Use these settings as a starting point, not dogma:

  • VisibilityTimeout: longer than your normal delivery attempt, so the same message does not reappear while a worker is still processing it.
  • ReceiveMessageWaitTimeSeconds: enables long polling to reduce empty receives.
  • maxReceiveCount: high enough to absorb transient subscriber failures, low enough to quarantine poison messages promptly.
  • DLQ retention: longer than the source queue retention for standard queues, per AWS guidance.
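The relationships between these settings can be checked mechanically before you apply them. The helper below is a hypothetical sketch, not an AWS API: it takes the parsed queue-attributes.json for the source queue plus the DLQ's retention and flags the two easiest misconfigurations.

```javascript
// Sanity-check SQS source-queue attributes against the DLQ before deploy.
// `attrs` is the parsed queue-attributes.json; `dlqRetentionSeconds` is the
// MessageRetentionPeriod configured on the DLQ.
function checkQueueConfig(attrs, dlqRetentionSeconds) {
  const problems = [];
  const redrive = JSON.parse(attrs.RedrivePolicy || '{}');
  const maxReceiveCount = Number(redrive.maxReceiveCount || 0);

  if (!redrive.deadLetterTargetArn) {
    problems.push('no RedrivePolicy: failed messages will never reach a DLQ');
  }
  if (maxReceiveCount < 2) {
    problems.push('maxReceiveCount < 2: a single transient failure dead-letters the message');
  }
  if (dlqRetentionSeconds <= Number(attrs.MessageRetentionPeriod)) {
    problems.push('DLQ retention should exceed source retention (expiry uses the original enqueue time)');
  }
  return problems;
}
```

Running this in a pre-deploy check keeps the "DLQ retention longer than source retention" rule from regressing silently when someone edits the attributes file.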

Step 3: Process deliveries with safe retries

Your worker is where availability is won or lost. It should treat delivery as an idempotent operation, include a stable event identifier, and fail retryable attempts without deleting the message, so SQS re-drives it after the visibility timeout. A typical policy is:

  • Success on any 2xx response.
  • Retry on timeouts, connection errors, and 5xx responses.
  • Retry on 429, usually with backoff that honors any Retry-After header.
  • Consider 4xx like 400 or 404 terminal unless your integration contract says otherwise.

import crypto from 'node:crypto';
import { SQSClient, DeleteMessageCommand, ReceiveMessageCommand } from '@aws-sdk/client-sqs';

const sqs = new SQSClient({ region: process.env.AWS_REGION });
const queueUrl = process.env.WEBHOOK_QUEUE_URL;

function signPayload(body, secret) {
  return crypto.createHmac('sha256', secret).update(body).digest('hex');
}

async function deliver(message) {
  const subscriber = await getSubscriberConfig(message.tenantId);
  const body = JSON.stringify(message.payload);

  const response = await fetch(subscriber.url, {
    method: 'POST',
    headers: {
      'content-type': 'application/json',
      'x-event-id': message.eventId,
      'x-webhook-signature': signPayload(body, subscriber.secret)
    },
    body,
    signal: AbortSignal.timeout(8000)
  });

  if (response.status >= 200 && response.status < 300) return;
  if (response.status === 429 || response.status >= 500) {
    throw new Error(`retryable status ${response.status}`);
  }

  throw new Error(`terminal status ${response.status}`);
}

async function poll() {
  const result = await sqs.send(new ReceiveMessageCommand({
    QueueUrl: queueUrl,
    MaxNumberOfMessages: 10,
    WaitTimeSeconds: 20
  }));

  for (const item of result.Messages || []) {
    const message = JSON.parse(item.Body);

    try {
      await deliver(message);
      await markDelivered(message.eventId);
      await sqs.send(new DeleteMessageCommand({
        QueueUrl: queueUrl,
        ReceiptHandle: item.ReceiptHandle
      }));
    } catch (err) {
      // Leave the message on the queue: skipping DeleteMessage lets it
      // reappear after the visibility timeout, and SQS counts the receive
      // toward maxReceiveCount. Keep processing the rest of the batch
      // instead of rethrowing, so one bad message does not stall the others.
      await recordAttemptFailure(message.eventId, String(err));
    }
  }
}

// Run polls back to back rather than on a fixed interval, so a 20-second
// long poll never overlaps the next one.
(async function run() {
  for (;;) {
    await poll().catch(console.error);
  }
})();
Pro tip: If you run this consumer on AWS Lambda with SQS, enable ReportBatchItemFailures so successful messages in a batch are not retried with the failed ones. AWS documents this as the preferred partial batch response behavior for SQS event sources.
aws lambda update-event-source-mapping \
  --uuid a1b2c3d4-5678-90ab-cdef-11111EXAMPLE \
  --function-response-types ReportBatchItemFailures

Step 4: Verify expected behavior

Now test three paths: the happy path, a transient failure, and a poison message. You want evidence that ingestion stays fast even when delivery is not.

1. Send a test event

curl -X POST http://localhost:3000/webhooks/acme \
  -H 'content-type: application/json' \
  -H 'x-signature: REPLACE_WITH_VALID_SIGNATURE' \
  -d '{"type":"invoice.paid","invoiceId":"inv_123"}'

Expected ingress response:

{
  "accepted": true,
  "eventId": "9d7c0d58-0b0a-4e2a-bb86-3cf77a0f0bde"
}
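The placeholder x-signature can be computed with the same HMAC scheme the ingress verifies, over the exact byte-for-byte body the curl command sends. A small helper, assuming the same INGRESS_SIGNING_SECRET the server uses (the 'dev-secret' fallback here is only for local experiments):

```javascript
import crypto from 'node:crypto';

// Compute the x-signature header for a test request. The body string must
// match the curl -d payload exactly; any whitespace difference changes the HMAC.
function signBody(body, secret) {
  return crypto.createHmac('sha256', secret).update(body).digest('hex');
}

console.log(signBody(
  '{"type":"invoice.paid","invoiceId":"inv_123"}',
  process.env.INGRESS_SIGNING_SECRET ?? 'dev-secret'
));
```

Paste the printed hex digest into the x-signature header of the curl command; a 401 from the ingress after that almost always means the body bytes differ from what was signed.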

2. Point the subscriber at a temporary failure endpoint

Expected behavior:

  • The ingress endpoint still returns quickly with 202.
  • The source queue message remains in flight until the visibility timeout expires.
  • The worker logs repeated retryable failures.
  • After maxReceiveCount is exceeded, the message appears in the DLQ.

3. Inspect metrics

  • Source queue backlog should rise during downstream outages, then drain when workers recover.
  • DLQ message count should stay near zero during healthy operation.
  • Message age should remain within your delivery SLO.

Troubleshooting top 3

1. Messages never reach the DLQ

  • Check that the source queue actually has a RedrivePolicy.
  • Confirm the worker is failing the message instead of swallowing exceptions.
  • Verify maxReceiveCount is not set unrealistically high for your test.

2. The same event is delivered twice

  • Assume at-least-once delivery and add idempotency checks in the subscriber.
  • Persist delivery state by eventId before applying side effects.
  • For Lambda + SQS, enable partial batch responses so one bad record does not replay a whole batch.
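On the subscriber side, the dedup check can be as simple as recording each x-event-id before applying side effects. The sketch below uses an in-memory map purely for illustration; in production this belongs in a durable store with a conditional write, such as a unique-key insert.

```javascript
// Minimal idempotency guard: returns true the first time an event ID is
// seen within the TTL window, false on replays. `now` is injectable for
// testing. Swap the Map for a durable store in real deployments.
const seen = new Map();

function firstDelivery(eventId, ttlMs = 24 * 60 * 60 * 1000, now = Date.now) {
  const t = now();
  const expiry = seen.get(eventId);
  if (expiry !== undefined && expiry > t) return false;
  seen.set(eventId, t + ttlMs);
  return true;
}
```

In the subscriber's handler, a replay should still be acknowledged with a 2xx (for example, `if (!firstDelivery(req.get('x-event-id'))) return res.status(200).end();`) so the sender stops retrying an event that was already applied.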

3. Queue depth grows but workers look healthy

  • Compare worker concurrency to downstream rate limits and response latency.
  • Increase visibility timeout if long-running requests are reappearing too early.
  • Check for slow DNS, TLS, or network egress problems rather than application exceptions alone.

What's next

Once the base pattern is stable, harden it in layers:

  • Add per-tenant rate limiting so one noisy subscriber does not consume all worker capacity.
  • Store attempt history and expose a replay endpoint for operators.
  • Emit structured logs with eventId, tenant, attempt number, latency, and terminal reason.
  • Add circuit breaking for chronically failing endpoints and alert on DLQ depth.
  • If you need stronger ordering semantics, evaluate FIFO carefully; AWS notes that DLQs can break exact order expectations.

The main architectural decision does not change: keep the public webhook edge thin, make delivery asynchronous, and treat the DLQ as an operational workflow rather than a trash can. That is the difference between a webhook system that survives downstream chaos and one that turns every customer outage into your outage.

Frequently Asked Questions

Why do webhook systems need a dead letter queue?
A dead letter queue isolates messages that keep failing after bounded retries. That keeps poison messages from blocking the main delivery flow and gives operators a safe place to inspect, fix, and replay events.
What should count as a retryable webhook failure?
Network timeouts, connection resets, 429, and most 5xx responses should usually be retried. Treat terminal 4xx responses carefully, because retrying a bad URL or invalid payload often just burns capacity.
How many retries should I allow before sending a webhook to the DLQ?
There is no universal number, but maxReceiveCount should be high enough to absorb short downstream incidents and low enough to quarantine poison messages quickly. Start from your subscriber SLOs and failure patterns, then tune from production data rather than guesswork.
Can I guarantee exactly-once webhook delivery with SQS and a DLQ?
Not in the strict distributed-systems sense. Design for at-least-once delivery and make both the sender and subscriber idempotent with stable event IDs, deduplication checks, and safe replay behavior.
