Webhook delivery: the reliability gap between sending and delivered
A deep look at the reliability stack underneath webhook delivery, including transactional outbox, delivery workers, jittered backoff, circuit breakers, dead letter queues, HMAC signing, idempotency contracts, fan out, and the observability layer that makes all of it debuggable.
A webhook feels deceptively small. You fire a POST, the other side receives it, done. The problem is "the other side" is a server you don't control, running on infrastructure you can't see, potentially owned by a customer who hasn't paid their cloud bill. It will go down. It will time out. It will return 500 during their deployment window at 2 AM.
Most teams start with a naive implementation. They call the endpoint synchronously, retry once if it fails, and then spend the next six months adding reliability after production incidents. This is the system you would build if you thought about reliability first.
The failure modes you're actually designing against
Before designing anything, it helps to enumerate every way a single delivery attempt can fail, because each failure mode requires a different response.
Connection level failures are the cleanest. DNS resolution fails, the TCP handshake times out, or TLS negotiation errors. No data was sent, so it is safe to retry immediately.
Request level failures happen after connection. The endpoint accepts the connection but hangs without sending a response. This is the most dangerous case because you cannot tell if the payload was received and is being processed, or if the server crashed while reading the body. Your timeout fires eventually, but you are left uncertain.
Response level failures split into two categories that need different handling. 4xx responses mean the payload is wrong in a way the endpoint can describe, such as malformed JSON, a missing required field, or a changed endpoint URL. Retrying a 4xx is pointless and wastes resources. 5xx responses mean their system is having a problem that may be transient, such as database overload, an ongoing deploy, or memory pressure. These are safe to retry with backoff.
Semantic failures are the trickiest. The endpoint returns 200 OK, your worker marks the delivery as successful, and somewhere downstream their processing fails silently. From your perspective the delivery succeeded. From theirs it did not. This is why idempotency is the consumer's responsibility, not yours. You cannot know what happens after the 200.
Success with duplicate risk appears when your worker crashes after the endpoint processes the request but before your worker marks it as delivered. The queue redelivers. The consumer gets the same event twice. This is at least once delivery in action. You can guarantee delivery, but you cannot guarantee exactly once processing.
The transactional outbox: never lose an event
The most common first mistake is enqueuing the webhook event as part of the same operation that commits the business transaction, but doing both independently:
// In your order handler
await db.orders.insert(order) // commits to DB
await queue.send("webhook", { order }) // if this crashes, the event is lostIf the process dies between those two lines, the order exists in the database but the webhook was never enqueued. It's silently lost. The customer's integration never fires.
The fix is the transactional outbox pattern. Instead of writing directly to the queue, you write the webhook event to an outbox table in the same database transaction as the business entity. A separate process called the outbox relay reads unprocessed rows, publishes them to the queue, and then marks them processed.
// In your order handler — one atomic transaction
await db.transaction(async (tx) => {
await tx.orders.insert(order)
await tx.outbox.insert({
id: crypto.randomUUID(),
topic: "order.paid",
payload: JSON.stringify({ orderId: order.id, amount: order.total }),
created_at: new Date(),
published: false,
})
})The relay runs on a short interval and works as a simple loop: poll, publish, acknowledge.
async function relay() {
const rows = await db
.selectFrom("outbox")
.where("published", "=", false)
.orderBy("created_at", "asc")
.limit(100)
.forUpdate()
.skipLocked()
.execute()
for (const row of rows) {
await queue.send("webhook.deliver", { outboxId: row.id })
await db
.updateTable("outbox")
.set({ published: true, published_at: new Date() })
.where("id", "=", row.id)
.execute()
}
}
setInterval(relay, 500)The sequence looks like this:
The cost is a small polling interval and an extra table. The benefit is simple: you do not lose an event to a process crash. The event exists in the database until it is published.
The full delivery pipeline
With ingestion handled, the rest of the system separates cleanly into three concerns: delivery, retry scheduling, and failure handling.
A few things are worth noting in this diagram. Multiple workers read from the same queue concurrently. That is intentional, and the locking strategy inside the worker handles the race conditions. The 4xx path skips the retry loop entirely and goes straight to dead with no further attempts. The DLQ feeds back into the main queue only after explicit action, either from a customer triggering a retry in their dashboard or from an operator running a replay script.
Event state machine
Every webhook event moves through a defined set of states. Modeling these states explicitly in the database is not overhead. It is what makes debugging and observability questions answerable.
The distinction between failed_temp and failed_perm is load-bearing. Without it, your retry queue fills with 4xx events that will never succeed, crowding out real transient failures. Most implementations collapse these into a single failed state and then wonder why their retry queue is backed up with 12,000 events, all of which have attempt counts at the maximum.
The scheduled state is also worth calling out explicitly. Without it you can't query "how many events are waiting for their next retry attempt" — which is a dashboard metric you want before an outage, not after.
The delivery worker in detail
The worker is where the actual HTTP call happens, and it needs to handle more edge cases than the happy path suggests.
import { createHmac } from "node:crypto"
import { computeDelay } from "@/lib/backoff"
import { db } from "@/lib/db"
import { sign } from "@/lib/signing"
const MAX_ATTEMPTS = 7
const REQUEST_TIMEOUT_MS = 30_000
export async function processEvent(eventId: string): Promise<void> {
// Acquire exclusive lock — prevents concurrent workers from double-delivering
const event = await db.transaction(async (tx) => {
const row = await tx
.selectFrom("webhook_events as e")
.innerJoin("webhook_endpoints as ep", "ep.id", "e.endpoint_id")
.selectAll()
.where("e.id", "=", eventId)
.where("e.status", "in", ["pending", "scheduled"])
.forUpdate()
.skipLocked()
.executeTakeFirst()
if (!row) return null // another worker already has it
await tx
.updateTable("webhook_events")
.set({ status: "delivering", last_attempt_at: new Date() })
.where("id", "=", eventId)
.execute()
return row
})
if (!event) return
const result = await attempt(event)
await record(event, result)
}
async function attempt(event: WebhookEvent): Promise<AttemptResult> {
const body = JSON.stringify(event.payload)
const timestamp = Math.floor(Date.now() / 1000).toString()
const signature = sign(`${timestamp}.${body}`, event.endpoint_secret)
const controller = new AbortController()
const timeoutId = setTimeout(() => controller.abort(), REQUEST_TIMEOUT_MS)
const start = performance.now()
try {
const res = await fetch(event.endpoint_url, {
method: "POST",
headers: {
"Content-Type": "application/json",
"Webhook-Id": event.id,
"Webhook-Timestamp": timestamp,
"Webhook-Signature": `v1=${signature}`,
},
body,
signal: controller.signal,
})
const duration = Math.round(performance.now() - start)
const responseBody = await res.text().catch(() => null)
return {
success: res.ok,
statusCode: res.status,
responseBody,
duration,
error: null,
}
} catch (err) {
const duration = Math.round(performance.now() - start)
const isTimeout = err instanceof Error && err.name === "AbortError"
return {
success: false,
statusCode: null,
responseBody: null,
duration,
error: isTimeout ? "timeout" : String(err),
}
} finally {
clearTimeout(timeoutId)
}
}
async function record(event: WebhookEvent, result: AttemptResult): Promise<void> {
const nextAttempt = event.attempt_count + 1
const isPermanentFailure =
result.statusCode !== null && result.statusCode >= 400 && result.statusCode < 500
let nextStatus: EventStatus
let nextAttemptAt: Date | null = null
if (result.success) {
nextStatus = "delivered"
} else if (isPermanentFailure) {
nextStatus = "dead"
} else if (nextAttempt >= MAX_ATTEMPTS) {
nextStatus = "dead"
} else {
nextStatus = "scheduled"
nextAttemptAt = new Date(Date.now() + computeDelay(nextAttempt))
}
await db.transaction(async (tx) => {
await tx
.updateTable("webhook_events")
.set({
status: nextStatus,
attempt_count: nextAttempt,
next_attempt_at: nextAttemptAt,
})
.where("id", "=", event.id)
.execute()
await tx
.insertInto("delivery_attempts")
.values({
id: crypto.randomUUID(),
event_id: event.id,
attempt_number: nextAttempt,
response_status: result.statusCode,
response_body: result.responseBody?.slice(0, 4096) ?? null,
error_message: result.error,
duration_ms: result.duration,
attempted_at: new Date(),
})
.execute()
// Move to DLQ if dead
if (nextStatus === "dead") {
await tx
.insertInto("dead_letter_events")
.values({
id: crypto.randomUUID(),
event_id: event.id,
reason: isPermanentFailure ? "permanent_failure" : "max_attempts_exceeded",
created_at: new Date(),
})
.execute()
}
})
// Schedule next attempt via queue
if (nextStatus === "scheduled" && nextAttemptAt) {
await queue.sendDelayed("webhook.deliver", { eventId: event.id }, nextAttemptAt)
}
}The 30-second request timeout deserves a comment. The real failure mode is not a slow response — it's a response that never comes. Without the AbortController, your worker hangs forever, consuming a connection and blocking that worker thread from processing other events. 30 seconds is generous; most real integrations respond in under 5 seconds. But a timeout that fires too early causes more spurious retries than it saves.
Retry strategies: the comparison you need
There are four backoff strategies in common use. Understanding why each one falls short of the next is more useful than just knowing the formula for exponential backoff.
| Strategy | Formula | Problem |
|---|---|---|
| No backoff | retry immediately | Hammers a struggling endpoint, causing cascading failure |
| Constant delay | delay = N | Synchronises retry storms — all workers retry at the same instant |
| Linear backoff | delay = N × attempt | Better spread, but retries arrive too fast early and too slow late |
| Exponential backoff | delay = base × 2^attempt | Good shape, but still synchronises workers at each boundary |
| Exponential + jitter | delay = rand(0, base × 2^attempt) | Each worker draws a random delay — no synchronisation |
The jump from exponential to exponential-with-jitter is the one that matters at scale. Without jitter, you get the thundering herd problem: every worker whose event failed at t=0 retries at t=1s, then at t=3s, then at t=7s. Each boundary is a spike. The endpoint comes back online and gets hit by all of them simultaneously, which may knock it back down.
With full jitter, each worker independently draws a random delay from [0, min(cap, base × 2^attempt)]. The same 400 workers that all failed at t=0 retry uniformly distributed across the first second. No spikes.
The math:
With base = 1s and cap = 3600s, the maximum possible delays at each attempt are:
The actual delay is a random draw from [0, max] at each step — so an attempt 6 event might be scheduled anywhere from immediately to an hour from now. This is the right behavior: a customer whose endpoint has been down for 30 minutes doesn't need another attempt in 32 seconds. They need one in an hour when the on-call engineer wakes up.
const BASE_MS = 1_000
const CAP_MS = 3_600_000 // 1 hour
/**
* Full jitter exponential backoff.
* attempt is zero-indexed — attempt 0 is the first retry after failure.
*/
export function computeDelay(attempt: number): number {
const ceiling = Math.min(CAP_MS, BASE_MS * Math.pow(2, attempt))
return Math.floor(Math.random() * ceiling)
}
/**
* Decorrelated jitter — an alternative that tends to produce
* slightly higher average delays, better for high-contention systems.
*/
export function computeDecorrelatedDelay(attempt: number, prevDelay: number): number {
const min = BASE_MS
const max = prevDelay * 3
return Math.floor(Math.random() * (max - min) + min)
}The retry sequence for a 7-attempt budget looks like this, using the capped values as upper bounds:
Circuit breaker per endpoint
Exponential backoff handles individual event failures. It doesn't handle the case where an entire endpoint is down. If a customer's server goes offline and they have 10,000 events in the queue, every single one of those events will attempt delivery, fail, and schedule a retry. Your queue fills up with their failures. Other customers' events slow down.
The fix is a circuit breaker at the endpoint level. After a threshold of consecutive failures, you open the circuit — temporarily pause all delivery to that endpoint — and stop processing their events until the circuit closes again.
const FAILURE_THRESHOLD = 5
const PROBE_INTERVAL_MS = 60_000 // try again after 1 minute
export async function getCircuitState(endpointId: string): Promise<CircuitState> {
const endpoint = await db
.selectFrom("webhook_endpoints")
.select(["circuit_state", "consecutive_failures", "circuit_opened_at"])
.where("id", "=", endpointId)
.executeTakeFirstOrThrow()
if (endpoint.circuit_state === "open") {
const elapsed = Date.now() - endpoint.circuit_opened_at!.getTime()
if (elapsed >= PROBE_INTERVAL_MS) {
await db
.updateTable("webhook_endpoints")
.set({ circuit_state: "half_open" })
.where("id", "=", endpointId)
.execute()
return "half_open"
}
}
return endpoint.circuit_state
}
export async function recordOutcome(endpointId: string, success: boolean): Promise<void> {
if (success) {
await db
.updateTable("webhook_endpoints")
.set({
circuit_state: "closed",
consecutive_failures: 0,
circuit_opened_at: null,
})
.where("id", "=", endpointId)
.execute()
return
}
const endpoint = await db
.selectFrom("webhook_endpoints")
.select(["consecutive_failures"])
.where("id", "=", endpointId)
.executeTakeFirstOrThrow()
const newCount = endpoint.consecutive_failures + 1
await db
.updateTable("webhook_endpoints")
.set({
consecutive_failures: newCount,
circuit_state: newCount >= FAILURE_THRESHOLD ? "open" : "closed",
circuit_opened_at: newCount >= FAILURE_THRESHOLD ? new Date() : null,
})
.where("id", "=", endpointId)
.execute()
}When the circuit is open, the worker skips delivery and requeues the event without incrementing the attempt counter — the event shouldn't burn through its retry budget while the endpoint is simply down. When the circuit transitions to half_open, one probe attempt goes through. If it succeeds, the circuit closes and all queued events resume delivery.
Fan-out: one event to multiple endpoints
A single business event — order.paid — might need to go to multiple customer endpoints. A customer may have registered for both a production URL and a staging URL. A platform might have multiple subscribers to the same event type.
The fan-out happens at the event level, not the delivery level. One canonical event record spawns multiple delivery records — one per endpoint.
export async function fanOut(outboxId: string): Promise<void> {
const outbox = await db.selectFrom("outbox").where("id", "=", outboxId).executeTakeFirstOrThrow()
// Find all endpoints subscribed to this topic
const endpoints = await db
.selectFrom("webhook_endpoints")
.where("topic", "=", outbox.topic)
.where("status", "=", "active")
.execute()
// Create one delivery record per endpoint
const events = endpoints.map((ep) => ({
id: crypto.randomUUID(),
endpoint_id: ep.id,
outbox_id: outbox.id,
topic: outbox.topic,
payload: outbox.payload,
status: "pending" as const,
attempt_count: 0,
created_at: new Date(),
}))
if (events.length === 0) return
await db.transaction(async (tx) => {
await tx.insertInto("webhook_events").values(events).execute()
// Enqueue one delivery job per event
for (const event of events) {
await queue.send("webhook.deliver", { eventId: event.id })
}
})
}Each delivery record is independent. Endpoint A failing and going into retry does not affect delivery to endpoint B. The circuit breaker operates per-endpoint, not per-event. A slow consumer doesn't penalise a healthy one.
HMAC signing: the full implementation
Sending a webhook is only half the security story. The receiving end needs to verify that the payload came from you and was not tampered with in transit. HTTPS handles transport encryption. Signing handles payload authenticity.
The signing flow:
import { createHmac, timingSafeEqual } from "node:crypto"
/**
* Sign a webhook payload.
* The signing string includes the timestamp to prevent replay attacks.
*/
export function sign(body: string, timestamp: string, secret: string): string {
const toSign = `${timestamp}.${body}`
return createHmac("sha256", secret).update(toSign).digest("hex")
}
/**
* Verify an incoming webhook signature.
* Returns false if the signature is invalid OR if the timestamp is stale.
*/
export function verify(
body: string,
signature: string,
timestamp: string,
secret: string,
toleranceSeconds = 300 // 5 minutes
): boolean {
// Check timestamp freshness first — cheap operation
const ts = parseInt(timestamp, 10)
if (isNaN(ts)) return false
const age = Math.floor(Date.now() / 1000) - ts
if (Math.abs(age) > toleranceSeconds) return false
// Compute expected and compare in constant time
const expected = sign(body, timestamp, secret)
try {
return timingSafeEqual(
Buffer.from(expected, "hex"),
Buffer.from(signature.replace(/^v1=/, ""), "hex")
)
} catch {
// Buffer lengths differ — invalid signature format
return false
}
}Two things here that are easy to get wrong.
timingSafeEqual is not optional. A naive === comparison short-circuits on the first mismatched byte, which creates a timing side-channel. An attacker can make thousands of requests with slightly different signatures and measure response times to brute-force the correct signature one byte at a time. Constant-time comparison eliminates this attack vector entirely.
The timestamp in the signing string prevents replay attacks. Without it, an attacker who captures a valid signed request can replay it indefinitely — the signature will always be valid because nothing in it changes. With a five-minute tolerance window, captured requests expire quickly. Your consumer should also store the Webhook-Id of recently processed events and reject any duplicate IDs within the tolerance window.
app.post("/webhook", async (req, res) => {
const id = req.headers["webhook-id"] as string
const timestamp = req.headers["webhook-timestamp"] as string
const signature = req.headers["webhook-signature"] as string
const body = req.rawBody // must be the raw string, not parsed JSON
// Verify signature
const valid = verify(body, signature, timestamp, process.env.WEBHOOK_SECRET!)
if (!valid) return res.status(401).send("Invalid signature")
// Idempotency check — have we processed this event id before?
const seen = await redis.set(`webhook:seen:${id}`, "1", "EX", 86400, "NX")
if (!seen) return res.status(200).send("Already processed") // not 4xx — return 200
const event = JSON.parse(body)
await processEvent(event)
res.status(200).send("OK")
})Note the idempotency check returns 200, not 409. If your consumer returns 4xx on a duplicate, your delivery system will log a permanent failure and move the event to the DLQ — but the event was actually processed successfully the first time. Always 200 on duplicates.
Ordering guarantees: what you can and cannot promise
Webhooks are inherently out-of-order. With multiple workers reading from the same queue, there is no guarantee that order.created arrives before order.paid at the consumer. Event A can be delayed by two retry cycles while event B is delivered immediately.
Most webhook systems offer no ordering guarantee and document this clearly: "Events may arrive out of order. Use timestamps to reconstruct sequence."
If ordering matters for specific event types, the only reliable mechanism is per-entity FIFO queues — a separate queue per customer per entity (e.g., per orderId). A single worker processes each queue sequentially. This is the approach AWS SQS FIFO queues take.
The tradeoff is operational complexity. Per-entity queues mean potentially thousands of queues to manage. Most teams decide that ordering guarantees are not worth that complexity, document the lack of ordering explicitly, and provide consumers with event timestamps so they can reconstruct sequence on their side.
Dead-letter queues as a product surface
The DLQ is usually treated as an ops concern — a pile of failed events for engineering to dig through. The better mental model is treating it as a product surface that your customers interact with.
A customer's webhook handler was broken for two days. It's fixed now. They want to replay all the events they missed. Without a good DLQ story, they can't — they need to manually trigger the business actions from scratch, which may not be possible.
A customer-facing DLQ dashboard changes this completely:
// GET /api/webhooks/dead-letter?endpointId=ep_xxx&page=1
export async function listDeadLetterEvents(
endpointId: string,
page: number
): Promise<DeadLetterPage> {
const events = await db
.selectFrom("webhook_events as e")
.innerJoin("dead_letter_events as dl", "dl.event_id", "e.id")
.where("e.endpoint_id", "=", endpointId)
.where("e.status", "=", "dead")
.select(["e.id", "e.topic", "e.created_at", "dl.reason", "dl.created_at as failed_at"])
.orderBy("dl.created_at", "desc")
.limit(50)
.offset((page - 1) * 50)
.execute()
return { events, page }
}
// POST /api/webhooks/dead-letter/:eventId/retry
export async function retryDeadLetterEvent(eventId: string): Promise<void> {
await db.transaction(async (tx) => {
await tx
.updateTable("webhook_events")
.set({
status: "pending",
attempt_count: 0,
next_attempt_at: null,
})
.where("id", "=", eventId)
.where("status", "=", "dead")
.execute()
await tx.deleteFrom("dead_letter_events").where("event_id", "=", eventId).execute()
})
await queue.send("webhook.deliver", { eventId })
}The DLQ should store not just the final state but the full delivery history per event. Every attempt — status code, response body, error message, latency, timestamp. When a customer asks "why did event evt_abc123 fail?", you have the complete picture.
Full database schema
Everything discussed above maps to a concrete schema. The tables work together to give you full audit history, circuit breaker state, and DLQ management.
-- Outbox: transactional event staging
CREATE TABLE outbox (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
topic TEXT NOT NULL,
payload JSONB NOT NULL,
published BOOLEAN NOT NULL DEFAULT false,
published_at TIMESTAMPTZ,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_outbox_unpublished ON outbox (created_at)
WHERE published = false;
-- Endpoints: registered webhook URLs with circuit breaker state
CREATE TABLE webhook_endpoints (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
customer_id UUID NOT NULL REFERENCES customers(id),
url TEXT NOT NULL,
secret TEXT NOT NULL,
topic TEXT NOT NULL,
status TEXT NOT NULL DEFAULT 'active' CHECK (status IN ('active', 'paused', 'deleted')),
circuit_state TEXT NOT NULL DEFAULT 'closed' CHECK (circuit_state IN ('closed', 'open', 'half_open')),
consecutive_failures INTEGER NOT NULL DEFAULT 0,
circuit_opened_at TIMESTAMPTZ,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Events: one per outbox row per subscribed endpoint
CREATE TABLE webhook_events (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
endpoint_id UUID NOT NULL REFERENCES webhook_endpoints(id),
outbox_id UUID NOT NULL REFERENCES outbox(id),
topic TEXT NOT NULL,
payload JSONB NOT NULL,
status TEXT NOT NULL DEFAULT 'pending' CHECK (
status IN ('pending', 'delivering', 'delivered', 'scheduled', 'dead')
),
attempt_count INTEGER NOT NULL DEFAULT 0,
last_attempt_at TIMESTAMPTZ,
next_attempt_at TIMESTAMPTZ,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_events_scheduled ON webhook_events (next_attempt_at)
WHERE status = 'scheduled';
CREATE INDEX idx_events_pending ON webhook_events (created_at)
WHERE status = 'pending';
-- Delivery attempts: full history per event
CREATE TABLE delivery_attempts (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
event_id UUID NOT NULL REFERENCES webhook_events(id),
attempt_number INTEGER NOT NULL,
response_status INTEGER,
response_body TEXT,
error_message TEXT,
duration_ms INTEGER NOT NULL,
attempted_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_attempts_event ON delivery_attempts (event_id, attempt_number);
-- Dead-letter queue
CREATE TABLE dead_letter_events (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
event_id UUID NOT NULL REFERENCES webhook_events(id),
reason TEXT NOT NULL CHECK (reason IN ('permanent_failure', 'max_attempts_exceeded')),
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
expires_at TIMESTAMPTZ NOT NULL DEFAULT now() + INTERVAL '30 days'
);
-- Cleanup job: run daily
CREATE INDEX idx_dlq_expires ON dead_letter_events (expires_at);The partial indexes on webhook_events are worth the attention. The scheduler that picks up events for delivery only cares about rows in pending or scheduled state — indexing the full table would waste write overhead on the delivered majority. Partial indexes target exactly the rows that matter.
Concurrency model
How many workers should you run? The answer depends on your throughput requirements and your acceptable delivery latency.
A single worker processes events sequentially. If a delivery takes 5 seconds (slow endpoint) and you have 100 events queued, you're looking at 500 seconds of latency for the last event. The fix is horizontal scaling — more worker processes, each reading from the same queue, using SKIP LOCKED to avoid contention.
const CONCURRENCY = parseInt(process.env.WEBHOOK_CONCURRENCY ?? "10")
async function runPool() {
const workers = Array.from({ length: CONCURRENCY }, (_, i) => runWorker(i))
await Promise.allSettled(workers)
}
async function runWorker(id: number) {
console.log(`[worker:${id}] starting`)
while (true) {
const job = await queue.receive("webhook.deliver", { visibilityTimeout: 60 })
if (!job) {
await sleep(1000)
continue
}
try {
await processEvent(job.payload.eventId)
await queue.ack(job)
} catch (err) {
console.error(`[worker:${id}] unhandled error`, err)
await queue.nack(job) // return to queue
}
}
}The visibilityTimeout is the hidden concurrency control. When a worker receives a message, the queue hides it from other workers for 60 seconds. If the worker doesn't acknowledge within that window, the queue makes the message visible again — another worker can pick it up. Set the timeout to at least your request timeout plus processing overhead, otherwise the queue redelivers messages that are still in-flight.
Observability: what to measure and alert on
A webhook system with no observability is a black box that will surprise you in production. The metrics that matter are specific to the delivery lifecycle.
| Metric | Query | Alert threshold |
|---|---|---|
| Pending + scheduled queue depth | COUNT(*) WHERE status IN ('pending', 'scheduled') | Growing consistently over 5min |
| Delivery latency (p95) | Time from created_at to delivered | > 60s for non-retry events |
| DLQ growth rate | New rows in dead_letter_events per minute | > baseline |
| Per-endpoint failure rate | failed / total per endpoint_id per hour | > 50% for 15min |
| Circuit breaker opens | State transitions to open | Any transition |
| Attempt distribution | COUNT(*) GROUP BY attempt_count | High concentration at max attempts |
The attempt distribution histogram is the most underrated metric. A healthy system has almost all events delivered on the first or second attempt. If you see a large spike at attempt 7 (the maximum), that's a signal that a significant proportion of events are burning through their entire retry budget — which means either the cap is too low or some endpoints are systematically unhealthy.
The per-endpoint failure rate surfaces noisy neighbors before they affect everyone else. One customer with a broken endpoint shouldn't slow down delivery for your other 500 customers — but it will if you don't isolate it.
export async function emitDeliveryMetrics() {
const [queueDepth, dlqDepth, attemptDist] = await Promise.all([
db
.selectFrom("webhook_events")
.select(db.fn.count("id").as("count"))
.where("status", "in", ["pending", "scheduled"])
.executeTakeFirst(),
db
.selectFrom("dead_letter_events")
.select(db.fn.count("id").as("count"))
.where("created_at", ">", new Date(Date.now() - 3_600_000))
.executeTakeFirst(),
db
.selectFrom("delivery_attempts")
.select(["attempt_number", db.fn.count("id").as("count")])
.where("attempted_at", ">", new Date(Date.now() - 3_600_000))
.groupBy("attempt_number")
.execute(),
])
metrics.gauge("webhook.queue_depth", Number(queueDepth?.count ?? 0))
metrics.gauge("webhook.dlq_depth_1h", Number(dlqDepth?.count ?? 0))
for (const row of attemptDist) {
metrics.gauge("webhook.attempt_distribution", Number(row.count), {
attempt: String(row.attempt_number),
})
}
}The idempotency contract you owe your consumers
At-least-once delivery is not a bug. It is a documented guarantee: you will deliver every event, but you may deliver it more than once. The consumer's job is to handle duplicates gracefully.
The contract you need to communicate clearly:
- Every webhook request carries a stable
Webhook-Idheader. This ID is the same across all retry attempts for the same event. - Your consumer should store processed IDs and skip any request whose ID was already processed.
- A
200response on a duplicate is correct. Do not return4xx— that signals a permanent failure and moves the event to the DLQ. - Events may arrive out of order. Use
created_attimestamps in the payload to reconstruct sequence.
Document this on your developer dashboard, not buried in a footnote. The most common integration bug is a consumer that processes duplicates and charges a customer twice, then blames your webhook system. The idempotency key was there. They didn't use it.
What the system looks like end to end
Putting everything together, a production-ready webhook delivery system has seven distinct layers, each with a clear responsibility:
Each layer is independently scalable. The outbox relay can run as a single process. The fan-out worker can run as a single process. The delivery worker pool scales horizontally. The circuit breaker state lives in the database and is shared across all worker instances automatically.