Architecture · 6 min read

Webhook retry strategy that actually works

Most webhook handlers retry too aggressively or too timidly. Here's the schedule that survives real-world incidents without amplifying them.

Published 2026-04-22 · Last updated 2026-05-01

The default retry strategy in most internal-tooling code looks like: try, sleep 1 second, try again, sleep 2 seconds, give up. That schedule was fine for HTTP calls between two healthy services. It's actively harmful for webhook delivery to a downstream that might be in a deploy or a partial outage.

Why aggressive retries make outages worse

If your downstream is briefly unavailable because it's restarting, hammering it with retries every 1-5 seconds delays its recovery. The downstream's startup is probably I/O bound (loading config, opening Postgres connections). Your retries compete with that startup work. You're also generating more inbound traffic to a stack that's struggling.

Worse: if you have N webhooks waiting to deliver and they're all retrying every 5 seconds, you've created a self-DDoS pattern. The downstream comes up, gets hit with N parallel requests, falls over, gets retried, and falls over again.

Why timid retries miss the recovery window

The opposite mistake: retry after 10 minutes, then 1 hour, then give up. You miss the 30-second window during which the downstream actually recovered, and you stop retrying long before the 4-hour incident is over.

The schedule that works

After a few hundred production incidents across different SaaS architectures, the schedule that actually covers the realistic failure modes is exponential backoff with a cap, plus jitter:

  • Attempt 1: immediate (the original delivery)
  • Attempt 2: 1 minute (covers 95% of brief restarts)
  • Attempt 3: 5 minutes (covers most incident-recovery windows)
  • Attempt 4: 15 minutes (covers DB failovers, long restarts)
  • Attempt 5: 1 hour (covers slower incidents)
  • Attempt 6: 4 hours (covers extended outages)
  • After attempt 6: surface in UI for manual replay

Add ±10% jitter to each interval so simultaneous retries don't synchronize. Cap total attempts at 6; after that, a human should look at the payload and decide whether to retry, edit-and-replay, or write it off.

Why this specific schedule

1 minute is long enough that a rolling Phoenix release will have finished. 5 minutes covers most pod-eviction and Postgres-failover scenarios. 15 minutes catches longer DB issues. 1 hour and 4 hours cover the slow-burn incident class. Beyond 4 hours you have a real outage that needs human intervention regardless.

Crucially: each retry arrives at a pace your downstream can absorb, because the gaps grow geometrically. You're not amplifying the incident.

Idempotency is non-negotiable

This schedule only works if your downstream handler is idempotent. It is your handler's job to deduplicate based on the event's stable identifier (Stripe event ID, Polar event ID, etc.). Without idempotency, retries cause double-charged customers, duplicate emails, and other unrecoverable side effects.

Pair this retry schedule with: a stable event ID stored in your DB, a uniqueness constraint on it, and a 'have I seen this event before' check at the top of every handler. With those three, retries are safe.
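Those three pieces can be sketched together. This is a minimal illustration using SQLite; the table name, function names, and the `do_side_effects` placeholder are all ours, and in a real handler you'd use your production database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The uniqueness constraint on the provider's stable event ID
# (PRIMARY KEY implies UNIQUE).
conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")

def do_side_effects(payload: dict) -> None:
    pass  # placeholder for the real work: charge, email, etc.

def handle_event(event_id: str, payload: dict) -> bool:
    """Returns True if the event was processed, False if it was a duplicate."""
    try:
        # The 'have I seen this event before' check: the INSERT fails
        # atomically on a duplicate, so two concurrent retries can't
        # both proceed past this point.
        with conn:
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)",
                (event_id,),
            )
    except sqlite3.IntegrityError:
        return False  # already seen; the retry is a safe no-op
    do_side_effects(payload)
    return True
```

One design note: here the marker is written before the side effects run, so a crash in between would drop the event. In production you'd run the marker insert and the side effects inside the same database transaction so they commit or fail together.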

Where this fits in

If you're using Zalapier, Hookdeck, or Svix, the retry schedule is built in. If you're rolling your own, 200 lines of Oban + Postgres can implement this faithfully. The schedule above is what Zalapier ships by default.

Skip the implementation work

Zalapier ships this exact retry schedule out of the box. Free for 1,000 events/month.

Start free →
