Why Stripe webhooks silently fail (and how to fix it)
Stripe's webhook system is solid. The failures that bite SaaS teams aren't on Stripe's side — they're hidden in your stack. Here are the five real-world failure modes, ranked by frequency, with the fix for each.
Published 2026-04-15 · Last updated 2026-05-01
Stripe sends webhooks for every meaningful payment event: checkout.session.completed, invoice.payment_succeeded, customer.subscription.deleted. Their delivery system is good: failed deliveries are retried with exponential backoff for up to three days, every payload is signed with HMAC-SHA256, and every event carries a unique ID you can deduplicate on. So why do paid orders periodically show up with no matching record in your database?
The honest answer: your stack lost the event during a window Stripe's retries couldn't cover. Below are the five failure modes I see most often in production SaaS, in rough order of frequency.
1. The deploy that took 90 seconds
You ship a release. Phoenix takes 30 seconds to compile, 30 to drain, 30 to start. During those 90 seconds your reverse proxy answers /webhooks/stripe with 502. Stripe sees the 502 and retries, but only after a backoff. If you happened to deploy during a checkout spike, you'll get 4-5 retries arriving over the next half hour, when your app is healthy again. The events get processed.
But here's the subtle part: Stripe's retry schedule is per-event, not per-endpoint, and every failed attempt pushes the next one further out on the backoff curve. An event whose retries keep landing in bad windows eventually exhausts its schedule, and then it's gone. For most teams this is vanishingly rare. For teams that ship 10x a day, the odds add up.
2. The Postgres connection pool exhaustion
Your webhook handler accepts the request but hangs trying to write to Postgres, because the pool is exhausted by a slow report query. Eventually the request times out and Stripe sees a 504 from your reverse proxy. Stripe retries. But your handler already half-processed the event: it wrote to one table but not another. Now you're in a partial-state mess.
The fix isn't 'tune the connection pool.' It's to separate webhook ingestion from webhook processing: persist the raw payload first, return 200 to Stripe, then do the actual work asynchronously.
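A minimal sketch of that split, assuming Phoenix with an Ecto-backed StripeEvent schema (jsonb payload column) and a hypothetical ProcessStripeEvent Oban worker:

```elixir
defmodule MyAppWeb.StripeWebhookController do
  use MyAppWeb, :controller

  # Ingestion only: persist the raw event, enqueue the work, return 200.
  # Assumes the raw body is still readable here (route mounted before
  # Plug.Parsers, or a caching body reader configured on the parser).
  def create(conn, _params) do
    {:ok, raw_body, conn} = Plug.Conn.read_body(conn)
    event = Jason.decode!(raw_body)

    MyApp.Repo.transaction(fn ->
      # Unique index on stripe_event_id turns redelivery into a no-op.
      %MyApp.StripeEvent{}
      |> MyApp.StripeEvent.changeset(%{
        stripe_event_id: event["id"],
        type: event["type"],
        payload: event
      })
      |> MyApp.Repo.insert!(on_conflict: :nothing, conflict_target: :stripe_event_id)

      # The real work happens later, with retries under our control.
      Oban.insert!(MyApp.ProcessStripeEvent.new(%{stripe_event_id: event["id"]}))
    end)

    send_resp(conn, 200, "ok")
  end
end
```

The transaction matters: the event row and the job row commit together, so you can never enqueue work for an event you didn't store.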
3. The n8n / Make container that restarted
Many teams route Stripe through n8n before it reaches the app: Stripe → n8n → app. Self-hosted n8n is a single Node process, and process restarts (deploys, OOM kills) take 10-60 seconds. Stripe won't wait that long before marking the delivery attempt failed.
The pattern that fixes it: put a buffer in front of n8n. Stripe → buffer → n8n. The buffer's only job is to durably accept and acknowledge. n8n becomes a downstream consumer that the buffer retries against.
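The delivery half of that buffer can be one Oban worker that keeps retrying until n8n answers. A sketch, assuming the Req HTTP client and the StripeEvent table from the previous sketch; the n8n URL and module names are illustrative:

```elixir
defmodule MyApp.ForwardToN8n do
  # 20 attempts on Oban's default exponential backoff spreads retries
  # over days, far past any 60-second n8n restart.
  use Oban.Worker, queue: :webhooks, max_attempts: 20

  # Illustrative internal n8n webhook URL.
  @n8n_url "https://n8n.internal.example.com/webhook/stripe"

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"stripe_event_id" => id}}) do
    event = MyApp.Repo.get_by!(MyApp.StripeEvent, stripe_event_id: id)

    case Req.post(@n8n_url, json: event.payload) do
      {:ok, %Req.Response{status: status}} when status in 200..299 ->
        :ok

      # Anything else tells Oban to retry on its backoff schedule.
      {:ok, %Req.Response{status: status}} ->
        {:error, "n8n returned #{status}"}

      {:error, reason} ->
        {:error, reason}
    end
  end
end
```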
4. The retry-window expiry on a long incident
Stripe retries for up to three days. If your incident outlasts that window (a database failover gone wrong, a long DNS issue), events from the start of the incident expire before your stack is healthy again. Stripe stops pushing them for good; nothing will redeliver them unless you go fetch them from the Events API yourself.
This is rare but catastrophic when it happens. The defense is the same: durably store the raw payload at the edge, before any business logic runs, in a way that's independent of your application's health.
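A sketch of that edge store as an Ecto migration; table and column names are illustrative:

```elixir
defmodule MyApp.Repo.Migrations.CreateStripeEvents do
  use Ecto.Migration

  def change do
    create table(:stripe_events) do
      add :stripe_event_id, :string, null: false  # Stripe's evt_... id
      add :type, :string, null: false             # e.g. checkout.session.completed
      add :payload, :map, null: false             # full raw event, stored as jsonb
      add :processed_at, :utc_datetime            # null until a worker succeeds

      timestamps()
    end

    # Stripe may deliver the same event more than once; the unique index
    # makes redelivery a cheap no-op instead of a duplicate row.
    create unique_index(:stripe_events, [:stripe_event_id])
  end
end
```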
5. The handler bug that 4xx'd a class of events
You shipped a code change that incorrectly 4xx'd events with a specific shape, say subscriptions with a trial period. Stripe reads the 4xx as your endpoint's verdict on those events, so the deliveries fail the same way every time. By the time you notice (a customer ticket?), the retry window has lapsed and the events are gone from Stripe.
Defense: never return 4xx from a webhook handler unless you're 100% sure the payload is malformed. Even better: always return 200 from the ingestion layer and surface handler errors out-of-band so retries are under your control, not Stripe's.
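Applied to the ingestion controller from failure mode 2, the policy is a few lines; persist_event!/1 and enqueue_processing!/1 are hypothetical helpers wrapping the earlier sketch's logic:

```elixir
defmodule MyAppWeb.StripeWebhookController do
  use MyAppWeb, :controller
  require Logger

  def create(conn, _params) do
    {:ok, raw_body, conn} = Plug.Conn.read_body(conn)

    # If persisting raises (Postgres down), the resulting 500 is what we
    # want: Stripe's retries are the safety net until the event is stored.
    event = persist_event!(raw_body)  # hypothetical helper

    try do
      enqueue_processing!(event)  # hypothetical helper
    rescue
      e ->
        # Once the raw payload is stored, a bug here must never become a
        # 4xx. Surface it out-of-band (logs, Sentry, a dead-letter row).
        Logger.error("stripe enqueue failed: #{Exception.message(e)}")
    end

    # Stripe always gets its 200; retries stay under our control.
    send_resp(conn, 200, "ok")
  end
end
```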
The pattern that fixes all five
Each failure mode has a different proximate cause but the same shape: your stack was unable to process the event during the window the sender allowed. The fix is to decouple ingestion from processing.
Concretely: put a tiny, dedicated, Postgres-backed service between Stripe and your application. It does three things: verify Stripe's HMAC signature, write the entire raw payload to Postgres, return 200. That's it. From there, a separate worker delivers to your real handler with retries that are under your control — not Stripe's.
- Persist before delivery (1 ms write, before any business logic runs)
- Acknowledge to the sender immediately so their retry counter doesn't tick
- Retry to your downstream with exponential backoff under your control
- On permanent failure, surface the payload in a UI so a human can edit and replay
This is exactly what Zalapier is. You can also build it yourself — it's about 200 lines of Elixir on top of Postgres and Oban. Whichever path you pick, the architectural pattern (decouple ingestion from processing) is the high-leverage fix.
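If you build it yourself, the one subtle piece is verifying Stripe's signature against the raw, unparsed body. A sketch of that check, following the t=...,v1=... header format Stripe documents; the 5-minute tolerance is a common choice, not a Stripe requirement:

```elixir
defmodule WebhookBuffer.StripeSignature do
  @moduledoc """
  Verifies a Stripe-Signature header against the raw request body:
  HMAC-SHA256 over "timestamp.body" with the endpoint's signing secret.
  """

  @tolerance_seconds 300  # reject stale timestamps to blunt replay attacks

  def valid?(raw_body, signature_header, secret) do
    with %{"t" => t, "v1" => v1} <- parse(signature_header),
         {ts, ""} <- Integer.parse(t),
         true <- abs(System.os_time(:second) - ts) <= @tolerance_seconds do
      expected =
        :crypto.mac(:hmac, :sha256, secret, "#{t}.#{raw_body}")
        |> Base.encode16(case: :lower)

      # Constant-time comparison; a plain == leaks timing information.
      Plug.Crypto.secure_compare(expected, v1)
    else
      _ -> false
    end
  end

  defp parse(header) do
    header
    |> String.split(",")
    |> Map.new(fn pair ->
      [k, v] = String.split(pair, "=", parts: 2)
      {k, v}
    end)
  end
end
```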
When you don't need this
If you ship once a week, your downstream is mature, and your webhook volume is low, Stripe's built-in retries will cover you. The buffer pattern is overkill until you have one of: high deploy frequency, fragile downstream (n8n, Make, your own legacy code path), webhook traffic above ~10k events/month, or a business model where a single missed event is genuinely expensive (subscriptions, payments).
If any of those describe you, a webhook buffer is the cheapest reliability investment you'll make this year.
Buffer your Stripe webhooks in 3 minutes
Zalapier persists every Stripe event before delivery. Free for 1,000 events/month. No credit card.
Start free →