Webhook Architecture for Reliable Integrations
Webhooks are the backbone of real-time integrations. When a customer pays on Stripe, a webhook notifies your CRM. When a support ticket closes in Zendesk, a webhook updates your customer health score. When a pull request merges on GitHub, a webhook triggers a deployment.
Simple concept. Deceptively difficult to get right.
The problem with webhooks is that they fail silently. A dropped webhook doesn't throw an error in your application — it just doesn't arrive. Data goes missing. Records fall out of sync. And nobody notices until a customer complains or a quarterly reconciliation reveals discrepancies that have been accumulating for months.
Here's how to build webhook infrastructure that doesn't lose data.
Why Webhooks Fail Silently
Webhooks fail at four points, and each failure mode is invisible by default.
1. Network Failures
The sender fires a POST request to your endpoint. If your server is down, slow, or returns a non-2xx status code, the webhook fails. Most providers retry — but retry behavior varies wildly:
| Provider | Retry Attempts | Retry Window | Backoff Strategy |
|---|---|---|---|
| Stripe | 16 | 3 days | Exponential |
| GitHub | 3 | ~1 hour | Fixed intervals |
| HubSpot | 10 | 24 hours | Exponential |
| Slack | 3 | ~30 minutes | Exponential |
| Shopify | 19 | 48 hours | Exponential |
If your server is down for longer than GitHub's roughly one-hour retry window, you'll lose those events permanently once the 3 retries are exhausted. Stripe's 3 days of retries are far more forgiving. Your architecture needs to account for the least forgiving provider in your stack.
2. Timeout Failures
Most webhook providers expect a response within 5-30 seconds. If your endpoint does heavy processing before responding — querying a database, calling another API, running business logic — it will timeout. The provider interprets the timeout as a failure and retries, potentially causing duplicate processing.
3. Payload Validation Failures
Webhook payloads change. A provider adds a new field, changes a field type from string to integer, or nests data differently in a new API version. If your parsing code is strict, it breaks on the new payload shape and silently drops events.
4. Ordering and Duplication
Webhooks arrive out of order. A payment.succeeded event might arrive before payment.created. Retry storms can deliver the same event multiple times. Without idempotency handling, you process events twice — creating duplicate records, sending duplicate emails, or charging customers double.
The Reliable Webhook Architecture
The core principle: acknowledge immediately, process asynchronously. Your webhook endpoint should do exactly two things: verify the signature and enqueue the payload. Everything else happens in a background worker.
```
                  ┌────────────────┐
Provider ──POST──→│ Endpoint       │
                  │ 1. Verify sig  │
                  │ 2. Enqueue     │
                  │ 3. Return 200  │
                  └───────┬────────┘
                          │
                  ┌───────▼────────┐
                  │ Queue (SQS /   │
                  │ Redis+BullMQ)  │
                  └───────┬────────┘
                          │
                  ┌───────▼────────┐
                  │ Worker         │
                  │ 1. Dedup       │
                  │ 2. Process     │
                  │ 3. Log         │
                  └────────────────┘
```
Signature Verification
Every reputable webhook provider signs payloads with an HMAC or asymmetric signature. Verify it before processing. This prevents forged webhook attacks — an attacker who knows your endpoint URL could send fake events.
Stripe, GitHub, and HubSpot all sign with HMAC-SHA256 (HubSpot keys the hash with your client secret). The verification details differ per provider, but the principle is the same: compute the expected signature from the raw request body and your secret key, then compare it with the signature header.
Never skip this step, even in development. A habit of ignoring signatures leads to production endpoints that don't verify.
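The core of the check can be sketched in a few lines. This sketch assumes the provider sends a plain hex HMAC-SHA256 digest of the raw body; real providers wrap it in their own header format (GitHub prefixes the digest with `sha256=` in the `X-Hub-Signature-256` header, and Stripe embeds a timestamp alongside the digest), so adapt the parsing accordingly:

```python
import hashlib
import hmac

def verify_signature(raw_body: bytes, signature_header: str, secret: str) -> bool:
    """Compare the provider-sent signature against one computed locally.

    Assumes the header carries a bare hex HMAC-SHA256 of the raw body.
    Always hash the raw bytes as received, not a re-serialized payload.
    """
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, which defeats timing attacks
    return hmac.compare_digest(expected, signature_header)
```

Note the constant-time comparison: a plain `==` leaks how many leading characters matched, which an attacker can exploit to forge signatures byte by byte.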
Queue-First Processing
The endpoint's only job is to put the raw payload into a message queue and return 200. This ensures:
- Fast response times. The endpoint responds in <100ms, well within any provider's timeout window.
- No data loss. If the worker is down, events accumulate in the queue and process when the worker recovers.
- Retry capability. Failed worker jobs can be retried from the queue without asking the provider to resend.
- Rate limiting. The worker can process at whatever pace your downstream systems handle, regardless of incoming webhook volume.
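The whole endpoint can stay this small. The sketch below uses Python's in-process `queue.Queue` as a stand-in for a real broker (SQS, Redis+BullMQ, etc.), and `signature_is_valid` is a hypothetical placeholder for your provider's HMAC check:

```python
import queue

# Stand-in for a real message broker; illustration only
event_queue: "queue.Queue[dict]" = queue.Queue()

def handle_webhook(raw_body: bytes, signature: str) -> int:
    """The endpoint does exactly two things: verify, then enqueue.

    Returns the HTTP status code to send back to the provider.
    """
    if not signature_is_valid(raw_body, signature):
        return 401  # reject forged or misconfigured requests
    event_queue.put({"raw": raw_body.decode()})
    return 200  # acknowledge immediately; the worker processes later

def signature_is_valid(raw_body: bytes, signature: str) -> bool:
    # Placeholder: substitute your provider's HMAC verification here
    return bool(signature)
```

Everything slow (database writes, downstream API calls, business logic) lives in the worker that consumes `event_queue`, so the endpoint's latency stays flat no matter how heavy processing gets.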
Queue options:
| Queue | Best For | Complexity | Cost |
|---|---|---|---|
| Redis + BullMQ | Small-medium volume, self-hosted | Low | $0 (self-hosted) |
| AWS SQS | High volume, managed | Medium | ~$0.40/million messages |
| RabbitMQ | Complex routing, self-hosted | Medium | $0 (self-hosted) |
| Google Cloud Pub/Sub | High volume, GCP stack | Medium | ~$0.40/million messages |
For most B2B applications handling under 100,000 webhooks/day, Redis + BullMQ is the simplest choice. It runs on the same server as your application, requires no additional infrastructure, and provides retry logic, dead-letter queues, and job monitoring out of the box.
Idempotency
Every webhook payload should be processed exactly once, regardless of how many times it's delivered. The implementation:
- Extract the event ID from the payload (most providers include one — Stripe's `evt_xxx`, GitHub's `X-GitHub-Delivery` header).
- Before processing, check if this event ID exists in your processed events store (a database table or Redis set).
- If it exists, skip processing and return success.
- If it doesn't exist, process the event and record the event ID.
Use a database unique constraint or Redis SET NX for atomic check-and-insert. Don't use a check-then-insert pattern — it's vulnerable to race conditions when duplicate webhooks arrive simultaneously.
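With a unique constraint, the check and the insert collapse into a single atomic operation. A minimal sketch using SQLite's `INSERT OR IGNORE` against a primary-keyed table (table and function names are illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")

def claim_event(event_id: str) -> bool:
    """Atomically claim an event ID before processing.

    True means this is the first delivery: process it. False means a
    duplicate: skip. The PRIMARY KEY constraint makes check-and-insert
    one atomic operation, so two concurrent duplicates cannot both win.
    """
    cur = db.execute(
        "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
        (event_id,),
    )
    db.commit()
    # rowcount is 1 when the row was inserted, 0 when it already existed
    return cur.rowcount == 1
```

The Redis equivalent is `SET event:<id> 1 NX EX 86400`: `NX` makes the write succeed only if the key is absent, and the expiry bounds the size of the dedup store.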
Dead-Letter Queue
When a webhook event fails processing after exhausting retries (typically 3-5 attempts), it moves to a dead-letter queue (DLQ). The DLQ stores failed events for manual inspection and reprocessing.
Critical: alert on DLQ depth. A growing DLQ means events are failing systematically, not transiently. Common causes: schema changes in the provider's payload, expired API credentials in your processing logic, or a bug introduced in a recent deployment.
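The retry-then-park loop is simple to sketch. Here the DLQ is a plain list and the retry budget is an illustrative constant; in practice your queue library (BullMQ, SQS redrive policies) provides both:

```python
MAX_ATTEMPTS = 4  # illustrative retry budget

dead_letter: list[dict] = []  # stand-in for a real dead-letter queue

def process_with_retries(event: dict, handler) -> bool:
    """Run handler up to MAX_ATTEMPTS times; park the event in the DLQ
    after the final failure instead of dropping it."""
    last_error = ""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            handler(event)
            return True
        except Exception as exc:  # a sketch: log and retry on any error
            last_error = str(exc)
    dead_letter.append(
        {"event": event, "error": last_error, "attempts": MAX_ATTEMPTS}
    )
    return False
```

Storing the final error alongside the event is what makes the DLQ useful for diagnosis: when the alert fires, the queue already tells you why each event failed.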
Common Provider Patterns
Each webhook provider has quirks that your integration code must handle.
Stripe
Stripe's webhooks are the gold standard. Event objects are versioned, payload structure is consistent, and the dashboard includes a webhook event log with replay capability. Always pin your Stripe API version and listen for api_version in webhook events to detect mismatches.
Key pattern: Stripe sends notification events, not complete data. A customer.subscription.updated event tells you something changed — it doesn't include the full diff. Your worker should fetch the current state from the Stripe API rather than relying solely on the webhook payload.
GitHub
GitHub webhooks include the event type in the X-GitHub-Event header and a unique delivery ID in X-GitHub-Delivery. The payload structure varies significantly between event types. Plan to handle at least push, pull_request, issues, and workflow_run if you're building CI/CD integrations.
HubSpot
HubSpot batches webhook events: a single webhook request can contain multiple events, so your endpoint must iterate over the batch rather than treating the payload as one event. HubSpot also caps webhook subscriptions at 1,000 per app.
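Fanning the batch out into individual queue entries keeps the worker and the idempotency logic seeing one event at a time. A sketch (the `eventId` field name matches HubSpot's payloads, but treat the exact shape as an assumption to verify against their docs):

```python
import json

def handle_batched_webhook(raw_body: bytes, enqueue) -> int:
    """Split a batched payload into individual events before queueing.

    HubSpot sends a JSON array per request; enqueue is any callable
    that pushes one event onto your queue.
    """
    events = json.loads(raw_body)
    if not isinstance(events, list):
        events = [events]  # defensive: treat a lone object as a batch of one
    for event in events:
        enqueue(event)
    return 200
```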
Slack
Slack webhooks require URL verification — when you register a webhook endpoint, Slack sends a url_verification challenge that your endpoint must echo back. Slack also enforces a 3-second response timeout, making queue-first processing mandatory.
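The verification handshake is a one-time special case in the endpoint: echo the challenge, enqueue everything else. A minimal sketch, again with an in-process queue standing in for a real broker:

```python
import json
import queue

slack_queue: "queue.Queue[dict]" = queue.Queue()

def handle_slack_event(raw_body: bytes) -> tuple[int, str]:
    """Return (status, response body) for a Slack event request.

    The url_verification challenge must be echoed back verbatim;
    every other event goes to the queue for the worker.
    """
    payload = json.loads(raw_body)
    if payload.get("type") == "url_verification":
        return 200, payload["challenge"]
    slack_queue.put(payload)
    return 200, ""
```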
Monitoring and Debugging
Logging Strategy
Log every webhook event at three points:
- Receipt. Log the raw payload (redact sensitive fields like payment info), event type, provider, and timestamp. This is your audit trail.
- Processing. Log what the worker did — records created, updated, or skipped. Include the event ID for correlation.
- Failure. Log the full error, stack trace, and the event that triggered it. Include enough context to reproduce the failure.
Store webhook logs for at least 90 days. When a data discrepancy surfaces, you'll need to replay the timeline of events to diagnose the root cause.
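A receipt-stage log line might look like the sketch below. The redaction list and field names are illustrative; the point is that correlation fields (provider, event type, timestamp) travel with a payload that has sensitive values stripped before it hits storage:

```python
import json
from datetime import datetime, timezone

# Illustrative redaction list; extend for your own sensitive fields
SENSITIVE_KEYS = {"card_number", "cvc", "ssn"}

def receipt_log(provider: str, event_type: str, payload: dict) -> str:
    """Build the receipt-stage log line: redacted payload plus the
    correlation fields needed to replay the timeline later."""
    redacted = {
        k: ("[REDACTED]" if k in SENSITIVE_KEYS else v)
        for k, v in payload.items()
    }
    return json.dumps({
        "stage": "receipt",
        "provider": provider,
        "event_type": event_type,
        "received_at": datetime.now(timezone.utc).isoformat(),
        "payload": redacted,
    })
```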
Replay Capability
Build the ability to replay any webhook event from your logs. This means:
- Storing the raw, unmodified payload
- A replay endpoint or script that feeds stored payloads through your worker
- Idempotency handling that allows safe replay without duplicate side effects
This capability is invaluable during incident response. When a bug in your worker corrupts data for 48 hours, you can fix the bug, clear the affected records, and replay every webhook from the past 48 hours.
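A replay driver can be as small as the sketch below. It leans on the idempotency gate from earlier: events that processed correctly the first time are skipped, and only the cleared ones run again (all names here are illustrative):

```python
def replay(stored_payloads, worker, claim_event):
    """Re-feed stored payloads through the worker.

    claim_event is the idempotency gate (True on first claim), so
    events that were already processed correctly are skipped and
    replay never produces duplicate side effects.
    """
    replayed = skipped = 0
    for payload in stored_payloads:
        if claim_event(payload["event_id"]):
            worker(payload)
            replayed += 1
        else:
            skipped += 1
    return replayed, skipped
```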
Alerting Rules
| Alert | Condition | Severity |
|---|---|---|
| Endpoint down | No webhooks received in 15 minutes (during business hours) | Warning |
| Processing lag | Queue depth > 1,000 events | Warning |
| Failure spike | Error rate > 5% in a 5-minute window | Critical |
| DLQ growth | Dead-letter queue depth > 0 | Critical (investigate immediately) |
| Signature failure | Any verification failure | Critical (possible attack or config drift) |
FAQ
When should we use webhooks vs polling? Webhooks for time-sensitive events where you need near-real-time notification (payments, user actions, CI/CD triggers). Polling for data that changes infrequently or where you need complete state snapshots (CRM records, inventory, analytics aggregates). If the provider offers both, prefer webhooks for freshness and polling for completeness — run a nightly reconciliation poll to catch anything webhooks missed.
How do we test webhook integrations locally? Use ngrok or Cloudflare Tunnel to expose your local development server to the internet. Configure the provider to send webhooks to your tunnel URL. For automated testing, mock the webhook payloads in your test suite and send them directly to your endpoint handler.
What's the maximum payload size we should expect? Most providers limit payloads to 1-5 MB. Stripe payloads are typically under 50 KB. GitHub payloads can reach several MB for large push events. Configure your web server and queue to handle the maximum expected size. Reject payloads over 10 MB as a safety measure.
How do we handle webhook migrations when switching providers? Run both webhook endpoints simultaneously during the transition. Process events from both with deduplication to prevent doubles. Once the new provider is confirmed working (monitor for 1-2 weeks), deactivate the old webhook endpoint. Never cut over in a single step.
Webhook architecture done right is invisible — events flow, data stays in sync, and your team never has to think about it. Done wrong, it's a source of silent data corruption that compounds for months. The investment in queue-first processing, idempotency, and monitoring pays for itself the first time it prevents a data incident. Empirium builds reliable integration infrastructure for B2B operators — let's discuss your integration needs.