Building a Customer Data Platform from Scratch

Empirium Team | 12 min read

Segment charges $120/month for its cheapest plan and $12,000+/month for the features you actually need (identity resolution, computed traits, audiences). For a B2B company tracking 50,000 users across web, email, CRM, and billing, the annual cost of a commercial CDP runs $60,000-$200,000.

The functions a CDP performs — collecting events, resolving identities, building user profiles, and activating segments — aren't magic. They're data engineering patterns that you can build on your own infrastructure with open-source tools and a data warehouse you probably already have.

Here's how to build a CDP from scratch, when it makes sense, and when you should just buy one.

What a CDP Actually Does

Strip away the marketing buzz and a CDP performs four functions:

1. Event Collection

Capture user interactions across every touchpoint: page views, button clicks, form submissions, email opens, API calls, support tickets. Each event includes a user identifier, timestamp, event type, and properties.

This is the same event tracking that Google Analytics, Mixpanel, or Amplitude provide — but under your control, in your infrastructure, with your data retention policies.

2. Identity Resolution

The same person visits your website anonymously, fills out a form (now you know their email), downloads your app (now they have a device ID), and contacts support (now they have a ticket ID). Identity resolution connects these disparate identifiers into a single unified profile.

This is the hardest and most valuable function. Without it, you have fragmented data about anonymous sessions. With it, you have a complete view of each customer's journey.

3. Audience Segmentation

Based on unified profiles, create dynamic segments: "users who visited pricing in the last 7 days but haven't started a trial," "customers whose usage dropped 50% month-over-month," "leads who match our ICP and have a lead score above 70."

These segments are computed from the unified data and update automatically as new events arrive.

4. Activation

Push segments and computed user attributes to operational tools: enriched profiles to your CRM, high-intent segments to ad platforms, churn risk scores to customer success tools, personalization signals to your website.

This is the reverse ETL pattern — data flows into the warehouse for computation and flows back out for action.

The DIY Architecture

A DIY CDP has four layers, each built with open-source or low-cost tools:

Collection          →  Storage         →  Computation      →  Activation
─────────────         ─────────────      ──────────────       ─────────────
Jitsu / Snowplow      BigQuery /         dbt models           Census /
Custom SDKs           Snowflake /        Identity graph       Hightouch /
Server-side events    Postgres           Segment logic        Custom APIs
Webhook receivers                        Computed traits

Layer 1: Event Collection

Option A: Jitsu (open-source Segment alternative)

Jitsu provides JavaScript SDKs, server-side SDKs, and webhook receivers that collect events and pipe them directly to your data warehouse. It's a drop-in replacement for Segment's collection layer.

Setup: Docker container + JavaScript snippet on your site. Cost: $0 (self-hosted) or $99/month (cloud).

Option B: Snowplow (enterprise-grade event collection)

Snowplow is the most comprehensive open-source event collection platform. It supports web, mobile, server-side, and IoT event collection with schema validation and enrichment.

Setup: More complex than Jitsu — requires AWS/GCP infrastructure (Kinesis/Pub/Sub, S3/GCS, BigQuery/Redshift). Cost: $0 (open-source) + $200-$500/month infrastructure.

Option C: Custom event collection

For simpler needs, a custom API endpoint that receives events and writes them to your warehouse works fine:

POST /api/events
{
  "user_id": "usr_123",
  "anonymous_id": "anon_abc",
  "event": "pricing_page_viewed",
  "timestamp": "2026-05-08T14:30:00Z",
  "properties": { "plan": "pro", "source": "google" }
}

A Node.js or Python service that validates, enriches, and batch-inserts events into BigQuery takes 2-3 days to build.
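As a rough illustration of the validate, enrich, and batch steps, here is a minimal Python sketch. The required fields, the `received_at` enrichment, and the `EventBuffer` class are assumptions made for this example; the actual warehouse insert is left to the caller.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


def validate_event(payload: dict[str, Any]) -> dict[str, Any]:
    """Return a normalized row for raw_events, or raise ValueError."""
    for name in ("event", "timestamp"):
        if not payload.get(name):
            raise ValueError(f"missing field: {name}")
    if not (payload.get("user_id") or payload.get("anonymous_id")):
        raise ValueError("need user_id or anonymous_id")
    # Accept a trailing "Z" on Python versions whose fromisoformat rejects it.
    ts = datetime.fromisoformat(payload["timestamp"].replace("Z", "+00:00"))
    if ts.tzinfo is None:
        raise ValueError("timestamp must include a UTC offset")
    return {
        "user_id": payload.get("user_id"),
        "anonymous_id": payload.get("anonymous_id"),
        "event_type": payload["event"],
        "timestamp": ts.astimezone(timezone.utc).isoformat(),
        "properties": payload.get("properties", {}),
        # Enrichment (assumed): record when the server received the event.
        "received_at": datetime.now(timezone.utc).isoformat(),
    }


@dataclass
class EventBuffer:
    """Collects validated events and hands them back in batches."""

    batch_size: int = 500
    _rows: list = field(default_factory=list)

    def add(self, payload: dict[str, Any]) -> list:
        self._rows.append(validate_event(payload))
        # Return a full batch once the threshold is reached, else nothing.
        return self.flush() if len(self._rows) >= self.batch_size else []

    def flush(self) -> list:
        batch, self._rows = self._rows, []
        return batch  # the caller writes this batch to the warehouse
```

Batching matters because warehouses generally handle a few large inserts better than many single-row writes; call `flush()` on a timer as well so quiet periods do not leave events stranded in memory.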

Layer 2: Storage

Your data warehouse is the storage layer. Events, user profiles, and computed segments all live here. BigQuery, Snowflake, or Postgres — whichever you're already using.

Schema design for CDP storage:

Table            Purpose                   Key Columns
──────────────   ───────────────────────   ───────────
raw_events       All collected events      event_id, user_id, anonymous_id, event_type, properties, timestamp
identity_graph   User identity mappings    canonical_id, identifier_type, identifier_value, first_seen, last_seen
user_profiles    Unified user attributes   canonical_id, email, name, company, first_seen, last_active, computed traits
segments         Segment membership        canonical_id, segment_name, entered_at, exited_at
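To make the schema concrete, here is a sketch of the four tables as DDL, run against an in-memory SQLite database so it can be tried locally. The column types, the JSON-as-text encoding, and the primary keys are assumptions; adapt them to your warehouse's dialect.

```python
import sqlite3

SCHEMA = """
CREATE TABLE raw_events (
    event_id      TEXT PRIMARY KEY,
    user_id       TEXT,
    anonymous_id  TEXT,
    event_type    TEXT NOT NULL,
    properties    TEXT,            -- JSON-encoded (assumed)
    timestamp     TEXT NOT NULL    -- ISO 8601, UTC
);
CREATE TABLE identity_graph (
    canonical_id      TEXT NOT NULL,
    identifier_type   TEXT NOT NULL,
    identifier_value  TEXT NOT NULL,
    first_seen        TEXT,
    last_seen         TEXT,
    -- One identifier can belong to only one canonical user.
    PRIMARY KEY (identifier_type, identifier_value)
);
CREATE TABLE user_profiles (
    canonical_id     TEXT PRIMARY KEY,
    email            TEXT,
    name             TEXT,
    company          TEXT,
    first_seen       TEXT,
    last_active      TEXT,
    computed_traits  TEXT          -- JSON-encoded (assumed)
);
CREATE TABLE segments (
    canonical_id  TEXT NOT NULL,
    segment_name  TEXT NOT NULL,
    entered_at    TEXT NOT NULL,
    exited_at     TEXT             -- NULL while the user is still a member
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the four CDP tables on the given connection."""
    conn.executescript(SCHEMA)
```

The composite primary key on `identity_graph` is a deliberate guard: an attempt to map the same identifier to a second canonical user fails loudly instead of silently creating a false merge.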

Layer 3: Computation (dbt Models)

dbt transforms raw events into useful structures:

Identity resolution model: Joins anonymous sessions with identified users based on shared identifiers (same device, same IP + browser fingerprint, explicit identification via form fill).
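The simplest version of this model handles only explicit identification: any event that carries both an anonymous_id and a user_id links the two, and earlier anonymous events inherit that user. A sketch, with the SQL wrapped in Python and SQLite so it runs locally (in dbt this would be a plain SQL model). Picking `MIN(user_id)` when one browser maps to several users is a simplification assumed here, not a recommendation for shared devices.

```python
import sqlite3

STITCH_SQL = """
SELECT e.event_id,
       COALESCE(e.user_id, m.user_id, e.anonymous_id) AS resolved_id
FROM raw_events AS e
LEFT JOIN (
    -- Every anonymous_id that was ever seen alongside a user_id.
    SELECT anonymous_id, MIN(user_id) AS user_id
    FROM raw_events
    WHERE user_id IS NOT NULL AND anonymous_id IS NOT NULL
    GROUP BY anonymous_id
) AS m ON e.anonymous_id = m.anonymous_id
"""


def stitch(rows: list[tuple]) -> dict[str, str]:
    """Map event_id -> resolved id for (event_id, user_id, anonymous_id) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE raw_events (event_id TEXT, user_id TEXT, anonymous_id TEXT)"
    )
    conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", rows)
    return dict(conn.execute(STITCH_SQL).fetchall())
```

Events from sessions that were never identified keep their anonymous_id as the resolved id, so nothing is dropped and they can be re-attributed on a later run.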

User profile model: Aggregates all events and attributes for each resolved user into a single profile row. Computes derived traits: total page views, last active date, product usage score, engagement level.
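A profile model of this kind is mostly one GROUP BY. The sketch below assumes a `resolved_events` table (events already tagged with a canonical_id) and uses invented thresholds for the engagement level; both are illustrations, not a standard.

```python
import sqlite3

PROFILE_SQL = """
SELECT canonical_id,
       SUM(CASE WHEN event_type = 'page_viewed' THEN 1 ELSE 0 END) AS total_page_views,
       MIN(timestamp) AS first_seen,
       MAX(timestamp) AS last_active,
       -- Hypothetical thresholds; tune them to your own event volumes.
       CASE WHEN COUNT(*) >= 10 THEN 'high'
            WHEN COUNT(*) >= 3  THEN 'medium'
            ELSE 'low' END AS engagement_level
FROM resolved_events
GROUP BY canonical_id
ORDER BY canonical_id
"""


def build_profiles(events: list[tuple]) -> list[dict]:
    """Aggregate (canonical_id, event_type, timestamp) rows into profiles."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.execute(
        "CREATE TABLE resolved_events (canonical_id TEXT, event_type TEXT, timestamp TEXT)"
    )
    conn.executemany("INSERT INTO resolved_events VALUES (?, ?, ?)", events)
    return [dict(row) for row in conn.execute(PROFILE_SQL)]
```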

Segment model: Applies segment definitions as SQL WHERE clauses against the user profile model. Dynamic segments recompute on every dbt run (typically every 15-60 minutes).
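Here is what "segment definitions as WHERE clauses" can look like for the pricing-page example from earlier. The `last_pricing_view` and `trial_started_at` columns and the segment name are assumed for this sketch; timestamps are ISO 8601 strings in one consistent format so that string comparison matches time order.

```python
import sqlite3
from datetime import datetime, timedelta

# Segment name -> WHERE clause. These come from your own code or config,
# never from user input, because they are interpolated into SQL.
SEGMENTS = {
    "pricing_no_trial": (
        "last_pricing_view >= :cutoff_7d AND trial_started_at IS NULL"
    ),
}


def compute_segment(conn: sqlite3.Connection, name: str, as_of: datetime) -> list[str]:
    """Return the canonical_ids currently in the named segment."""
    cutoff = (as_of - timedelta(days=7)).isoformat()
    sql = (
        "SELECT canonical_id FROM user_profiles "
        f"WHERE {SEGMENTS[name]} ORDER BY canonical_id"
    )
    return [row[0] for row in conn.execute(sql, {"cutoff_7d": cutoff})]
```

Passing `as_of` explicitly instead of reading the clock inside the query keeps each run reproducible, which helps when you backfill or debug a segment.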

Layer 4: Activation

Push computed profiles and segments back to operational tools. Options:

Reverse ETL platforms: Census ($500-$2,000/month) or Hightouch ($500-$1,500/month) connect your warehouse to 100+ operational tools. They sync computed segments to your CRM, ad platforms, email tools, and customer success platforms.

Custom sync scripts: For simpler needs, a scheduled job that queries the warehouse and updates your CRM via API costs nothing beyond the development time. This works for 2-3 destination tools but becomes maintenance-heavy past that.
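A custom sync job usually has two parts: work out what changed since the last run, then push only those changes. The sketch below shows that shape. The payload fields and the `send` callable are hypothetical; `send` stands in for whatever wrapper you write around your CRM's API.

```python
from typing import Callable


def plan_sync(previous: set[str], current: set[str], segment: str) -> list[dict]:
    """Diff two membership snapshots into add/remove updates."""
    updates = [
        {"canonical_id": cid, "segment": segment, "action": "add"}
        for cid in sorted(current - previous)
    ]
    updates += [
        {"canonical_id": cid, "segment": segment, "action": "remove"}
        for cid in sorted(previous - current)
    ]
    return updates


def run_sync(
    updates: list[dict],
    send: Callable[[dict], None],  # hypothetical: wraps your CRM's API
    max_retries: int = 3,
) -> list[dict]:
    """Send each update, retrying on connection errors; return the failures."""
    failed = []
    for update in updates:
        for _ in range(max_retries):
            try:
                send(update)
                break
            except ConnectionError:
                continue
        else:
            # Every attempt failed; keep the update for the next run.
            failed.append(update)
    return failed
```

Sending only the diff keeps API usage low and makes each run cheap to repeat, which matters once the job is on a 15-60 minute schedule.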

Identity Resolution Without Vendor Lock-In

Identity resolution is the core value of a CDP and the hardest component to build well.

Deterministic Matching

Match records that share the same exact identifier:

  • Same email address across web event and CRM record
  • Same user ID across web and mobile
  • Same phone number across form submission and support ticket

Deterministic matching is straightforward and reliable. Start here.
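One common way to implement deterministic matching is a union-find structure: every record starts as its own group, and any two records that share an exact identifier are merged. A minimal sketch follows; the record shape (a dict of identifier type to value) and the email normalization are assumptions for illustration.

```python
from collections import defaultdict


def resolve(records: list[dict[str, str]]) -> list[set[int]]:
    """Group record indexes that share any exact (type, value) identifier."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        # Walk up to the group root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen: dict[tuple[str, str], int] = {}
    for idx, record in enumerate(records):
        for id_type, value in record.items():
            if not value:
                continue
            # Assumed normalization: emails compare case-insensitively.
            norm = value.strip().lower() if id_type == "email" else value.strip()
            key = (id_type, norm)
            if key in first_seen:
                parent[find(idx)] = find(first_seen[key])
            else:
                first_seen[key] = idx

    groups: dict[int, set[int]] = defaultdict(set)
    for idx in range(len(records)):
        groups[find(idx)].add(idx)
    return sorted(groups.values(), key=min)
```

Matching is transitive here: if record A shares an email with B, and B shares a phone number with C, all three land in one group even though A and C have nothing in common directly.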

Probabilistic Matching

Match records that are likely the same person but don't share an exact identifier:

  • Same IP address + same browser fingerprint within 24 hours
  • Same company domain + same first name + same job title
  • Same device type + same geographic location + sequential page views

Probabilistic matching is useful but risky. False positives (merging two different people into one profile) are worse than false negatives (keeping profiles separate). Default to conservative matching — it's easier to merge profiles later than to split incorrectly merged ones.
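Conservative matching can be expressed as a score with two thresholds: merge automatically only when the evidence is strong, and send the middle band to review instead of guessing. The weights and thresholds below are invented for this sketch and would need tuning against your own data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    """Which of the three signals above fired for a pair of profiles."""

    same_ip_and_fingerprint_24h: bool = False
    same_domain_name_title: bool = False
    same_device_geo_sequential: bool = False


# Hypothetical weights (points out of 100) and thresholds.
WEIGHTS = {
    "same_ip_and_fingerprint_24h": 50,
    "same_domain_name_title": 40,
    "same_device_geo_sequential": 20,
}
MERGE_THRESHOLD = 85   # high bar: false merges are the costly mistake
REVIEW_THRESHOLD = 50


def match_score(signals: Signals) -> int:
    """Sum the weights of the signals that fired, capped at 100."""
    total = sum(w for name, w in WEIGHTS.items() if getattr(signals, name))
    return min(total, 100)


def decide(signals: Signals) -> str:
    """Return 'merge', 'review', or 'separate' for a candidate pair."""
    score = match_score(signals)
    if score >= MERGE_THRESHOLD:
        return "merge"
    if score >= REVIEW_THRESHOLD:
        return "review"
    return "separate"
```

With these numbers no single signal can trigger a merge on its own, which is the point: one coincidence (a shared office IP, say) should never be enough to fuse two profiles.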

The Identity Graph

Store identity resolution results in a graph structure:

canonical_user_123
  ├── email: user@example.com
  ├── anonymous_id: anon_456 (web session March)
  ├── anonymous_id: anon_789 (web session April)
  ├── user_id: usr_123 (app login)
  ├── crm_id: sf_00123 (Salesforce)
  └── support_id: zen_456 (Zendesk)

Each canonical user has multiple identifiers. When a new event arrives with any of these identifiers, it's attributed to the canonical user. When a new identifier is discovered (anonymous session later identified via form fill), the graph is updated and historical events are re-attributed.
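The update-and-re-attribute behavior can be sketched with an in-memory graph. This is an illustration of the mechanics only: the `IdentityGraph` class and its integer canonical ids are assumptions, and a production version would live in the `identity_graph` table and apply the conservative matching rules above before merging.

```python
class IdentityGraph:
    """Maps (identifier_type, identifier_value) pairs to canonical users."""

    def __init__(self) -> None:
        self.canonical_of: dict[tuple[str, str], int] = {}
        self.events: list[dict] = []
        self._next_id = 1

    def record_event(self, identifiers: dict[str, str], event_type: str) -> int:
        """Attribute an event; merge users if it links known identifiers."""
        keys = list(identifiers.items())
        known = sorted({self.canonical_of[k] for k in keys if k in self.canonical_of})
        if known:
            canonical = known[0]  # keep the oldest canonical user
        else:
            canonical = self._next_id
            self._next_id += 1
        for other in known[1:]:
            self._merge(other, canonical)
        for key in keys:
            self.canonical_of[key] = canonical
        self.events.append({"canonical_id": canonical, "event_type": event_type})
        return canonical

    def _merge(self, old: int, new: int) -> None:
        # Repoint identifiers, then re-attribute historical events.
        for key, cid in list(self.canonical_of.items()):
            if cid == old:
                self.canonical_of[key] = new
        for event in self.events:
            if event["canonical_id"] == old:
                event["canonical_id"] = new
```

For example, an anonymous page view and a CRM record start as two canonical users; a later form fill that carries both the anonymous_id and the email collapses them into one, and the earlier events follow.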

When to Buy vs Build

Factor                    Build DIY                                Buy (Segment/mParticle/RudderStack)
───────────────────────   ──────────────────────────────────────   ───────────────────────────────────
Monthly data volume       < 100M events                            > 100M events or need real-time
Budget                    < $2,000/month for data infrastructure   > $5,000/month budget for CDP
Team capability           Has data engineer + dbt knowledge        No data engineering capacity
Customization needs       Highly specific to your business         Standard e-commerce or SaaS patterns
Data residency            Must control where data lives            Flexible on data location
Identity complexity       < 5 identifier types                     > 5 types, complex matching rules
Activation destinations   < 5 tools                                > 10 tools

The Middle Ground: RudderStack and Jitsu

Open-source CDPs like RudderStack and Jitsu provide Segment-like functionality with self-hosting options:

Tool          Collection   Identity   Warehouse Sync   Self-Hosted Cost
───────────   ──────────   ────────   ──────────────   ────────────────────────────
RudderStack   Full SDKs    Basic      Native           $0 + infra ($200-$500/month)
Jitsu         Full SDKs    Basic      Native           $0 + infra ($100-$300/month)
Segment       Full SDKs    Advanced   Native           Cloud-only ($12,000+/year)

RudderStack covers roughly 80% of Segment's functionality at about 20% of the cost. If you need robust event collection and warehouse integration but can handle identity resolution in dbt, RudderStack is the best compromise.

FAQ

How do we handle GDPR/privacy in a DIY CDP?

Build deletion capability from day one. When a user requests data deletion, you must be able to identify all records associated with their identity, delete them from the warehouse and all downstream systems, and confirm deletion within 30 days. This is easier in a DIY CDP (you control the data) than in a commercial one (you depend on the vendor's deletion process).
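The warehouse half of a deletion request can be a single transactional routine that walks the identity graph first, so that raw events keyed by user_id or anonymous_id are found too. A sketch against the four tables described earlier, using SQLite for illustration; downstream systems still need their own deletion calls, which are out of scope here.

```python
import sqlite3


def delete_user(conn: sqlite3.Connection, canonical_id: str) -> dict[str, int]:
    """Delete one user's rows everywhere; return per-table counts as an audit record."""
    identifiers = conn.execute(
        "SELECT identifier_type, identifier_value FROM identity_graph "
        "WHERE canonical_id = ?",
        (canonical_id,),
    ).fetchall()

    counts: dict[str, int] = {"raw_events": 0}
    with conn:  # one transaction: either everything is deleted or nothing
        # Raw events are keyed by user_id / anonymous_id, not canonical_id.
        for id_type, value in identifiers:
            if id_type in ("user_id", "anonymous_id"):
                cur = conn.execute(
                    f"DELETE FROM raw_events WHERE {id_type} = ?", (value,)
                )
                counts["raw_events"] += cur.rowcount
        # Delete the identity graph last, since it was needed for the lookup.
        for table in ("segments", "user_profiles", "identity_graph"):
            cur = conn.execute(
                f"DELETE FROM {table} WHERE canonical_id = ?", (canonical_id,)
            )
            counts[table] = cur.rowcount
    return counts
```

Storing the returned counts alongside the request gives you something concrete to point to when confirming the deletion.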

Real-time vs batch processing: which do we need?

For most B2B use cases, batch processing (every 15-60 minutes) is sufficient. Real-time processing matters for in-app personalization, real-time chat routing based on user profile, and time-sensitive triggered actions. If you need real-time, consider RudderStack or Jitsu's streaming capabilities rather than building a real-time pipeline from scratch.

How long should we retain event data?

Keep raw events for 13-25 months (enough for year-over-year comparison). Keep computed profiles and segments indefinitely (they're small). Implement data lifecycle policies that automatically archive or delete raw events past the retention window. Storage is cheap, but unbounded retention creates GDPR liability.
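If your warehouse has no built-in expiration policy, a scheduled purge job does the same thing. A minimal sketch, assuming timestamps are stored as ISO 8601 strings in one consistent format (so string comparison matches time order) and approximating 25 months as 760 days:

```python
import sqlite3
from datetime import datetime, timedelta

RETENTION_DAYS = 760  # assumption: roughly 25 months


def purge_old_events(
    conn: sqlite3.Connection,
    as_of: datetime,
    retention_days: int = RETENTION_DAYS,
) -> int:
    """Delete raw events older than the retention window; return the row count."""
    cutoff = (as_of - timedelta(days=retention_days)).isoformat()
    with conn:
        cur = conn.execute("DELETE FROM raw_events WHERE timestamp < ?", (cutoff,))
    return cur.rowcount
```

If you would rather archive than delete, copy the matching rows to cold storage inside the same transaction before the DELETE runs.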

How do we test identity resolution?

Create test users with known multiple identifiers. Push events through each identifier. Verify that the identity graph correctly resolves all identifiers to one canonical user. Run this test suite after every change to your identity resolution logic. False merges are insidious: they corrupt downstream data silently.
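A check like this can be written independently of how resolution is implemented: give it the resolver's output (identifier to canonical id) and the groups you expect, and it reports both splits and false merges. The function name and input shapes are assumptions for this sketch.

```python
def check_resolution(
    resolved: dict[str, str],
    expected_groups: list[set[str]],
) -> list[str]:
    """Compare resolver output with known test users; return error messages."""
    errors: list[str] = []
    owner: dict[str, int] = {}  # canonical id -> index of the group that owns it
    for i, group in enumerate(expected_groups):
        canonicals = {resolved.get(identifier) for identifier in group}
        if None in canonicals:
            errors.append(f"group {i}: unresolved identifier")
            canonicals.discard(None)
        if len(canonicals) > 1:
            errors.append(f"group {i}: split across {sorted(canonicals)}")
        for canonical in sorted(canonicals):
            if canonical in owner:
                # Two different test users ended up in one profile.
                errors.append(
                    f"false merge: groups {owner[canonical]} and {i} share {canonical}"
                )
            else:
                owner[canonical] = i
    return errors
```

An empty list means every test user resolved to exactly one canonical id and no two test users share one. Wire it into CI so a change to the matching rules cannot ship with a silent false merge.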

A DIY CDP is a realistic project for any B2B company with a data warehouse and basic dbt skills. It won't replicate every feature of Segment or mParticle, but it will give you the event collection, identity resolution, and activation capabilities that drive 90% of the value, at a fraction of the cost and with full data ownership.

Empirium builds custom data infrastructure for B2B operators, from warehouse setup to CRM integration to CDP architecture. Talk to us.
