The problem, stated plainly
A user starts anonymous, later signs up, and now events exist under both anonymous_id and user_id. Product analytics must decide which pre-signup events attach to the known user.
Approach 1 — Identify call merges IDs
Classic CDP pattern using a mapping table:
CREATE TABLE user_identities (
user_id TEXT,
anonymous_id TEXT,
identified_at TIMESTAMP,
PRIMARY KEY (user_id, anonymous_id)
);WITH anonymous_to_user AS (
SELECT anonymous_id, MIN(user_id) AS user_id
FROM user_identities
GROUP BY anonymous_id
)
SELECT
e.event_time,
COALESCE(e.user_id, a.user_id) AS user_id,
e.anonymous_id,
e.event_name,
e.properties
FROM events e
LEFT JOIN anonymous_to_user a
ON a.anonymous_id = e.anonymous_id;Great baseline, but struggles with multi-device and late-arriving identity events.
Approach 2 — Identity graph
Model identity as a graph of identifiers and edges, then compute connected components for canonical user IDs.
CREATE TABLE identifiers (
id_value TEXT,
id_type TEXT,
observed_at TIMESTAMP,
PRIMARY KEY (id_value, id_type)
);
CREATE TABLE identity_edges (
id_value_a TEXT,
id_type_a TEXT,
id_value_b TEXT,
id_type_b TEXT,
source_event TEXT,
observed_at TIMESTAMP,
PRIMARY KEY (id_value_a, id_type_a, id_value_b, id_type_b)
);Use transitive closure (union-find style) to build stable klaritics_user_id assignments.
Approach 3 — Deterministic vs probabilistic
Deterministic matching (shared exact identifiers) is auditable and high-confidence.
Probabilistic matching increases match rate but introduces ambiguity. Klaritics defaults deterministic, with probabilistic as explicit opt-in.
Real-time vs batch resolution
Batch resolution is fine for most funnel/retention workflows.
Real-time personalization and launch-time dashboards may need pipeline-time resolution or low-latency identity updates.
Recommended schema
events (
event_time TIMESTAMP,
user_id TEXT,
anonymous_id TEXT,
event_name TEXT,
properties JSON
)
identity_edges (
id_a_value TEXT,
id_a_type TEXT,
id_b_value TEXT,
id_b_type TEXT,
observed_at TIMESTAMP
)
identity_resolved (
source_id_value TEXT,
source_id_type TEXT,
klaritics_user_id TEXT,
resolution_method TEXT,
resolved_at TIMESTAMP,
PRIMARY KEY (source_id_value, source_id_type)
)And a view:
events_resolved AS (
SELECT
e.*,
COALESCE(r1.klaritics_user_id, r2.klaritics_user_id) AS klaritics_user_id
FROM events e
LEFT JOIN identity_resolved r1
ON r1.source_id_type = 'user_id' AND r1.source_id_value = e.user_id
LEFT JOIN identity_resolved r2
ON r2.source_id_type = 'anonymous_id' AND r2.source_id_value = e.anonymous_id
)What we still get wrong
- Late-arriving identifies are inherently tricky.
- Shared-device environments create false positives.
- Cross-device anonymous stitching is often impossible without additional signals.
Why warehouse-native is better here
Identity logic lives in your warehouse, in SQL you control and version.
When it's wrong, you can fix it directly without waiting on vendor support or black-box behavior changes.
