Identity Resolution in a Warehouse-Native Analytics Stack

The problem, stated plainly

A user starts anonymous, later signs up, and now events exist under both anonymous_id and user_id. Product analytics must decide which pre-signup events attach to the known user.

Approach 1 — Identify call merges IDs

Classic CDP pattern using a mapping table:

CREATE TABLE user_identities (
  user_id       TEXT,
  anonymous_id  TEXT,
  identified_at TIMESTAMP,
  PRIMARY KEY (user_id, anonymous_id)
);

WITH anonymous_to_user AS (
  SELECT anonymous_id, MIN(user_id) AS user_id
  FROM user_identities
  GROUP BY anonymous_id
)
SELECT
  e.event_time,
  COALESCE(e.user_id, a.user_id) AS user_id,
  e.anonymous_id,
  e.event_name,
  e.properties
FROM events e
LEFT JOIN anonymous_to_user a
  ON a.anonymous_id = e.anonymous_id;

Great baseline, but struggles with multi-device and late-arriving identity events.

Approach 2 — Identity graph

Model identity as a graph of identifiers and edges, then compute connected components for canonical user IDs.

CREATE TABLE identifiers (
  id_value      TEXT,
  id_type       TEXT,
  observed_at   TIMESTAMP,
  PRIMARY KEY (id_value, id_type)
);
 
CREATE TABLE identity_edges (
  id_value_a    TEXT,
  id_type_a     TEXT,
  id_value_b    TEXT,
  id_type_b     TEXT,
  source_event  TEXT,
  observed_at   TIMESTAMP,
  PRIMARY KEY (id_value_a, id_type_a, id_value_b, id_type_b)
);

Use transitive closure (union-find style) to build stable klaritics_user_id assignments.

Approach 3 — Deterministic vs probabilistic

Deterministic matching (shared exact identifiers) is auditable and high-confidence.

Probabilistic matching increases match rate but introduces ambiguity. Klaritics defaults deterministic, with probabilistic as explicit opt-in.

Real-time vs batch resolution

Batch resolution is fine for most funnel/retention workflows.

Real-time personalization and launch-time dashboards may need pipeline-time resolution or low-latency identity updates.

Recommended schema

events (
  event_time     TIMESTAMP,
  user_id        TEXT,
  anonymous_id   TEXT,
  event_name     TEXT,
  properties     JSON
)
 
identity_edges (
  id_a_value     TEXT,
  id_a_type      TEXT,
  id_b_value     TEXT,
  id_b_type      TEXT,
  observed_at    TIMESTAMP
)
 
identity_resolved (
  source_id_value     TEXT,
  source_id_type      TEXT,
  klaritics_user_id   TEXT,
  resolution_method   TEXT,
  resolved_at         TIMESTAMP,
  PRIMARY KEY (source_id_value, source_id_type)
)

And a view:

events_resolved AS (
  SELECT
    e.*,
    COALESCE(r1.klaritics_user_id, r2.klaritics_user_id) AS klaritics_user_id
  FROM events e
  LEFT JOIN identity_resolved r1
    ON r1.source_id_type = 'user_id' AND r1.source_id_value = e.user_id
  LEFT JOIN identity_resolved r2
    ON r2.source_id_type = 'anonymous_id' AND r2.source_id_value = e.anonymous_id
)

What we still get wrong

Late-arriving identifies are inherently tricky.
Shared-device environments create false positives.
Cross-device anonymous stitching is often impossible without additional signals.

Why warehouse-native is better here

Identity logic lives in your warehouse, in SQL you control and version.

When it's wrong, you can fix it directly without waiting on vendor support or black-box behavior changes.

Try it

Deploy Klaritics → · Talk to engineering

About the author

The Klaritics engineering team writes about query planning, performance, and architecture in warehouse-native analytics systems.

Build your analytics stack your way

Klaritics ships with default identity-resolution dbt models that you can read, modify, and own.

Talk to engineering Contact us

Identity resolution in a warehouse-native world

TL;DR

The problem, stated plainly

Approach 1 — Identify call merges IDs

Approach 2 — Identity graph

Approach 3 — Deterministic vs probabilistic

Real-time vs batch resolution

Recommended schema

What we still get wrong

Why warehouse-native is better here

Try it

About the author

Related posts

How we query 10B events on Snowflake in under 2 seconds

The hidden cost of reverse ETL

Mixpanel vs. Amplitude vs. Klaritics: a fair comparison

Build your analytics stack your way

Stop moving data. Start analyzing it.