## Setup: what "10B events" actually means here
The numbers in this post come from a representative customer — a B2C product with:
- ~10B events spanning ~30 months of history
- ~12M monthly active users
- Snowflake account on AWS, dedicated `XSMALL` warehouse for Klaritics with auto-suspend at 60s
- Event table with `event_time` (TIMESTAMP_NTZ), `user_id` (VARCHAR), `event_name` (VARCHAR), `properties` (VARIANT)
- Clustering on `event_time` (the most important detail in this entire post)
The benchmarks below were measured on real customer dashboards over a 30-day window, p50 and p95 captured by the Klaritics query layer's own instrumentation.
| Query type | p50 | p95 | Rows scanned (typical) |
|---|---|---|---|
| 30-day DAU/WAU/MAU | 0.8s | 2.1s | ~400M |
| 7-day signup→activation funnel | 1.4s | 3.2s | ~80M |
| 30-day retention curve, 8 cohorts | 1.7s | 4.1s | ~600M |
| Top 20 events by volume, 7 days | 0.6s | 1.5s | ~250M |
None of these queries touch all 10B rows. That's the entire point. Here's how.
## 1. Micro-partition pruning: the lever that does most of the work
Snowflake stores data in 50–500MB micro-partitions, each with rich metadata about the column values it contains. When a query has predicates the planner can use, it skips entire micro-partitions without reading them.
Pruning works best when:
- The data is physically sorted by the partition-pruning column (or close to it)
- The query has predicates on that column
- The predicates are range bounds (`>=`, `<`, `BETWEEN`), not just equality
For event tables, the pruning column is almost always `event_time`. Events arrive roughly in time order, so partitions naturally cluster on time. With an explicit `CLUSTER BY (event_time)` on the table, Snowflake maintains this ordering even as data drifts.
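As a sketch, setting this up looks like the following (the table and column definitions match the schema described in the setup section; everything else is illustrative):

```sql
-- Canonical long event table, clustered on event time
CREATE TABLE IF NOT EXISTS events (
  event_time TIMESTAMP_NTZ,
  user_id    VARCHAR,
  event_name VARCHAR,
  properties VARIANT
)
CLUSTER BY (event_time);

-- Inspect how well-clustered the table actually is
SELECT SYSTEM$CLUSTERING_INFORMATION('events', '(event_time)');
```

In the JSON that `SYSTEM$CLUSTERING_INFORMATION` returns, a low `average_depth` means micro-partitions overlap little on `event_time`, which is exactly what makes pruning effective.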
Klaritics' query layer treats time-bounded predicates as the highest-priority part of every generated query. A funnel query for "the last 7 days" isn't just a `WHERE event_time >= DATEADD('day', -7, CURRENT_TIMESTAMP)` — it's that predicate pushed down into every CTE, every subquery, every JOIN.
A bad funnel query (Klaritics doesn't write this):

```sql
WITH all_signups AS (
  SELECT user_id, event_time AS signup_time
  FROM events WHERE event_name = 'signup'
),
all_activations AS (
  SELECT user_id, event_time AS activated_time
  FROM events WHERE event_name = 'activated'
)
SELECT COUNT(DISTINCT s.user_id), COUNT(DISTINCT a.user_id)
FROM all_signups s
LEFT JOIN all_activations a
  ON s.user_id = a.user_id
  AND a.activated_time BETWEEN s.signup_time AND DATEADD('day', 7, s.signup_time)
WHERE s.signup_time >= DATEADD('day', -30, CURRENT_TIMESTAMP);
```

The same query, written for pruning:
```sql
WITH signups AS (
  SELECT user_id, event_time AS signup_time
  FROM events
  WHERE event_name = 'signup'
    AND event_time >= DATEADD('day', -30, CURRENT_TIMESTAMP)
),
activations AS (
  SELECT user_id, event_time AS activated_time
  FROM events
  WHERE event_name = 'activated'
    AND event_time >= DATEADD('day', -30, CURRENT_TIMESTAMP)
)
SELECT
  COUNT(DISTINCT s.user_id) AS signed_up,
  COUNT(DISTINCT a.user_id) AS activated
FROM signups s
LEFT JOIN activations a
  ON s.user_id = a.user_id
  AND a.activated_time BETWEEN s.signup_time AND DATEADD('day', 7, s.signup_time);
```

The "redundant" bound on `activations.event_time` is intentional. The JOIN condition already guarantees that every counted activation falls inside the window, but the planner can't prune micro-partitions from a JOIN condition; the explicit bound on the CTE gives it a hard range it can use. We're trading a tiny bit of query length for a 5–20× reduction in rows scanned.
## 2. Choose the right SQL pattern per warehouse
Funnels are computationally interesting because there are at least three valid SQL shapes for them:
- Shape A — self-join chain. Easiest to read, common in tutorials, scales badly.
- Shape B — window-function flatten. Use `LEAD()` over an ordered partition to get each user's "next" event. Faster than self-joins when the funnel is short.
- Shape C — conditional aggregation per user. Build a per-user bitmap of which steps they hit, then aggregate. Best for fixed-step funnels.
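For a fixed two-step funnel, Shape C can be sketched like this (illustrative only; the step names and 30-day window are assumptions, and Klaritics generates these queries itself):

```sql
-- Shape C: one pass over events, per-user conditional aggregation
SELECT
  COUNT_IF(signup_time IS NOT NULL) AS signed_up,
  COUNT_IF(activated_time IS NOT NULL
           AND activated_time >= signup_time
           AND activated_time <= DATEADD('day', 7, signup_time)) AS activated
FROM (
  -- Collapse each user's history to their first occurrence of each step
  SELECT
    user_id,
    MIN(IFF(event_name = 'signup',    event_time, NULL)) AS signup_time,
    MIN(IFF(event_name = 'activated', event_time, NULL)) AS activated_time
  FROM events
  WHERE event_time >= DATEADD('day', -30, CURRENT_TIMESTAMP)
  GROUP BY user_id
);
```

For funnels with more steps, the inner query grows one conditional `MIN` per step — the "bitmap" in the description above — while the scan count stays at one.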
The right shape depends on the warehouse and the funnel definition. ClickHouse has `windowFunnel()` — purpose-built. Snowflake doesn't, but it has a fast `LEAD`/`LAG` implementation and excellent micro-partition pruning, so Shape B with aggressive predicate pushdown wins for most cases.
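A minimal Shape B sketch for the same two-step funnel (again, event names and window are assumptions, not Klaritics' actual generated SQL):

```sql
-- Shape B: order each user's funnel events, then look ahead with LEAD()
WITH ordered AS (
  SELECT
    user_id,
    event_name,
    event_time,
    LEAD(event_name) OVER (PARTITION BY user_id ORDER BY event_time) AS next_event,
    LEAD(event_time) OVER (PARTITION BY user_id ORDER BY event_time) AS next_time
  FROM events
  -- Restricting to the funnel's own events keeps LEAD(1) meaningful
  WHERE event_name IN ('signup', 'activated')
    AND event_time >= DATEADD('day', -30, CURRENT_TIMESTAMP)
)
SELECT
  COUNT(DISTINCT user_id) AS signed_up,
  COUNT(DISTINCT IFF(next_event = 'activated'
                     AND next_time <= DATEADD('day', 7, event_time),
                     user_id, NULL)) AS activated
FROM ordered
WHERE event_name = 'signup';
```

Note the `IN` filter in the CTE: `LEAD()` only sees the immediately following row, so unrelated events between the two steps must be filtered out before the window function runs.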
## 3. Aggressive caching at the query layer, never at the data layer
Klaritics caches in two places:
- Query result cache. When the same query (post-normalization) runs twice within a TTL, the second call returns from cache without hitting Snowflake.
- Aggregate cache. Common dashboard widgets are pre-computed on a schedule and served from cache, with a fallback to live query when stale.
What we do not cache:
- Raw event rows. Ever.
- User-level data outside your warehouse.
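If you want a similar effect to the aggregate cache without Klaritics, one option is a scheduled rollup inside your own warehouse — this is a sketch, not how Klaritics' cache is implemented, and the task name, warehouse name, window, and schedule are all illustrative:

```sql
-- Pre-aggregated daily active users, refreshed hourly by a Snowflake task
CREATE TASK IF NOT EXISTS refresh_daily_active_users
  WAREHOUSE = klaritics_wh
  SCHEDULE = '60 MINUTE'
AS
  CREATE OR REPLACE TABLE daily_active_users AS
  SELECT
    DATE_TRUNC('day', event_time)::DATE AS day,
    COUNT(DISTINCT user_id) AS dau
  FROM events
  WHERE event_time >= DATEADD('day', -30, CURRENT_TIMESTAMP)
  GROUP BY 1;

-- Tasks are created suspended; start it explicitly
ALTER TASK refresh_daily_active_users RESUME;
```

Dashboard widgets then read the tiny rollup table instead of scanning the event table on every load.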
## 4. Right-size the warehouse for the workload
Snowflake's virtual warehouses are independent compute clusters that don't share resources. The cheapest way to make Klaritics fast is to give it its own warehouse and right-size it.
For exploratory analytics workloads, `XSMALL` with `AUTO_SUSPEND = 60` and `AUTO_RESUME = TRUE` is the right starting point. For dashboard-heavy workloads, bump to `SMALL` or `MEDIUM`, or enable multi-cluster scaling during peaks.
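The starting configuration described above, as DDL (the warehouse name is an assumption):

```sql
-- Dedicated XSMALL warehouse for analytics; suspends after 60s idle
CREATE WAREHOUSE IF NOT EXISTS klaritics_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE;

-- For dashboard-heavy peaks: scale out instead of up
-- (multi-cluster warehouses require Snowflake Enterprise edition)
ALTER WAREHOUSE klaritics_wh SET
  MIN_CLUSTER_COUNT = 1,
  MAX_CLUSTER_COUNT = 3;
```

Keeping analytics on its own warehouse also means a runaway dashboard query can never contend with your production ELT jobs for compute.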
## What we're not telling you
1. The first query of the day is slower. Cold compilations and cold caches still apply.
2. Custom SQL escape hatches lose guarantees. Free-form SQL can bypass pruning best practices.
3. Schema choices matter a lot. A canonical long event table outperforms a wide pivoted table for these workloads.
## The 30-second summary
To query 10B events on Snowflake in under 2 seconds:
- Cluster event tables on `event_time`.
- Push time predicates into every CTE, subquery, and JOIN.
- Choose SQL shape based on warehouse and funnel definition.
- Cache aggregated results, never raw data.
- Give analytics its own virtual warehouse.
## Try it
If you have a Snowflake account and an event table, you can have Klaritics querying it in under 30 minutes. Deploy Klaritics →
