Counting the uncountable: how we replaced 115 days of computation with a few seconds of math
What do you do when counting your users means calling an API 200 million times?
You're a marketer about to launch a personalized campaign. You've defined your target audience (users over 25 who've completed a purchase) and before you commit resources to building creative, running A/B tests, and coordinating a rollout, you need to answer one question: "How many users actually match this audience?"
It matters more than you'd think. Personalization only pays off if the audience is large enough to move the needle. Conversion rates on targeted content are typically in the low single digits. If your audience turns out to be 50,000 users instead of the 5 million you assumed, even a stellar 3% conversion rate means 1,500 conversions. Probably not worth the effort.
The same goes for A/B testing: run an experiment on an audience that's too small and you'll wait weeks for statistical significance that never arrives, burning time and opportunity cost while the test sits inconclusive.
You need the size estimate before you invest, not after. And it sounds like a five-second problem. Write a COUNT(*) with a WHERE clause, run it, done.
Here's what that audience definition looks like in CQL:
user's age is greater than 25 and user has completed a goal "purchase"

And here's where the simplicity ends.
CQL is not SQL. We're not dealing with indexable columns in a database table. That expression evaluates against live, dynamic user context: behavioral history, session state, real-time attributes, cross-device identity.
There is no table you can SELECT COUNT(*) from. There is no materialized column called has_completed_purchase sitting in a warehouse somewhere. The only way to know whether a user matches is to ask an evaluation service that resolves identity, stitches sessions, and materializes user state on the fly.
Each evaluation is an HTTP call. Now do that for 200 million users.
The "easy" problem just got a lot harder.
Why brute force is off the table
Let's see just how bad the naive approach is.
Each CQL evaluation is an HTTP round-trip to an evaluation service that does real work: resolving identity graphs, querying behavioral stores, materializing computed attributes. Even at an optimistic 50ms per evaluation, running 200 million of them sequentially would take roughly 115 days.
Parallelism helps, of course. Throw 1,000 concurrent workers at it and you're down to about 2.8 hours. But nobody wants to wait 2 hours every time they adjust an age threshold or swap one goal for another.
Audience definition is an iterative process. Marketers tweak, preview, tweak again. A 2-hour feedback loop kills that workflow entirely. And the cost scales linearly with your user base. Double your users, double your compute. That's not a system; it's a liability.
We set ourselves a different bar entirely:
- Fast: under 5s at p95. Fast enough to run while a marketer is still looking at the screen.
- Accurate: 98% confidence level with no more than 10% margin of error.
- Cheap: orders of magnitude less than full evaluation.
- Scalable: near-constant time regardless of audience size. Whether you have 100,000 users or 100 million, the response time should be roughly the same.
These constraints rule out brute force entirely. But they also contain an important clue.
Notice we said "margin of error," not "exact answer." Marketing decisions don't hinge on whether the audience is 12,847,293 or 12,900,000. We don't need the exact count. We need a reliable estimate with known confidence bounds.
That changes everything.
Sample, evaluate, extrapolate
The insight is old. It predates computers, databases, and even electricity.
When a polling firm wants to predict the outcome of an election with 200 million eligible voters, they don't interview all 200 million people. They interview a few hundred, carefully chosen, and extrapolate. The mathematics of this are well-established and, frankly, a little surprising.
The question we need to answer is: what proportion of our user population matches a given CQL criterion? Think of each user as a coin flip. We don't know whether the coin is fair or heavily weighted, but if we flip it enough times, we can estimate the odds.
How many flips do we need? The math says: surprisingly few. For a 95% confidence interval with a 5% margin of error, the answer is 385 samples. That's it. Whether your population is 100,000 or 100 million, 385 randomly chosen users get you there. The number barely moves with population size.
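The arithmetic behind that 385 is the standard sample-size formula for estimating a proportion. A minimal sketch in Python, using the worst-case variance at p = 0.5 and the 95% z-score of 1.96:

```python
import math

def sample_size(z: float, margin_of_error: float, p: float = 0.5) -> int:
    """Minimum sample size to estimate a proportion within the given margin.

    Uses the worst-case variance p * (1 - p), which is maximized at p = 0.5,
    so the result is valid whatever the true proportion turns out to be.
    """
    return math.ceil(z**2 * p * (1 - p) / margin_of_error**2)

# 95% confidence (z = 1.96), 5% margin of error
print(sample_size(1.96, 0.05))  # 385
```

Note that the population size never appears in the formula; the finite-population correction only matters when the sample is a meaningful fraction of the population, which 385 out of 200 million is not.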
At 50ms per evaluation with moderate concurrency, a few hundred evaluations complete in seconds. The entire estimation (sampling, evaluating, and computing the result) comes in well under the 5-second target.
Not all randomness is created equal
There is a catch, and it's a big one. The statistical guarantees above hold only if the sample is truly random. Each user must have an equal probability of being selected, independent of every other user.
In a textbook, this is a one-line assumption. In a real database, it's a hard engineering problem.
The TABLESAMPLE trap
PostgreSQL provides a built-in sampling mechanism: TABLESAMPLE SYSTEM. It randomly selects physical disk blocks (typically 8KB pages) and returns all rows from those blocks. Sounds reasonable. It's fast. It's built-in.
And it's completely wrong for our use case.
In a PostgreSQL heap, rows from an insert-heavy table end up stored roughly in insertion order. Users who registered around the same time, within milliseconds of each other, share the same physical disk block.
When TABLESAMPLE SYSTEM selects a block, it selects an entire cohort of users who signed up at the same time, live in the same geography, and likely share similar behavioral patterns. If User A is selected, User B, who registered milliseconds later and sits on the same 8KB page, is virtually guaranteed to be selected too.
The "random" sample is actually a collection of tight temporal clusters. Any confidence interval computed from it is meaningless.
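The damage is easy to demonstrate with a toy simulation (illustrative numbers, not our production data): a population stored in insertion order, where the target attribute is correlated with registration time, sampled both by whole blocks and by individual rows.

```python
import random
import statistics

random.seed(42)

N, BLOCK = 100_000, 100                    # rows in insertion order, 100 per "page"
matches = [i < 40_000 for i in range(N)]   # early registrants match (40% overall)

def block_sample_estimate(k_blocks: int) -> float:
    """Pick whole blocks, the way TABLESAMPLE SYSTEM picks disk pages."""
    rows = []
    for b in random.sample(range(N // BLOCK), k_blocks):
        rows.extend(matches[b * BLOCK:(b + 1) * BLOCK])
    return sum(rows) / len(rows)

def random_sample_estimate(n: int) -> float:
    """Pick individual rows uniformly at random."""
    ids = random.sample(range(N), n)
    return sum(matches[i] for i in ids) / n

# Same budget of 400 rows for both methods, repeated 300 times
block_sd = statistics.pstdev(block_sample_estimate(4) for _ in range(300))
random_sd = statistics.pstdev(random_sample_estimate(400) for _ in range(300))
print(f"block sampling sd: {block_sd:.3f}, random sampling sd: {random_sd:.3f}")
```

With the same 400 evaluated rows, the block-sampled estimate is roughly ten times noisier: each block contributes one effective observation, not one hundred, so a confidence interval computed as if the rows were independent wildly overstates the precision.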
The UUID insight
Our user table uses UUID v4 primary keys. UUID v4 values are generated by a cryptographic pseudorandom number generator. Their ordering in a B-tree index is completely uncorrelated with registration time, geography, behavior, or any other user attribute.
Two users sitting next to each other in UUID-sorted order might have registered years apart, in different countries, with entirely different usage patterns.
The randomness we need isn't in the sampling mechanism. It's already in the data.
Method 1: B-Tree random probing
We generate a random UUID as a pivot point and grab consecutive rows from the B-tree index starting at that pivot. The query scans in both directions from the pivot to handle edge cases where the random UUID lands near the boundary of the keyspace.
Because UUID v4 values are uniformly distributed, consecutive rows in UUID order are not consecutive in any meaningful user attribute. They're effectively independent draws.
This technique falls under what Frank Olken called "random probing of B-trees" in his seminal 1989 VLDB paper Random Sampling from B+ Trees. Olken identified a subtle bias: when you pick a random search key and jump to the nearest existing key, keys that follow large gaps in the keyspace are more likely to be selected.
But UUID v4 keys are uniformly distributed across a massive keyspace, so the gaps between adjacent keys are roughly equal. The bias vanishes.
Taking N consecutive rows from a random starting point is technically cluster sampling. But because the ordering dimension (UUID) is entirely uncorrelated with any user attribute, picking a cluster of neighbors in UUID order is statistically equivalent to picking users completely at random.
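The probing query itself can be sketched as follows; the `users` table and `id` column names are illustrative stand-ins, and production code would use bind parameters rather than string interpolation:

```python
import uuid

def probe_queries(sample_size: int) -> tuple[str, str, str]:
    """Generate a random UUID pivot and the two index scans around it."""
    pivot = str(uuid.uuid4())
    # Scan forward from the pivot; the B-tree index on id makes this a cheap range scan.
    forward = (
        f"SELECT id FROM users WHERE id >= '{pivot}' "
        f"ORDER BY id ASC LIMIT {sample_size}"
    )
    # Scan backward as well, to cover pivots that land near the top of the keyspace.
    backward = (
        f"SELECT id FROM users WHERE id < '{pivot}' "
        f"ORDER BY id DESC LIMIT {sample_size}"
    )
    return pivot, forward, backward

pivot, forward, backward = probe_queries(385)
```

If the forward scan returns fewer than the requested rows, the backward scan tops up the sample, so the method works even when the pivot falls near the boundary of the keyspace.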
Method 2: Hash-based Bernoulli sampling
When we need a percentage-based sample, say exactly 5% of users, we take a different approach. We hash each user's UUID into a bucket from 0 to 99 using PostgreSQL's hashtext() function, then select all rows whose bucket falls below the desired percentage threshold (e.g., hashtext(id::text) % 100 < 5).
Each row gets an independent coin flip with a 5% chance of landing in the sample, independently of every other row. This is textbook Bernoulli sampling. No cohort bias, no correlation between selections. And it's deterministic: the same user always falls in the same bucket, making results reproducible.
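The bucketing scheme can be sketched outside the database too. PostgreSQL's hashtext() isn't available in Python, so this sketch substitutes MD5 purely for illustration; the only properties that matter are that the hash is deterministic and roughly uniform:

```python
import hashlib
import uuid

def bucket(user_id: str) -> int:
    """Deterministic bucket in [0, 100), analogous to hashtext(id::text) % 100."""
    digest = hashlib.md5(user_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100

def in_sample(user_id: str, percent: int) -> bool:
    """Bernoulli membership test: the same user always gets the same answer."""
    return bucket(user_id) < percent

users = [str(uuid.uuid4()) for _ in range(100_000)]
sampled = [u for u in users if in_sample(u, 5)]
print(len(sampled) / len(users))  # close to 0.05
```

Raising the threshold from 5 to 10 keeps every previously sampled user in the sample and adds new ones, which makes progressively larger samples cheap to build on top of earlier ones.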
The mathematical foundation for this approach is well-established in the database literature. Chaudhuri, Motwani, and Narasayya formalized the validity of hash-based sampling in their 1999 SIGMOD paper On Random Sampling over Joins, showing that hash-based signatures preserve statistical validity across relational operations while allowing the query optimizer to efficiently utilize B-tree indexes.
Bayesian inference in practice
Once we have our sample, each user is evaluated against the CQL criteria via an HTTP call to the evaluation engine. The engine resolves the user's identity, materializes their context, and returns a boolean: match or no match.
Evaluations run concurrently in controlled batches to avoid overwhelming the evaluation service. If an individual evaluation fails (timeout, transient error, service hiccup), we simply discard it. This reduces the effective sample size but doesn't bias the results. A failed evaluation tells us nothing about whether the user matches, so the correct thing to do is ignore it entirely.
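The evaluation loop looks roughly like this; evaluate_user is a stand-in for the real HTTP call, and the worker count is illustrative:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def estimate_matches(user_ids, evaluate_user, max_workers=50):
    """Evaluate sampled users concurrently; discard failures rather than retry.

    A failed evaluation carries no information about whether the user matches,
    so dropping it shrinks the effective sample without biasing the estimate.
    """
    matches = evaluated = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(evaluate_user, uid) for uid in user_ids]
        for fut in as_completed(futures):
            try:
                if fut.result():
                    matches += 1
                evaluated += 1
            except Exception:
                continue  # timeout or transient error: drop this sample point
    return matches, evaluated
```

The pair (matches, evaluated) is all the downstream statistics need; a few dropped evaluations just widen the confidence interval slightly.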
From counts to confidence
Say we evaluate 385 users and 120 of them match. The naive estimate is simple: 120 / 385 ≈ 31%. But how confident should we be in that number? And what range of values is plausible?
The simplest approach would be to just report 31% and move on. But that breaks down at the edges. If all 385 users match, the naive estimate is 100%. If none match, it's 0%. Both are almost certainly wrong. Just because everyone in your sample matched doesn't mean literally every user in the population will.
We use a Bayesian approach instead, following the framework described in Bayesian Data Analysis by Gelman et al. The intuition is straightforward: before looking at the data, we start with no assumptions about the true proportion. It could be anything from 0% to 100%. Then we update our belief based on what we observe.
Imagine a jar with millions of marbles, some red, some blue. You can't see inside, but you can draw a handful. If you pull 30 red and 70 blue, your best guess is that about 30% of the jar is red. But you wouldn't bet your life on exactly 30%. Maybe it's 25%, maybe 35%. The Bayesian approach gives you both the estimate and the range of plausible values, and it gets sharper the more marbles you draw.
Mathematically, this is Laplace's Rule of Succession: we start with a uniform prior (all proportions equally likely) and update it with the observed data. The prior also acts as a gentle nudge away from extreme estimates. With 385 successes and 0 failures, instead of reporting 100%, the model reports (385 + 1) / (385 + 2) ≈ 99.7%. A small correction, but it prevents degenerate estimates that could mislead downstream decisions.
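In code, the update is one line: with a uniform prior, observing s matches out of n gives a posterior mean of (s + 1) / (n + 2).

```python
def laplace_estimate(matches: int, n: int) -> float:
    """Posterior mean under a uniform prior (Laplace's Rule of Succession)."""
    return (matches + 1) / (n + 2)

print(laplace_estimate(120, 385))  # ~0.313, barely different from the raw 120/385
print(laplace_estimate(385, 385))  # ~0.997, not 100%
print(laplace_estimate(0, 385))    # ~0.003, not 0%
```

For mid-range proportions the correction is invisible; it only bites at the extremes, which is exactly where the naive estimate misleads.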
From the updated belief we extract two things: a best estimate of the proportion, and a credible interval that captures how uncertain we are. Brown, Cai, and DasGupta provide a thorough analysis of interval estimation techniques in Interval Estimation for a Binomial Proportion.
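One way to extract such an interval with only the standard library is to sample from the Beta posterior and take empirical percentiles. This Monte Carlo sketch is an approximation; an exact Beta quantile function (e.g. from scipy) would do the same job directly:

```python
import random

def credible_interval(matches: int, n: int, level: float = 0.95,
                      draws: int = 100_000) -> tuple[float, float]:
    """Equal-tailed credible interval from the Beta(matches+1, n-matches+1) posterior."""
    samples = sorted(
        random.betavariate(matches + 1, n - matches + 1) for _ in range(draws)
    )
    tail = (1 - level) / 2
    return samples[int(tail * draws)], samples[int((1 - tail) * draws) - 1]

lo, hi = credible_interval(120, 385)
print(f"{lo:.3f} - {hi:.3f}")  # roughly 0.27 - 0.36
```

The point estimate and the interval come from the same posterior, so they can never disagree with each other, which is a practical advantage over bolting a separate interval formula onto a point estimate.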
Finally, we scale proportions to absolute numbers by multiplying by the total population. If we estimate 30% with a 25%–35% interval and there are 10 million users, the result is 3 million users matching, with a confidence interval of 2.5 to 3.5 million.
Where accuracy breaks down
The margin of error is not constant. It depends on the true proportion, and it gets dramatically worse for small audiences.
Consider a sample of 385 users. If 200 of them match (about 52%), the relative margin of error is around 9.6%. Perfectly fine for decision-making. But if only 5 match (about 1.3%), the relative margin of error balloons to 87%. The confidence interval is technically correct, but the estimate is so wide as to be nearly useless.
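Those numbers fall straight out of the normal approximation. A quick check, using the 95% z-score of 1.96:

```python
import math

def relative_margin_of_error(matches: int, n: int, z: float = 1.96) -> float:
    """Relative half-width of the normal-approximation confidence interval."""
    p = matches / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return margin / p

print(f"{relative_margin_of_error(200, 385):.1%}")  # ~9.6%
print(f"{relative_margin_of_error(5, 385):.1%}")    # ~87%
```

The absolute margin actually shrinks as p approaches zero, but the estimate itself shrinks faster, so the ratio of the two blows up.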
This isn't a flaw in the method. It's a fundamental property of proportion estimation. When the true proportion is near 0% or 100%, the absolute variance of the binomial distribution shrinks, but relative to the tiny estimate, the uncertainty grows.
Estimating that 1% of 200 million users match (2 million) versus 2% (4 million) is a much harder statistical problem than distinguishing 50% from 51%.
Our approach handles this gracefully in two ways.
First, the Bayesian prior helps at the extremes. When zero samples match, the naive estimate is exactly 0%. Our model gives a small but nonzero estimate that honestly reflects the uncertainty. We know from only 385 observations that the proportion is probably very small, but we can't be sure it's exactly zero.
Second, we report confidence intervals alongside point estimates, making the uncertainty explicit. When the interval is 0.2%–2.4%, that's genuinely useful information: the audience is small, probably under 5% of the population. The marketer can act on that even without a precise number.
The power of "good enough"
The marketer asked "how many users match this audience?" and got an answer in single-digit seconds, from a sample of a few hundred users out of hundreds of millions.
The estimate is not exact. But it's useful. It comes with confidence bounds that quantify the uncertainty. It's fast enough to run interactively, while the marketer is still refining the audience definition, tweaking thresholds, comparing segments.
And it works with any CQL expression, no matter how complex the criteria, how many data sources it touches, or how much real-time context it requires. The estimation cost is decoupled from the complexity of the query.
We're bringing this directly into the product. Soon, you'll see audience size estimates right on the audience page, giving you instant feedback as you define and refine your segments.
That's what makes this problem fun. Behind a single API response sits a chain of ideas spanning centuries: Laplace's Rule of Succession, Olken's B-tree sampling, Bernoulli trials, Bayesian conjugate priors. Each one solving a specific piece of the puzzle, turning 115 days of brute force into a few seconds of math.