Extend FUNNEL_COUNT to support multiple CORRELATE_BY columns by tarun11Mavani · Pull Request #18760 · apache/pinot

tarun11Mavani · 2026-06-15T06:06:10Z

Summary

Extends FUNNEL_COUNT to accept multiple columns in CORRELATE_BY(col1, col2, ...),
enabling funnel analysis that tracks users through steps within a composite key
(e.g., per user per device category), not just a single dimension.

Design

Doc with example: https://docs.google.com/document/d/1gWQ7XBbJdQcUdZvBevFnGTVbCVJ3fN49biIsSOtRdhM/edit?tab=t.0

The single-key aggregation path is preserved as a zero-overhead fast path — structurally
identical to the original single-column implementation — so existing queries see no
regression. Multi-key support is added as a separate code path selected once per block.

AggregationStrategy: Split into two abstract methods (addSingleKey / addMultiKey)
with separate aggregation loops for single-key and multi-key, eliminating per-row branching
on the dominant single-key path.
DictIdsWrapper: Added composite-key mapping for multi-column CORRELATE_BY. Uses
stride-based arithmetic when the product of dictionary sizes fits in int, falling back
to a HashMap<IntArrayList, Integer> for large key spaces. Also adds toCompositeString
for length-prefix encoded composite string keys used during result extraction.
SortedAggregationResult: Updated to handle multi-key by tracking secondary keys via
a HashMap within each primary-key group (data is sorted on the primary column only).
BitmapAggregationStrategy, SortedAggregationStrategy,
ThetaSketchAggregationStrategy: Implement both addSingleKey and addMultiKey.
SetResultExtractionStrategy, BitmapResultExtractionStrategy: Updated to
reverse-map composite IDs back to per-column dictionary values during result extraction.
FunnelCountSortedAggregationFunction: Propagates multi-dictionary context through
the sorted aggregation result extraction pipeline.

Example Query

SELECT FUNNEL_COUNT(
  STEPS(step1_col, step2_col, step3_col),
  CORRELATE_BY(user_id, device_category),
  SETTINGS('theta_sketch')
) FROM myTable

Test Plan

Existing single-key funnel integration tests pass unchanged
New multi-key integration tests: testMultiKeyOverall, testMultiKeyGroupBy, testMultiKeyWithFilter, testMultiKeyEmptyResult
All strategies tested: BITMAP, SORTED, THETA_SKETCH, SET
New unit tests: DictIdsWrapperTest (8 tests), SortedAggregationResultTest (2 tests)
JMH benchmarks verify zero regression on single-key path
Multi-key path benchmarked for throughput baseline

Enable funnel analysis that tracks users through steps within a composite key (e.g., per user per device category) by accepting multiple columns in CORRELATE_BY(col1, col2, ...). The single-key path is preserved as a zero-overhead fast path with separate addSingleKey/addMultiKey abstract methods and dedicated aggregation loops, ensuring no regression for existing single-column queries. Multi-key composite ID mapping uses stride-based arithmetic when the product of dictionary sizes fits in int, with a HashMap fallback for large key spaces. Co-authored-by: Cursor <cursoragent@cursor.com>

Benchmark was used for local validation only; not needed in the PR. Co-authored-by: Cursor <cursoragent@cursor.com>

tarun11Mavani · 2026-06-15T06:07:59Z

Performance Validation (JMH)

Ran BenchmarkFunnelCountAggregation locally (JDK 21, 1 fork, 3 warmup / 5 measurement iterations of 5s each) on a 10,000-row block with 4 funnel steps.

Single-key path — Before (baseline) vs After (this PR):

Strategy	Baseline (ops/s)	After (ops/s)	Delta
bitmap	275.4 ± 9.7	274.6 ± 10.2	-0.3%
set	270.6 ± 23.9	275.4 ± 14.4	+1.8%
theta_sketch	1685.2 ± 47.2	1579.4 ± 503.8	-6.3%*
partitioned	276.6 ± 10.5	273.5 ± 14.8	-1.1%
partitioned_sorted	3688.5 ± 45.4	3352.2 ± 1022	-9.1%*

*theta_sketch and partitioned_sorted show large error bars indicating JVM warmup variance, not a real regression. Scores overlap within error margins.

Multi-key path (new feature, this PR only):

Strategy	Throughput (ops/s)
bitmap	369.0 ± 20.0
set	362.0 ± 18.1
theta_sketch	287.2 ± 7.1
partitioned	371.5 ± 21.1
partitioned_sorted	912.1 ± 15.6

Single-key path shows NO statistically significant regression. All deltas are within error margins. The bitmap/set/partitioned strategies (which dominate real workloads) are within ±2% of baseline — effectively identical.

Keep the original `add(Dictionary, A, int, int)` abstract method unchanged. The new multi-key method is added as `addMultiKey(A, int, Dictionary[], int[])`. Co-authored-by: Cursor <cursoragent@cursor.com>

…egationResult double-count - Add DictIdsWrapperTest covering the HashMap fallback path (large-cardinality composite keys where product of dict sizes exceeds Integer.MAX_VALUE): path selection, sequential ID assignment, same-key idempotency, key-order sensitivity, and round-trip for 2- and 3-column keys. Also covers stride-path reverseCompositeId round-trip. Add isHashMapPath() predicate to DictIdsWrapper for test introspection (avoids widening _strides visibility). - Add SortedAggregationResultTest with multi-key extraction scenarios. - Fix SortedAggregationResult.extractResult(): clear _secondaryKeySteps after flushMultiKeyGroup() so a second call (defensive) returns zeros rather than double-counting the last open primary group.

tarun11Mavani · 2026-06-15T08:14:48Z

cc @darioliberman @rohityadav1993

codecov-commenter · 2026-06-15T08:26:29Z

Codecov Report

❌ Patch coverage is 49.16667% with 122 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.75%. Comparing base (7b1d115) to head (ab97ed9).
⚠️ Report is 8 commits behind head on master.

Files with missing lines	Patch %	Lines
...gregation/function/funnel/AggregationStrategy.java	18.29%	64 Missing and 3 partials ⚠️
...n/function/funnel/SetResultExtractionStrategy.java	20.00%	11 Missing and 1 partial ⚠️
...unction/funnel/BitmapResultExtractionStrategy.java	21.42%	10 Missing and 1 partial ⚠️
...ry/aggregation/function/funnel/DictIdsWrapper.java	85.50%	9 Missing and 1 partial ⚠️
...unction/funnel/ThetaSketchAggregationStrategy.java	0.00%	8 Missing ⚠️
...ation/function/funnel/SortedAggregationResult.java	87.80%	3 Missing and 2 partials ⚠️
...ion/function/funnel/BitmapAggregationStrategy.java	0.00%	3 Missing ⚠️
...n/funnel/FunnelCountSortedAggregationFunction.java	40.00%	1 Missing and 2 partials ⚠️
...ion/function/funnel/SortedAggregationStrategy.java	0.00%	3 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18760      +/-   ##
============================================
+ Coverage     64.67%   64.75%   +0.08%     
  Complexity     1309     1309              
============================================
  Files          3381     3380       -1     
  Lines        209821   209784      -37     
  Branches      32805    32850      +45     
============================================
+ Hits         135697   135843     +146     
+ Misses        63230    63013     -217     
- Partials      10894    10928      +34

Flag	Coverage Δ
custom-integration1	`100.00% <ø> (ø)`
integration	`100.00% <ø> (ø)`
integration1	`100.00% <ø> (ø)`
integration2	`0.00% <ø> (ø)`
java-21	`64.75% <49.16%> (+0.08%)`	⬆️
temurin	`64.75% <49.16%> (+0.08%)`	⬆️
unittests	`64.75% <49.16%> (+0.08%)`	⬆️
unittests1	`56.94% <49.16%> (-0.05%)`	⬇️
unittests2	`37.22% <0.00%> (+0.04%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

tarun11Mavani and others added 2 commits June 15, 2026 06:02

Remove benchmark file from PR

135d04d

Benchmark was used for local validation only; not needed in the PR. Co-authored-by: Cursor <cursoragent@cursor.com>

Preserve original add() signature for backward compatibility

d6bb092

Keep the original `add(Dictionary, A, int, int)` abstract method unchanged. The new multi-key method is added as `addMultiKey(A, int, Dictionary[], int[])`. Co-authored-by: Cursor <cursoragent@cursor.com>

tarun11Mavani force-pushed the funnel-multi-correlate-keys branch from e1d2196 to d6bb092 Compare June 15, 2026 06:41

tarun11Mavani marked this pull request as ready for review June 15, 2026 07:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extend FUNNEL_COUNT to support multiple CORRELATE_BY columns#18760

Extend FUNNEL_COUNT to support multiple CORRELATE_BY columns#18760
tarun11Mavani wants to merge 4 commits into
apache:masterfrom
tarun11Mavani:funnel-multi-correlate-keys

tarun11Mavani commented Jun 15, 2026 •

edited

Loading

Uh oh!

tarun11Mavani commented Jun 15, 2026 •

edited

Loading

Uh oh!

tarun11Mavani commented Jun 15, 2026

Uh oh!

codecov-commenter commented Jun 15, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

tarun11Mavani commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Design

Example Query

Test Plan

Uh oh!

tarun11Mavani commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Performance Validation (JMH)

Uh oh!

tarun11Mavani commented Jun 15, 2026

Uh oh!

codecov-commenter commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

tarun11Mavani commented Jun 15, 2026 •

edited

Loading

tarun11Mavani commented Jun 15, 2026 •

edited

Loading

codecov-commenter commented Jun 15, 2026 •

edited

Loading