Skip to content

Extend FUNNEL_COUNT to support multiple CORRELATE_BY columns#18760

Open
tarun11Mavani wants to merge 4 commits into
apache:masterfrom
tarun11Mavani:funnel-multi-correlate-keys
Open

Extend FUNNEL_COUNT to support multiple CORRELATE_BY columns#18760
tarun11Mavani wants to merge 4 commits into
apache:masterfrom
tarun11Mavani:funnel-multi-correlate-keys

Conversation

@tarun11Mavani

@tarun11Mavani tarun11Mavani commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Summary

Extends FUNNEL_COUNT to accept multiple columns in CORRELATE_BY(col1, col2, ...),
enabling funnel analysis that tracks users through steps within a composite key
(e.g., per user per device category), not just a single dimension.

Design

Doc with example: https://docs.google.com/document/d/1gWQ7XBbJdQcUdZvBevFnGTVbCVJ3fN49biIsSOtRdhM/edit?tab=t.0

The single-key aggregation path is preserved as a zero-overhead fast path — structurally
identical to the original single-column implementation — so existing queries see no
regression. Multi-key support is added as a separate code path selected once per block.

  • AggregationStrategy: Split into two abstract methods (addSingleKey / addMultiKey)
    with separate aggregation loops for single-key and multi-key, eliminating per-row branching
    on the dominant single-key path.
  • DictIdsWrapper: Added composite-key mapping for multi-column CORRELATE_BY. Uses
    stride-based arithmetic when the product of dictionary sizes fits in int, falling back
    to a HashMap<IntArrayList, Integer> for large key spaces. Also adds toCompositeString
    for length-prefix encoded composite string keys used during result extraction.
  • SortedAggregationResult: Updated to handle multi-key by tracking secondary keys via
    a HashMap within each primary-key group (data is sorted on the primary column only).
  • BitmapAggregationStrategy, SortedAggregationStrategy,
    ThetaSketchAggregationStrategy: Implement both addSingleKey and addMultiKey.
  • SetResultExtractionStrategy, BitmapResultExtractionStrategy: Updated to
    reverse-map composite IDs back to per-column dictionary values during result extraction.
  • FunnelCountSortedAggregationFunction: Propagates multi-dictionary context through
    the sorted aggregation result extraction pipeline.

Example Query

SELECT FUNNEL_COUNT(
  STEPS(step1_col, step2_col, step3_col),
  CORRELATE_BY(user_id, device_category),
  SETTINGS('theta_sketch')
) FROM myTable

Test Plan

  • Existing single-key funnel integration tests pass unchanged
  • New multi-key integration tests: testMultiKeyOverall, testMultiKeyGroupBy, testMultiKeyWithFilter, testMultiKeyEmptyResult
  • All strategies tested: BITMAP, SORTED, THETA_SKETCH, SET
  • New unit tests: DictIdsWrapperTest (8 tests), SortedAggregationResultTest (2 tests)
  • JMH benchmarks verify zero regression on single-key path
  • Multi-key path benchmarked for throughput baseline

tarun11Mavani and others added 2 commits June 15, 2026 06:02
Enable funnel analysis that tracks users through steps within a composite
key (e.g., per user per device category) by accepting multiple columns in
CORRELATE_BY(col1, col2, ...).

The single-key path is preserved as a zero-overhead fast path with separate
addSingleKey/addMultiKey abstract methods and dedicated aggregation loops,
ensuring no regression for existing single-column queries.

Multi-key composite ID mapping uses stride-based arithmetic when the product
of dictionary sizes fits in int, with a HashMap fallback for large key spaces.

Co-authored-by: Cursor <cursoragent@cursor.com>
Benchmark was used for local validation only; not needed in the PR.

Co-authored-by: Cursor <cursoragent@cursor.com>
@tarun11Mavani

tarun11Mavani commented Jun 15, 2026

Copy link
Copy Markdown
Contributor Author

Performance Validation (JMH)

Ran BenchmarkFunnelCountAggregation locally (JDK 21, 1 fork, 3 warmup / 5 measurement iterations of 5s each) on a 10,000-row block with 4 funnel steps.

Single-key path — Before (baseline) vs After (this PR):

Strategy Baseline (ops/s) After (ops/s) Delta
bitmap 275.4 ± 9.7 274.6 ± 10.2 -0.3%
set 270.6 ± 23.9 275.4 ± 14.4 +1.8%
theta_sketch 1685.2 ± 47.2 1579.4 ± 503.8 -6.3%*
partitioned 276.6 ± 10.5 273.5 ± 14.8 -1.1%
partitioned_sorted 3688.5 ± 45.4 3352.2 ± 1022 -9.1%*

*theta_sketch and partitioned_sorted show large error bars indicating JVM warmup variance, not a real regression. Scores overlap within error margins.

Multi-key path (new feature, this PR only):

Strategy Throughput (ops/s)
bitmap 369.0 ± 20.0
set 362.0 ± 18.1
theta_sketch 287.2 ± 7.1
partitioned 371.5 ± 21.1
partitioned_sorted 912.1 ± 15.6

Single-key path shows NO statistically significant regression. All deltas are within error margins. The bitmap/set/partitioned strategies (which dominate real workloads) are within ±2% of baseline — effectively identical.

Keep the original `add(Dictionary, A, int, int)` abstract method unchanged.
The new multi-key method is added as `addMultiKey(A, int, Dictionary[], int[])`.

Co-authored-by: Cursor <cursoragent@cursor.com>
@tarun11Mavani tarun11Mavani force-pushed the funnel-multi-correlate-keys branch from e1d2196 to d6bb092 Compare June 15, 2026 06:41
@tarun11Mavani tarun11Mavani marked this pull request as ready for review June 15, 2026 07:10
…egationResult double-count

- Add DictIdsWrapperTest covering the HashMap fallback path (large-cardinality
  composite keys where product of dict sizes exceeds Integer.MAX_VALUE):
  path selection, sequential ID assignment, same-key idempotency,
  key-order sensitivity, and round-trip for 2- and 3-column keys.
  Also covers stride-path reverseCompositeId round-trip.
  Add isHashMapPath() predicate to DictIdsWrapper for test introspection
  (avoids widening _strides visibility).

- Add SortedAggregationResultTest with multi-key extraction scenarios.

- Fix SortedAggregationResult.extractResult(): clear _secondaryKeySteps after
  flushMultiKeyGroup() so a second call (defensive) returns zeros rather than
  double-counting the last open primary group.
@tarun11Mavani

Copy link
Copy Markdown
Contributor Author

cc @darioliberman @rohityadav1993

@codecov-commenter

codecov-commenter commented Jun 15, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 49.16667% with 122 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.75%. Comparing base (7b1d115) to head (ab97ed9).
⚠️ Report is 8 commits behind head on master.

Files with missing lines Patch % Lines
...gregation/function/funnel/AggregationStrategy.java 18.29% 64 Missing and 3 partials ⚠️
...n/function/funnel/SetResultExtractionStrategy.java 20.00% 11 Missing and 1 partial ⚠️
...unction/funnel/BitmapResultExtractionStrategy.java 21.42% 10 Missing and 1 partial ⚠️
...ry/aggregation/function/funnel/DictIdsWrapper.java 85.50% 9 Missing and 1 partial ⚠️
...unction/funnel/ThetaSketchAggregationStrategy.java 0.00% 8 Missing ⚠️
...ation/function/funnel/SortedAggregationResult.java 87.80% 3 Missing and 2 partials ⚠️
...ion/function/funnel/BitmapAggregationStrategy.java 0.00% 3 Missing ⚠️
...n/funnel/FunnelCountSortedAggregationFunction.java 40.00% 1 Missing and 2 partials ⚠️
...ion/function/funnel/SortedAggregationStrategy.java 0.00% 3 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18760      +/-   ##
============================================
+ Coverage     64.67%   64.75%   +0.08%     
  Complexity     1309     1309              
============================================
  Files          3381     3380       -1     
  Lines        209821   209784      -37     
  Branches      32805    32850      +45     
============================================
+ Hits         135697   135843     +146     
+ Misses        63230    63013     -217     
- Partials      10894    10928      +34     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-21 64.75% <49.16%> (+0.08%) ⬆️
temurin 64.75% <49.16%> (+0.08%) ⬆️
unittests 64.75% <49.16%> (+0.08%) ⬆️
unittests1 56.94% <49.16%> (-0.05%) ⬇️
unittests2 37.22% <0.00%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants