Bound FST regexp traversal and fall back to scan when the limit is exceeded by xiangfu0 · Pull Request #18759 · apache/pinot

xiangfu0 · 2026-06-15T05:28:02Z

Problem

A REGEXP_LIKE backed by an FST/IFST index walks the query automaton over the FST. A pattern that does not prune — anything with a leading .* — gives the prefix walk nothing to descend into, so it traverses essentially the whole FST (~cardinality paths) and allocates proportionally. A single such query is fine, but a burst of them saturates server heap (the EpicGames OOM). The companion fix #18754 (merged) removed the retained-result overhead and made the walk interruptible; this PR bounds the walk itself.

Benchmarks (default java.util.regex scan, 500K keys × 40 chars): for broad patterns the dictionary scan is faster and allocates ~190× less than the FST walk, while for selective prefix/fuzzy patterns the FST is ~1600× faster. So the right move is to cap the FST work and fall back to scan only when the walk is actually broad.

Approach

The matcher reports completion instead of throwing — no exception is used as control flow:

RegexpMatcher / RegexpMatcherCaseInsensitive return false once maxPaths FST paths have been visited (walk abandoned), or true if it completed.
TextIndexReader gains a backward-compatible @Nullable getDictIds(String, int maxTraversalPaths) default method (delegates to the single-arg version, so other readers are unaffected). The two Lucene readers return null when the walk is abandoned.
The FST/IFST predicate factories return null when the reader returns null, and PredicateEvaluatorProvider then falls back to the existing dictionary-scan evaluator, which is correct and typically far cheaper in memory for broad patterns. Selective patterns visit a small subtree, never trip the cap, and keep the FST's large speed advantage.
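
The flow described in the three points above can be sketched roughly as follows. This is a toy illustration, not the PR's code: a plain trie stands in for the Lucene FST, a literal prefix plus java.util.regex stands in for the automaton intersection, and every name here (BoundedWalkSketch, match, evaluate, and so on) is hypothetical. Whether the real matcher stops at exactly maxPaths visits or one past it is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.IntConsumer;
import java.util.regex.Pattern;

public class BoundedWalkSketch {
  /** Toy trie node standing in for an FST state; dictId >= 0 marks a stored key. */
  static final class Node {
    final TreeMap<Character, Node> children = new TreeMap<>();
    int dictId = -1;
  }

  /** Builds the trie from a sorted dictionary; the array index is the dict id. */
  static Node buildTrie(String[] sortedKeys) {
    Node root = new Node();
    for (int id = 0; id < sortedKeys.length; id++) {
      Node n = root;
      for (char c : sortedKeys[id].toCharArray()) {
        n = n.children.computeIfAbsent(c, k -> new Node());
      }
      n.dictId = id;
    }
    return root;
  }

  /** Returns true if the walk completed, false if it was abandoned at the cap. */
  static boolean match(Node root, String literalPrefix, Pattern p, int maxPaths, IntConsumer dest) {
    Node n = root;
    for (char c : literalPrefix.toCharArray()) {
      n = n.children.get(c);
      if (n == null) {
        return true; // nothing stored under the prefix: the walk is trivially complete
      }
    }
    return walk(n, new StringBuilder(literalPrefix), p, maxPaths, new int[1], dest);
  }

  private static boolean walk(Node n, StringBuilder key, Pattern p, int maxPaths, int[] visited,
      IntConsumer dest) {
    // A non-positive cap disables the bound (assumed representation).
    if (maxPaths > 0 && ++visited[0] > maxPaths) {
      return false; // report "abandoned" instead of throwing
    }
    if (n.dictId >= 0 && p.matcher(key).matches()) {
      dest.accept(n.dictId);
    }
    for (Map.Entry<Character, Node> e : n.children.entrySet()) {
      key.append(e.getKey());
      boolean completed = walk(e.getValue(), key, p, maxPaths, visited, dest);
      key.setLength(key.length() - 1);
      if (!completed) {
        return false;
      }
    }
    return true;
  }

  /** Reader-level view: null means the walk was abandoned. */
  static List<Integer> getDictIds(Node root, String prefix, Pattern p, int maxPaths) {
    List<Integer> ids = new ArrayList<>();
    return match(root, prefix, p, maxPaths, ids::add) ? ids : null;
  }

  /** Dictionary scan: test every key against the regex. */
  static List<Integer> scan(String[] dict, Pattern p) {
    List<Integer> ids = new ArrayList<>();
    for (int id = 0; id < dict.length; id++) {
      if (p.matcher(dict[id]).matches()) {
        ids.add(id);
      }
    }
    return ids;
  }

  /** Provider-level view: use the index result, or fall back to scan on null. */
  static List<Integer> evaluate(String[] dict, Node root, String prefix, Pattern p, int maxPaths) {
    List<Integer> ids = getDictIds(root, prefix, p, maxPaths);
    return ids != null ? ids : scan(dict, p);
  }

  public static void main(String[] args) {
    String[] dict = {"apple", "apricot", "banana", "berry", "cherry"};
    Node root = buildTrie(dict);
    // Broad pattern with a tiny budget: the walk is abandoned and the scan answers.
    System.out.println(evaluate(dict, root, "", Pattern.compile(".*rry"), 3));
    // Selective prefix: only the "ch" subtree is visited, so the cap is never hit.
    System.out.println(evaluate(dict, root, "ch", Pattern.compile("ch.*"), 10));
  }
}
```

The point of the boolean/null signalling is that the caller decides what to do on an abandoned walk; the matcher itself never needs to know that a scan fallback exists.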

Default cap = column cardinality

The fstRegexpTraversalLimit query option overrides the cap with an absolute value (non-positive disables it); when unset it defaults to the column cardinality (dictionary length). This is the natural, self-tuning unit: an experiment across cardinalities shows a selective query visits paths proportional to its match count (≈150 paths for 100 matches, independent of cardinality), while a non-pruning .* walk visits ≈1.1× the cardinality. So capping at cardinality makes a non-pruning pattern fall back to scan (where scan is the better tool anyway) while every selective prefix/fuzzy query stays on the FST — at any column size, with no magic constant.

cardinality	pattern	matches	paths visited
1M	`.*` (matchAll)	1,000,000	1,111,145
1M	`.*<rare-suffix>` (leading `.*`)	100	1,111,145
1M	selective prefix	100	149
1M	exact	1	41
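
As a rough sketch of the rule stated above (option unset means cardinality, non-positive means unbounded, positive means that absolute value), the effective cap could be resolved as below. The class and method names are hypothetical, and representing "disabled" as Integer.MAX_VALUE and treating "trips the cap" as "visits more paths than the cap" are assumptions made for illustration.

```java
public class TraversalLimitSketch {
  /** Assumed representation of "no bound"; the PR may represent this differently. */
  static final int UNBOUNDED = Integer.MAX_VALUE;

  /**
   * Resolves the effective cap.
   * optionValue == null  -> default to column cardinality (dictionary length)
   * optionValue <= 0     -> bound disabled
   * optionValue > 0      -> absolute override
   */
  static int resolveLimit(Integer optionValue, int cardinality) {
    if (optionValue == null) {
      return cardinality;
    }
    return optionValue > 0 ? optionValue : UNBOUNDED;
  }

  /** True if a walk visiting this many paths would be abandoned under the cap. */
  static boolean wouldFallBack(long pathsVisited, int limit) {
    return pathsVisited > limit;
  }

  public static void main(String[] args) {
    int cardinality = 1_000_000;
    int limit = resolveLimit(null, cardinality);
    // Path counts taken from the 1M-cardinality rows of the table above.
    System.out.println("matchAll falls back:        " + wouldFallBack(1_111_145L, limit));
    System.out.println("selective prefix falls back: " + wouldFallBack(149L, limit));
    System.out.println("exact falls back:            " + wouldFallBack(41L, limit));
  }
}
```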

No on-disk format change; the SPI addition is a default method (rolling-upgrade safe); the new query option is additive.
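
To illustrate why a default method keeps existing readers working, here is a minimal sketch. The Reader interface below is a hypothetical stand-in for TextIndexReader, and int[] is used as a placeholder return type only to keep the example self-contained.

```java
import java.util.Arrays;

public class DefaultOverloadSketch {
  /** Hypothetical stand-in for the SPI interface; not the real signature. */
  interface Reader {
    int[] getDictIds(String query);

    /** Bounded overload; null would mean "walk abandoned". The default ignores the cap. */
    default int[] getDictIds(String query, int maxTraversalPaths) {
      return getDictIds(query);
    }
  }

  /** A reader written before the overload existed: still compiles, behaves as before. */
  static final class LegacyReader implements Reader {
    @Override
    public int[] getDictIds(String query) {
      return new int[] {1, 2, 3};
    }
  }

  /** A reader that honours the cap, pretending every query would visit 10 paths. */
  static final class BoundedReader implements Reader {
    @Override
    public int[] getDictIds(String query) {
      return new int[] {1, 2, 3};
    }

    @Override
    public int[] getDictIds(String query, int maxTraversalPaths) {
      boolean abandoned = maxTraversalPaths > 0 && 10 > maxTraversalPaths;
      return abandoned ? null : getDictIds(query);
    }
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(new LegacyReader().getDictIds("a.*", 1)));
    System.out.println(Arrays.toString(new BoundedReader().getDictIds("a.*", 1)));
    System.out.println(Arrays.toString(new BoundedReader().getDictIds("a.*", 0)));
  }
}
```

Callers that use the bounded overload must be prepared for null, which is why the overload is annotated @Nullable in the description above.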

Testing

Matcher / reader (FSTBuilderTest, IFSTBuilderTest): a tiny budget abandons the walk (matcher returns false, reader returns null); an unbounded walk completes; a cap equal to the cardinality makes a .* walk fall back (null) while a selective prefix stays on the FST; a non-positive cap disables the bound.
End-to-end (FSTBasedRegexpLikeQueriesTest): a query with fstRegexpTraversalLimit = 1 forces the scan fallback and returns results identical to the FST path (256 and 512 rows), proving fallback correctness.

codecov-commenter · 2026-06-15T06:32:33Z

Codecov Report

❌ Patch coverage is 78.18182% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.77%. Comparing base (5617ee7) to head (81e2a62).
⚠️ Report is 10 commits behind head on master.

Files with missing lines	Patch %	Lines
...e/pinot/segment/local/utils/fst/RegexpMatcher.java	50.00%	5 Missing ⚠️
.../local/utils/fst/RegexpMatcherCaseInsensitive.java	50.00%	5 Missing ⚠️
...r/filter/predicate/PredicateEvaluatorProvider.java	92.85%	0 Missing and 1 partial ⚠️
...inot/segment/spi/index/reader/TextIndexReader.java	0.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff            @@
##             master   #18759   +/-   ##
=========================================
  Coverage     64.76%   64.77%           
  Complexity     1309     1309           
=========================================
  Files          3380     3380           
  Lines        209573   209587   +14     
  Branches      32805    32809    +4     
=========================================
+ Hits         135735   135755   +20     
+ Misses        62914    62904   -10     
- Partials      10924    10928    +4

Flag	Coverage Δ
custom-integration1	`100.00% <ø> (ø)`
integration	`100.00% <ø> (ø)`
integration1	`100.00% <ø> (ø)`
integration2	`0.00% <ø> (ø)`
java-21	`64.77% <78.18%> (+<0.01%)`	⬆️
temurin	`64.77% <78.18%> (+<0.01%)`	⬆️
unittests	`64.76% <78.18%> (+<0.01%)`	⬆️
unittests1	`56.93% <63.63%> (-0.02%)`	⬇️
unittests2	`37.26% <29.09%> (-0.01%)`	⬇️

xiangfu0 · 2026-06-15T07:26:11Z

CI fix

FilterPlanNodeTest mocked TextIndexReader.getDictIds(String), but the FST/IFST evaluators now call the bounded getDictIds(String, int) overload — the mock returned null for the new method, NPE-ing the evaluator. Stubbed the bounded overload in the test's mockTextIndexReader(). (All 9 FilterPlanNodeTest cases pass locally.)

How the default (100,000) was chosen

Added FstTraversalLimitExperiment (pinot-perf) which, across cardinalities, reports the exact number of FST paths each pattern visits (the unit the cap is in) alongside FST vs scan latency. 40-char keys, default java.util.regex:

cardinality	pattern	matches	paths visited	fst (ms)	scan (ms)
10K	selectivePrefix	100	149	0.24	0.73
10K	matchAll `.*`	10,000	11,147	0.93	0.95
100K	selectivePrefix	100	149	0.13	0.93
100K	matchAll `.*`	100,000	111,146	5.07	5.29
1M	exact	1	41	0.01	14.1
1M	selectivePrefix	100	149	0.15	13.5
1M	broadSuffix (leading `.*`)	100	1,111,145	38.2	75.7
1M	matchAll `.*`	1,000,000	1,111,145	40.4	52.3

Takeaways that justify 100,000:

A selective query's path count tracks its match count, not column cardinality (~150 paths for 100 matches at every cardinality), so 100K leaves a margin of several hundred× (100,000 / 149 ≈ 670) and never trips selective prefix/fuzzy queries.
A full/leading-.* walk visits ~1.1× cardinality paths (note broadSuffix with 100 matches visits the same as matchAll — a leading .* always walks the whole FST). 100K trips these once cardinality exceeds ~90K, exactly where the FST walk stops being cheaper than a scan.
The transient allocation of a capped walk is bounded to ~15 MB.
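
The ~90K figure in the takeaways above follows from the measured paths-per-key ratio. A small worked calculation, using only the numbers from the experiment table (class and method names are hypothetical):

```java
public class FixedCapThresholdSketch {
  /** Paths visited per dictionary key for a non-pruning walk, from one measured row. */
  static double pathsPerKey(long pathsVisited, long cardinality) {
    return (double) pathsVisited / cardinality;
  }

  /** Cardinality above which a non-pruning walk is expected to exceed a fixed cap. */
  static double trippingCardinality(int cap, double pathsPerKey) {
    return cap / pathsPerKey;
  }

  public static void main(String[] args) {
    // 1M-cardinality matchAll row: 1,111,145 paths visited.
    double ratio = pathsPerKey(1_111_145L, 1_000_000L);
    System.out.printf("paths per key: %.3f%n", ratio);
    System.out.printf("a fixed cap of 100,000 trips above ~%.0f keys%n",
        trippingCardinality(100_000, ratio));
  }
}
```

The 10K and 100K rows give nearly the same ratio (11,147 / 10,000 and 111,146 / 100,000), so the threshold does not depend much on which row is used.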

Caveat: the fallback's latency depends on the scan regex engine. broadSuffix here ended in 0000, which java.util.regex finds instantly; a leading-.* pattern matching a rare substring backtracks much harder, and the scan fallback can then be several× slower than the FST walk. Environments that rely on broad rare-substring regexes should enable RE2J (pinot.server.query.regex.class) to keep the fallback fast.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a bounded FST/IFST-backed REGEXP_LIKE traversal with a configurable path budget, falling back to dictionary scan when the traversal is too broad to be memory-efficient.

Changes:

Introduces query option fstRegexpTraversalLimit (default 100_000) and plumbing to read it.
Extends TextIndexReader with a bounded getDictIds(String, int) overload and wires the traversal limit through FST/IFST regexp evaluators.
Updates regexp matchers/readers/tests and adds perf experiments/benchmarks for limit selection and validation.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
pinot-spi/src/main/java/org/apache/pinot/spi/utils/CommonConstants.java	Adds new query option key/value default for traversal limit.
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/reader/TextIndexReader.java	Adds bounded `getDictIds` overload with default implementation.
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/utils/fst/IFSTBuilderTest.java	Adds termination + traversal-limit tests for IFST matcher/reader.
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/utils/fst/FSTBuilderTest.java	Adds termination + traversal-limit tests for FST matcher/reader.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/fst/RegexpMatcherCaseInsensitive.java	Refactors matcher to stream results + enforce traversal cap.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/fst/RegexpMatcher.java	Refactors matcher to stream results + enforce traversal cap.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/fst/FSTTraversalLimitExceededException.java	Introduces unchecked exception used as a “fall back to scan” signal.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/LuceneIFSTIndexReader.java	Uses bounded matcher API and propagates traversal-limit/termination exceptions.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/LuceneFSTIndexReader.java	Uses bounded matcher API and propagates traversal-limit/termination exceptions.
pinot-perf/src/main/java/org/apache/pinot/perf/FstTraversalLimitExperiment.java	Adds exploratory experiment for choosing a sensible default limit.
pinot-perf/src/main/java/org/apache/pinot/perf/BenchmarkFSTRegexpMatcher.java	Adds JMH benchmark harness for FST/IFST regexp matching.
pinot-core/src/test/java/org/apache/pinot/queries/FSTBasedRegexpLikeQueriesTest.java	Adds end-to-end test validating scan fallback correctness.
pinot-core/src/test/java/org/apache/pinot/core/plan/FilterPlanNodeTest.java	Updates mock to stub bounded `getDictIds` overload.
pinot-core/src/main/java/org/apache/pinot/core/operator/filter/predicate/PredicateEvaluatorProvider.java	Reads traversal limit option and falls back to scan on limit exception.
pinot-core/src/main/java/org/apache/pinot/core/operator/filter/predicate/IFSTBasedRegexpPredicateEvaluatorFactory.java	Passes traversal limit into bounded `getDictIds`.
pinot-core/src/main/java/org/apache/pinot/core/operator/filter/predicate/FSTBasedRegexpPredicateEvaluatorFactory.java	Passes traversal limit into bounded `getDictIds`.
pinot-common/src/main/java/org/apache/pinot/common/utils/config/QueryOptionsUtils.java	Adds query option accessor for `fstRegexpTraversalLimit`.

Copilot

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.

A REGEXP_LIKE backed by an FST/IFST index walks the query automaton over the FST. A broad pattern (anything with a leading `.*`) over a high-cardinality column gives the prefix walk no pruning, so it visits a large fraction of the FST and allocates proportionally -- a per-query heap spike that, under concurrency, drives server OOMs. The live frontier is already bounded (DFS), but the allocation throughput is not.

Bound it: the matchers now accept a `maxPaths` cap and throw FSTTraversalLimitExceededException once that many paths are visited. The FST/IFST readers expose a `getDictIds(query, maxTraversalPaths)` overload (new default method on TextIndexReader) and let the signal propagate unwrapped. PredicateEvaluatorProvider passes the configured limit and, on breach, falls back to the existing dictionary-scan evaluator -- which is correct and, for broad patterns, typically far cheaper in memory (benchmarks show scan allocates ~190x less than FST for `.*`). Selective prefix/fuzzy queries visit a small subtree and never trip the cap, so the index keeps its large win there.

The cap is configurable via the `fstRegexpTraversalLimit` query option, defaulting to 100,000 visited paths.

Tests: matcher-level limit (broad walk throws, generous limit completes, selective query stays under a modest budget) for both FST and IFST; and an end-to-end query test asserting that a tiny limit forces the scan fallback and returns results identical to the FST path.

xiangfu0 · 2026-06-17T05:31:23Z

Closing as not needed.

xiangfu0 force-pushed the claude/fst-regexp-traversal-limit branch 2 times, most recently from 4ad85c5 to 044bddd Compare June 15, 2026 07:25

xiangfu0 requested review from Jackie-Jiang, Copilot and deepthi912 June 15, 2026 07:31

xiangfu0 added performance Related to performance optimization index Related to indexing (general) oom-protection Related to out-of-memory protection mechanisms labels Jun 15, 2026

Copilot AI reviewed Jun 15, 2026

Copilot started reviewing on behalf of xiangfu0 June 15, 2026 07:41 View session

xiangfu0 requested a review from raghavyadav01 June 15, 2026 07:58

xiangfu0 force-pushed the claude/fst-regexp-traversal-limit branch 2 times, most recently from 52dc12d to f5357e4 Compare June 15, 2026 08:14

Jackie-Jiang reviewed Jun 15, 2026

Comment thread pinot-perf/src/main/java/org/apache/pinot/perf/FstTraversalLimitExperiment.java Outdated

Comment thread pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/fst/RegexpMatcher.java Outdated

xiangfu0 force-pushed the claude/fst-regexp-traversal-limit branch 2 times, most recently from 33cf491 to 0ffb895 Compare June 15, 2026 20:21

xiangfu0 requested review from Jackie-Jiang and Copilot June 15, 2026 20:48

Copilot started reviewing on behalf of xiangfu0 June 15, 2026 20:48 View session

Copilot AI reviewed Jun 15, 2026

Comment thread pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/fst/RegexpMatcher.java Outdated

Comment thread ...cal/src/main/java/org/apache/pinot/segment/local/utils/fst/RegexpMatcherCaseInsensitive.java Outdated

xiangfu0 force-pushed the claude/fst-regexp-traversal-limit branch from 0ffb895 to 81e2a62 Compare June 15, 2026 20:56

xiangfu0 closed this Jun 17, 2026

xiangfu0 wants to merge 1 commit into apache:master from xiangfu0:claude/fst-regexp-traversal-limit
