[SPARK-54593][SQL] Enable DPP for materialized filtering sides by sunchao · Pull Request #56071 · apache/spark

sunchao · 2026-05-23T02:08:31Z

What changes were proposed in this pull request?

This PR extends dynamic partition pruning (DPP) eligibility for small, already
materialized filtering sides:

LocalRelation, which represents locally available rows.
LogicalRDD produced by checkpoint() or localCheckpoint().

Checkpoint-created LogicalRDDs carry an explicit marker so that DPP is not
enabled for arbitrary LogicalRDD inputs that may require recomputing an
upstream query. This also keeps recursive CTE and foreachBatch-constructed
inputs outside the new eligibility rule.

This supersedes the unmerged approach in #53324 as well as #53263 with narrower LogicalRDD
handling while addressing SPARK-54593.

Why are the changes needed?

DPP currently requires a filtering predicate in the build-side logical plan.
When a small filtering side is already materialized as a LocalRelation or a
checkpointed LogicalRDD, that predicate is no longer present, so Spark misses
partition pruning opportunities.

This occurs for joins where a partition expression is matched to a small set of
keys, for example concat_ws("||", hour, category) = hc_key. Although the
expression is composed only from partition columns, the partitioned scan is not
dynamically pruned when the filtering side is materialized.

Does this PR introduce any user-facing change?

Yes. Queries joining a partitioned file-source table with a small
LocalRelation or checkpointed filtering side may now perform dynamic
partition pruning and scan fewer partitions. There is no API change.

How was this patch tested?

Added positive coverage for DPP using a LocalRelation build side with an
expression over partition columns.
Added positive coverage for DPP using a localCheckpoint() build side with
the same expression form.
Added negative coverage confirming that a non-checkpointed LogicalRDD does
not become DPP-eligible.
Ran build/sbt 'sql/testOnly org.apache.spark.sql.DynamicPartitionPruningV1SuiteAEOn org.apache.spark.sql.DynamicPartitionPruningV1SuiteAEOff'.
Ran build/sbt sql/scalastyle sql/Test/scalastyle.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: OpenAI Codex

MaxGekk

1 blocking, 2 non-blocking.
One design gap (InMemoryRelation exclusion); two comment-only fixes.

Design / architecture (1)

PartitionPruning.scala:208: InMemoryRelation excluded without explanation — see inline

Correctness (2)

PartitionPruning.scala:221: "with known statistics" qualifier not enforced — see inline
PartitionPruning.scala:203: "carrying source statistics" mischaracterizes the criterion — see inline

Verification

Traced the eligibility gate change: extending hasSelectivePredicateOrMaterializedInput to LocalRelation and LogicalRDD(isCheckpointedInput) is safe. The prune() method's transformUp has no double-substitution risk (the inserted Filter is not re-visited). The existing safety gates — onlyInBroadcast fallback to TrueLiteral and canReuseExchange broadcast-match in PlanDynamicPruningFilters — are unchanged; with the default REUSE_BROADCAST_ONLY=true, a non-broadcast input degrades to no pruning, which is semantically identical to the original plan.

MaxGekk · 2026-06-10T11:18:17Z

+   * as a broadcast build side for DPP without evaluating an upstream computation again.
   */
-  private def hasSelectivePredicate(plan: LogicalPlan): Boolean = {
+  private def hasSelectivePredicateOrMaterializedInput(plan: LogicalPlan): Boolean = {


InMemoryRelation (the result of df.cache() / df.persist()) is also a materialized input that does not require upstream recomputation, yet it is excluded from this check. The same file already treats InMemoryRelation as zero-overhead in calculatePlanOverhead, so it is acknowledged as already-materialized for cost purposes.

In practice: sampledKeys.cache() followed by a partitioned join does NOT get DPP without a filter; sampledKeys.localCheckpoint() in the same pattern DOES — despite both being materialized. Is this intentional?

If so, the Scaladoc should document why (one plausible rationale: InMemoryRelation can be evicted under memory pressure and re-executed, while a checkpoint is immutable for the query's lifetime). If not, adding case _: InMemoryRelation => true would make the behavior consistent.

MaxGekk · 2026-06-10T11:18:17Z

   * meet the following requirements:
   *   (1) it can not be a stream
-   *   (2) it needs to contain a selective predicate used for filtering
+   *   (2) it needs to contain a selective predicate or a materialized input with known statistics


"with known statistics" is not enforced by the code — the actual gate is isCheckpointedInput, a boolean flag that is independent of statistics availability. A checkpointed LogicalRDD with originStats = None still qualifies (falling back to defaultSizeInBytes in computeStats()). The qualifier is misleading.

Suggested change

* (2) it needs to contain a selective predicate or a materialized input with known statistics

* (2) it needs to contain a selective predicate or a materialized input

MaxGekk · 2026-06-10T11:18:17Z


  /**
-   * Search a filtering predicate in a given logical plan
+   * Search for a selective filtering operation or a precomputed input carrying source statistics.


"carrying source statistics" mischaracterizes the criterion. The function checks materialization (LocalRelation holds in-memory rows; LogicalRDD.isCheckpointedInput flags a checkpoint), not statistics quality. The second sentence already captures the real property correctly: "without evaluating an upstream computation again."

Suggested change

* Search for a selective filtering operation or a precomputed input carrying source statistics.

* Search for a selective filtering operation or a precomputed, materialized input.

MaxGekk

3 addressed, 0 remaining, 0 new.
All three prior-round comment findings on PartitionPruning.scala are resolved: the InMemoryRelation exclusion is now documented (eviction -> upstream recomputation), and the two mischaracterizing doc phrases ("with known statistics", "carrying source statistics") were rewritten to describe materialization rather than statistics.

Verification

Traced the widened eligibility gate. The newly eligible inputs -- LocalRelation and checkpoint-derived LogicalRDD (isCheckpointedInput) -- are deterministic materialized snapshots, so the DPP semi-join filter is redundant with the join's own key match and produces identical results. prune()'s transformUp is unchanged (no double-substitution; the inserted Filter is not re-matched). The REUSE_BROADCAST_ONLY=true default still degrades a non-broadcast filtering side to no pruning (PartitionPruning.scala:123). The fromCheckpoint marker is propagated across every copy path (newInstance/withStream/otherCopyArgs/fromDataset), set true only in Dataset.checkpoint, and explicitly false in UnionLoopExec, so recursive-CTE and arbitrary LogicalRDDs stay ineligible.

MaxGekk · 2026-06-13T19:22:49Z

+1, LGTM. Merging to master/4.x.
Thank you, @sunchao.

### What changes were proposed in this pull request? This PR extends dynamic partition pruning (DPP) eligibility for small, already materialized filtering sides: - `LocalRelation`, which represents locally available rows. - `LogicalRDD` produced by `checkpoint()` or `localCheckpoint()`. Checkpoint-created `LogicalRDD`s carry an explicit marker so that DPP is not enabled for arbitrary `LogicalRDD` inputs that may require recomputing an upstream query. This also keeps recursive CTE and `foreachBatch`-constructed inputs outside the new eligibility rule. This supersedes the unmerged approach in #53324 with narrower `LogicalRDD` handling while addressing SPARK-54593. ### Why are the changes needed? DPP currently requires a filtering predicate in the build-side logical plan. When a small filtering side is already materialized as a `LocalRelation` or a checkpointed `LogicalRDD`, that predicate is no longer present, so Spark misses partition pruning opportunities. This occurs for joins where a partition expression is matched to a small set of keys, for example `concat_ws("||", hour, category) = hc_key`. Although the expression is composed only from partition columns, the partitioned scan is not dynamically pruned when the filtering side is materialized. ### Does this PR introduce _any_ user-facing change? Yes. Queries joining a partitioned file-source table with a small `LocalRelation` or checkpointed filtering side may now perform dynamic partition pruning and scan fewer partitions. There is no API change. ### How was this patch tested? - Added positive coverage for DPP using a `LocalRelation` build side with an expression over partition columns. - Added positive coverage for DPP using a `localCheckpoint()` build side with the same expression form. - Added negative coverage confirming that a non-checkpointed `LogicalRDD` does not become DPP-eligible. - Ran `build/sbt 'sql/testOnly org.apache.spark.sql.DynamicPartitionPruningV1SuiteAEOn org.apache.spark.sql.DynamicPartitionPruningV1SuiteAEOff'`. - Ran `build/sbt sql/scalastyle sql/Test/scalastyle`. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: OpenAI Codex Closes #56071 from sunchao/dev/chao/codex/dpp-materialized-build-side. Authored-by: Chao Sun <sunchao@apache.org> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 0255283) Signed-off-by: Max Gekk <max.gekk@gmail.com>

dwsmith1983 · 2026-06-14T04:04:40Z

Hi @sunchao and @MaxGekk , thanks for getting this merged.

A note on attribution. The core move in this PR, making hasSelectivePredicate return true for an already-materialized leaf that has no Filter, was first introduced in #53263 (SPARK-54554) on Nov 29, 2025 (CommandResult), and extended to LocalRelation and LogicalRDD in #53324 (SPARK-54593). This PR references #53324 but not #53263, which is the earlier PR where the mechanism and the materialized-input rationale originated.

Two small asks. First, could #53263 / SPARK-54554 be referenced alongside #53324 so the lineage is on the record? Second, given the merged work generalizes the approach from those PRs, would a co-author acknowledgment for #53263 (co-authored with @mc8max and @dwsmith1983) be appropriate, whether on the Jira or in a follow-up note? I understand the commit itself is already in history; even a reference on the ticket would be great.

Thanks for considering it.

Prior PRs:

Jira:

sunchao · 2026-06-15T15:00:09Z

Thanks @MaxGekk for reviewing and merging this!!

@dwsmith1983 sure! I've updated the PR description to include #53263 . Given the PR is already committed, it's hard to update and add extra co-authors to it, but I'll add both of you if there are follow-up PRs to this, which is very likely.

sunchao marked this pull request as ready for review May 23, 2026 17:27

MaxGekk reviewed Jun 10, 2026

View reviewed changes

sunchao added 2 commits June 11, 2026 18:28

[SPARK-54593][SQL] Enable DPP for materialized filtering sides

6997a46

[SPARK-54593][SQL] Clarify DPP eligibility for materialized inputs

ad9d6e9

sunchao force-pushed the dev/chao/codex/dpp-materialized-build-side branch from 290e23d to ad9d6e9 Compare June 12, 2026 01:32

MaxGekk approved these changes Jun 13, 2026

View reviewed changes

MaxGekk closed this in 0255283 Jun 13, 2026

sunchao mentioned this pull request Jun 16, 2026

[SPARK-54593][SQL] Fix DPP eligibility for materialized filtering sides #56535

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-54593][SQL] Enable DPP for materialized filtering sides#56071

[SPARK-54593][SQL] Enable DPP for materialized filtering sides#56071
sunchao wants to merge 2 commits into
apache:masterfrom
sunchao:dev/chao/codex/dpp-materialized-build-side

sunchao commented May 23, 2026 •

edited

Loading

Uh oh!

MaxGekk left a comment

Uh oh!

MaxGekk Jun 10, 2026

Uh oh!

MaxGekk Jun 10, 2026

Uh oh!

MaxGekk Jun 10, 2026

Uh oh!

MaxGekk left a comment

Uh oh!

MaxGekk commented Jun 13, 2026

Uh oh!

dwsmith1983 commented Jun 14, 2026

Uh oh!

sunchao commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	* (2) it needs to contain a selective predicate or a materialized input with known statistics
	* (2) it needs to contain a selective predicate or a materialized input

	* Search for a selective filtering operation or a precomputed input carrying source statistics.
	* Search for a selective filtering operation or a precomputed, materialized input.

Conversation

sunchao commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

MaxGekk left a comment

Choose a reason for hiding this comment

Design / architecture (1)

Correctness (2)

Verification

Uh oh!

MaxGekk Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

MaxGekk Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

MaxGekk Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

MaxGekk left a comment

Choose a reason for hiding this comment

Verification

Uh oh!

MaxGekk commented Jun 13, 2026

Uh oh!

dwsmith1983 commented Jun 14, 2026

Uh oh!

sunchao commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

sunchao commented May 23, 2026 •

edited

Loading