[SPARK-54593][SQL] Fix DPP eligibility for materialized filtering sides#56535

Draft
sunchao wants to merge 1 commit into apache:master from sunchao:dev/chao/codex/dpp-materialized-input-correctness

sunchao commented Jun 16, 2026

Why are the changes needed?

PR #56071 extended dynamic partition pruning (DPP) eligibility to filtering plans containing a LocalRelation or a checkpoint-derived LogicalRDD. However, a materialized leaf does not make every operator above it repeatable. If a derived filtering plan contains user code, a subquery, or another non-repeatable operator, DPP may evaluate it independently from the join or may bind to a matching sibling broadcast elsewhere in the physical plan. The two evaluations can then produce different pruning keys and incorrectly remove rows from the probe side.
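
The failure mode described above can be illustrated with a small, Spark-free sketch. Everything in it is hypothetical: `NonRepeatableFilteringSide` stands in for a filtering plan containing user code, and the key sets are made up for illustration. It only shows why pruning with one evaluation and joining with another can drop rows; it is not how Spark implements DPP.

```scala
object DppDivergenceSketch {
  // Probe side rows: (partitionKey, payload). Illustrative data only.
  val probe: Seq[(Int, String)] = Seq(1 -> "a", 2 -> "b", 3 -> "c", 4 -> "d")

  // Hypothetical non-repeatable filtering side: each evaluation yields
  // different keys, standing in for user code with hidden state.
  final class NonRepeatableFilteringSide {
    private var evaluations = 0
    def evaluate(): Set[Int] = {
      evaluations += 1
      if (evaluations == 1) Set(1, 2) else Set(2, 3)
    }
  }

  // Toy inner join: keep probe rows whose key appears on the filtering side.
  def join(rows: Seq[(Int, String)], keys: Set[Int]): Seq[(Int, String)] =
    rows.filter { case (k, _) => keys.contains(k) }

  // Returns (result with DPP applied, result without DPP).
  def run(): (Seq[(Int, String)], Seq[(Int, String)]) = {
    val side = new NonRepeatableFilteringSide
    val pruningKeys = side.evaluate() // evaluation used for pruning
    val joinKeys = side.evaluate()    // independent evaluation used by the join
    val pruned = probe.filter { case (k, _) => pruningKeys.contains(k) }
    (join(pruned, joinKeys), join(probe, joinKeys))
  }

  def main(args: Array[String]): Unit = {
    val (withDpp, withoutDpp) = run()
    println(s"with DPP:    $withDpp")
    println(s"without DPP: $withoutDpp")
  }
}
```

In this sketch the row with key 3 matches the join's own evaluation but was already pruned by the other evaluation, so the pruned query returns fewer rows than the unpruned one.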

The checkpoint marker also records provenance rather than materialization state. A lazy checkpoint is therefore considered eligible before its first action has actually materialized and truncated the RDD lineage.

This is a follow-up to #56071. The materialized-input approach originated in #53263 (SPARK-54554) and was extended to LocalRelation and LogicalRDD in #53324 (SPARK-54593). This follow-up credits @mc8max and @dwsmith1983 as co-authors, as requested in the attribution discussion on #56071.

What changes were proposed in this PR?

  • Require a checkpoint-derived LogicalRDD to be both provenance-marked and actually materialized according to RDD.isCheckpointed.
  • For the new materialized-input eligibility path, require the complete filtering plan to be repeatable. The deliberately narrow whitelist accepts materialized leaves composed through deterministic Catalyst Project, Filter, Union, and SubqueryAlias nodes, while rejecting subqueries, user-defined/non-SQL expressions, generators, and unknown logical operators.
  • Preserve standalone DPP for safe local and checkpointed filtering plans instead of forcing all materialized inputs into broadcast-only reuse.
  • Add regressions for mixed materialization, non-repeatable mapPartitions, scalar subqueries, standalone DPP, lazy checkpoint materialization, and the sibling-broadcast wrong-result shape with adaptive execution both disabled and enabled.
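
As a rough illustration of the first two rules above, the following self-contained sketch models the eligibility check on a toy plan type. The case classes are simplified stand-ins named after the Catalyst nodes, not Spark's actual classes, and the `deterministic`, `hasSubquery`, `markedAsCheckpoint`, and `isCheckpointed` fields are assumptions made for illustration; the real check in this PR may be structured differently.

```scala
object RepeatabilitySketch {
  // Toy stand-ins for logical plan nodes; NOT Spark's Catalyst classes.
  sealed trait Plan
  case object LocalRelation extends Plan
  // markedAsCheckpoint: provenance marker; isCheckpointed: mirrors RDD.isCheckpointed.
  final case class LogicalRDD(markedAsCheckpoint: Boolean, isCheckpointed: Boolean) extends Plan
  final case class Project(deterministic: Boolean, child: Plan) extends Plan
  final case class Filter(deterministic: Boolean, hasSubquery: Boolean, child: Plan) extends Plan
  final case class Union(children: Seq[Plan]) extends Plan
  final case class SubqueryAlias(child: Plan) extends Plan
  // Anything not on the whitelist (generators, user code, unknown operators).
  final case class Other(name: String, children: Seq[Plan]) extends Plan

  // A plan is repeatable only if every node is whitelisted and every leaf is materialized.
  def isRepeatable(plan: Plan): Boolean = plan match {
    case LocalRelation            => true
    case LogicalRDD(marked, done) => marked && done // marker alone is not enough
    case Project(det, child)      => det && isRepeatable(child)
    case Filter(det, sub, child)  => det && !sub && isRepeatable(child)
    case Union(children)          => children.nonEmpty && children.forall(isRepeatable)
    case SubqueryAlias(child)     => isRepeatable(child)
    case _: Other                 => false // unknown operators are rejected by default
  }

  def main(args: Array[String]): Unit = {
    println(isRepeatable(Project(deterministic = true, LocalRelation)))
    println(isRepeatable(LogicalRDD(markedAsCheckpoint = true, isCheckpointed = false)))
  }
}
```

Rejecting unknown operators by default keeps the whitelist narrow: a new operator has to be added explicitly before a plan containing it becomes eligible.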

This changes only behavior introduced on unreleased master: unsafe derived materialized plans no longer receive DPP, while repeatable materialized plans retain the optimization.

Generated-by: OpenAI Codex

How was this PR tested?

  • build/sbt 'sql/testOnly org.apache.spark.sql.DynamicPartitionPruningV1SuiteAEOff org.apache.spark.sql.DynamicPartitionPruningV1SuiteAEOn' (82 passed, 2 ignored)
  • build/sbt 'sql/testOnly org.apache.spark.sql.DatasetSuite -- -z "Dataset.checkpoint() - basic"' (4 passed)
  • build/sbt sql/scalastyle sql/Test/scalastyle (0 errors and 0 warnings)

Co-authored-by: Tri Tam Hoang <tritam.hoang@gmail.com>
Co-authored-by: Dustin Smith <Dustin.William.Smith@gmail.com>