Skip to content

PARQUET-2249: Add IEEE-754 total order and nan count for floating types#3393

Merged
wgtmac merged 5 commits into
apache:masterfrom
wgtmac:PARQUET-2249
Jun 25, 2026
Merged

PARQUET-2249: Add IEEE-754 total order and nan count for floating types#3393
wgtmac merged 5 commits into
apache:masterfrom
wgtmac:PARQUET-2249

Conversation

@wgtmac

@wgtmac wgtmac commented Feb 12, 2026

Copy link
Copy Markdown
Member

Rationale for this change

This implements the parquet-format IEEE 754 total order column-order work from apache/parquet-format#514 in parquet-java. The existing type-defined order remains the default.

While adding the new order, this also fixes floating-point NaN handling so parquet-java preserves the exact FLOAT, DOUBLE, and FLOAT16 bit patterns supplied by applications, including NaN sign and payload bits. Filters are updated to avoid false negatives when NaN semantics cannot be answered safely from metadata.

What changes are included in this PR?

  • Add IEEE_754_TOTAL_ORDER support for FLOAT, DOUBLE, and FLOAT16, including comparators that distinguish -0/+0 and NaN bit patterns.
  • Add floating-point nan_count support in statistics and column indexes, with IEEE-754-specific statistics implementations.
  • Preserve raw floating-point bit patterns in parquet-java floating encodings, including dictionary and byte-stream-split paths.
  • Update statistics, column-index, dictionary, bloom-filter, and record-level filtering so NaN values are handled conservatively where required.
  • Add coverage for statistics, column indexes, dictionary filtering, bloom filtering, record-level filtering, Float16, and parquet-testing interop files.

Are these changes tested?

Yes. This PR adds focused unit and end-to-end tests for the new column order, NaN counts, NaN filtering behavior, raw-bit preservation, and interop with parquet-testing files.

Are there any user-facing changes?

Yes. Users can opt into IEEE_754_TOTAL_ORDER for floating columns. The default column order is unchanged, and existing files remain readable. Filtering around NaN values may be more conservative to avoid dropping data that can still match.

Closes #406

@wgtmac wgtmac force-pushed the PARQUET-2249 branch 2 times, most recently from c01b3f3 to 4b7e86b Compare March 6, 2026 15:27
@wgtmac wgtmac force-pushed the PARQUET-2249 branch 2 times, most recently from 133fb4b to 07a4d77 Compare March 14, 2026 15:43

@shangxinli shangxinli left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Gang for picking this up and driving it forward!

+1 on the approach. This combined solution addresses both the ordering ambiguity and the NaN pollution concern pragmatically. Looking forward to seeing the arrow-cpp PoC as well.

@etseidl

etseidl commented Apr 2, 2026

Copy link
Copy Markdown
Contributor

@wgtmac thank you for adding the interop test! 🙏 In arrow-rs we've made the decision to only write the new column order for floats, so I can't reproduce the total order columns.

Some things I think need to be added are negative NaNs, as well as examples where the min and/or max are 0. The latter is to make sure that the old rules regarding 0 min being set to -0 and 0 max set +0 are no longer followed with the new ordering.

@wgtmac

wgtmac commented Apr 3, 2026

Copy link
Copy Markdown
Member Author

In arrow-rs we've made the decision to only write the new column order for floats, so I can't reproduce the total order columns.

Did you mean arrow-rs will no longer write floats with the legacy TypeDefinedOrder? From the perspective of interoperability test, I think this is fine if it does not fail when reading files produced by other writers.

Some things I think need to be added are negative NaNs, as well as examples where the min and/or max are 0

That's a good suggestion! I have updated the floating-point interop coverage to add explicit ZERO_MIN and ZERO_MAX cases, so we now verify that IEEE-754 total order no longer rewrites +0 min to -0 or -0 max to +0. I also expanded the NaN coverage to include both negative and positive NaN patterns.

While debugging the test, I found that the Java implementation uses Float.floatToIntBits instead of Float.floatToRawIntBits (same for double) which canonicalizes NaN bits and pollutes both values and stats. I fixed the float/double write paths to preserve raw NaN bits instead of canonicalizing them.

@etseidl

etseidl commented Apr 3, 2026

Copy link
Copy Markdown
Contributor

Thank you. I hope to have the rust tests done today.

@wgtmac

wgtmac commented Jun 15, 2026

Copy link
Copy Markdown
Member Author

@shangxinli @etseidl Do you want to take a look again? I think now everything is ready on my end. cc @gszadovszky @Fokko

@gszadovszky gszadovszky left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a couple of comments.

Also, it would be nice to add coverage for predicate filtering with signed NaN values. Since NaN value are excluded from the statistics, this is only for the value level (ValueInspector). Can be either covered unit or integration level.

Comment thread parquet-column/src/main/java/org/apache/parquet/schema/PrimitiveType.java Outdated
@wgtmac

wgtmac commented Jun 17, 2026

Copy link
Copy Markdown
Member Author

Thanks @gszadovszky for the review! I think I've addressed all your comments. Let me know what you think.

@gszadovszky

Copy link
Copy Markdown
Contributor

Thank you, @wgtmac. One more thing.

This PR also changes how NaN values are encoded in dictionaries. I'm not sure if it is tightly related to IEEE-754, and is a breaking change. Before, we have lost the original NaN value during the dictionary encoding, and encoded every NaNs into the same bit pattern (e.g. 0x7ff8000000000000L for doubles). With this change, we preserve the actual bit pattern, whatever it means. We can say this is a bug fix, but maybe, we should discuss this behavioral change with the community first.

If we move to that direction, we also need to fix the dictionary filter. We use a boxed HashSet there, so even if the dictionary itself preserves the original bit patterns, the filter itself does not.

@wgtmac

wgtmac commented Jun 17, 2026

Copy link
Copy Markdown
Member Author

@gszadovszky That's a good question! I think the main behavior change is that now original NaN bits are preserved in not only dictionary but also encoded values. I would regard this as a benign bug fix.

For dictionary filter and bloom filter on the read path, they are not aware of column order. Introducing IEEE754 total order is anyway a breaking change to them because raw bits of NaNs must be preserved as is.

Update: I've sent https://lists.apache.org/thread/m6j8lzc09ytyd45wt6pdcyn5qy95f0vt to discuss this.

shangxinli
shangxinli previously approved these changes Jun 22, 2026
@shangxinli shangxinli dismissed their stale review June 22, 2026 14:38

The discussion is still going on.

@wgtmac

wgtmac commented Jun 24, 2026

Copy link
Copy Markdown
Member Author

I updated the patch so the pruning filters and record-level filters have separate semantics:

  • Dictionary and Bloom filters now handle NaN literals conservatively and return “might match”. So we don't need to worry about different NaN semantics and boxed NaN values in the dictionary.
  • Record-level filtering does not special-case NaN. It uses the column’s PrimitiveComparator, so the behavior follows the declared column order: TYPE_DEFINED_ORDER treats NaNs according to the existing type-defined comparator semantics, while IEEE_754_TOTAL_ORDER compares the raw floating-point bits. The same applies to +0/-0.

I still think that preserving raw bits of NaN values is a bug fix and benign breaking change. Let me know what you think. @gszadovszky

@gszadovszky

Copy link
Copy Markdown
Contributor

Thanks @wgtmac for the updates. I think one issue remains with NaNs in the dictionary. We "canonicalize" the NaN values in the dictionary filter. So, if there are NaN and -NaN values in the dictionary, and the filter has a range predicate (e.g. > 0), we may drop the row group even if there should be a match based on the total ordering. We probably need to allow a "might match" explicitly if any NaN values are in the dictionary.

@wgtmac

wgtmac commented Jun 25, 2026

Copy link
Copy Markdown
Member Author

That's my oversight and thanks @gszadovszky for catching this! Now it should be fixed as well.

@gszadovszky gszadovszky left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @wgtmac. LGTM.

@wgtmac wgtmac merged commit 071a841 into apache:master Jun 25, 2026
3 checks passed
@wgtmac

wgtmac commented Jun 25, 2026

Copy link
Copy Markdown
Member Author

Merged. Thanks @etseidl @shangxinli @gszadovszky for the review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants