PARQUET-2249: Add IEEE-754 total order and nan count for floating types by wgtmac · Pull Request #3393 · apache/parquet-java

wgtmac · 2026-02-12T16:51:33Z

Rationale for this change

This implements the parquet-format IEEE 754 total order column-order work from apache/parquet-format#514 in parquet-java. The existing type-defined order remains the default.

While adding the new order, this also fixes floating-point NaN handling so parquet-java preserves the exact FLOAT, DOUBLE, and FLOAT16 bit patterns supplied by applications, including NaN sign and payload bits. Filters are updated to avoid false negatives when NaN semantics cannot be answered safely from metadata.

What changes are included in this PR?

Add IEEE_754_TOTAL_ORDER support for FLOAT, DOUBLE, and FLOAT16, including comparators that distinguish -0/+0 and NaN bit patterns.
Add floating-point nan_count support in statistics and column indexes, with IEEE-754-specific statistics implementations.
Preserve raw floating-point bit patterns in parquet-java floating encodings, including dictionary and byte-stream-split paths.
Update statistics, column-index, dictionary, bloom-filter, and record-level filtering so NaN values are handled conservatively where required.
Add coverage for statistics, column indexes, dictionary filtering, bloom filtering, record-level filtering, Float16, and parquet-testing interop files.

Are these changes tested?

Yes. This PR adds focused unit and end-to-end tests for the new column order, NaN counts, NaN filtering behavior, raw-bit preservation, and interop with parquet-testing files.

Are there any user-facing changes?

Yes. Users can opt into IEEE_754_TOTAL_ORDER for floating columns. The default column order is unchanged, and existing files remain readable. Filtering around NaN values may be more conservative to avoid dropping data that can still match.

Closes #406

shangxinli

Thanks Gang for picking this up and driving it forward!

+1 on the approach. This combined solution addresses both the ordering ambiguity and the NaN pollution concern pragmatically. Looking forward to seeing the arrow-cpp PoC as well.

etseidl · 2026-04-02T16:14:32Z

@wgtmac thank you for adding the interop test! 🙏 In arrow-rs we've made the decision to only write the new column order for floats, so I can't reproduce the total order columns.

Some things I think need to be added are negative NaNs, as well as examples where the min and/or max are 0. The latter is to make sure that the old rules regarding 0 min being set to -0 and 0 max set +0 are no longer followed with the new ordering.

wgtmac · 2026-04-03T02:05:07Z

In arrow-rs we've made the decision to only write the new column order for floats, so I can't reproduce the total order columns.

Did you mean arrow-rs will no longer write floats with the legacy TypeDefinedOrder? From the perspective of interoperability test, I think this is fine if it does not fail when reading files produced by other writers.

Some things I think need to be added are negative NaNs, as well as examples where the min and/or max are 0

That's a good suggestion! I have updated the floating-point interop coverage to add explicit ZERO_MIN and ZERO_MAX cases, so we now verify that IEEE-754 total order no longer rewrites +0 min to -0 or -0 max to +0. I also expanded the NaN coverage to include both negative and positive NaN patterns.

While debugging the test, I found that the Java implementation uses Float.floatToIntBits instead of Float.floatToRawIntBits (same for double) which canonicalizes NaN bits and pollutes both values and stats. I fixed the float/double write paths to preserve raw NaN bits instead of canonicalizing them.

etseidl · 2026-04-03T14:19:03Z

Thank you. I hope to have the rust tests done today.

wgtmac · 2026-06-15T01:45:35Z

@shangxinli @etseidl Do you want to take a look again? I think now everything is ready on my end. cc @gszadovszky @Fokko

gszadovszky

Added a couple of comments.

Also, it would be nice to add coverage for predicate filtering with signed NaN values. Since NaN value are excluded from the statistics, this is only for the value level (ValueInspector). Can be either covered unit or integration level.

wgtmac · 2026-06-17T06:34:13Z

Thanks @gszadovszky for the review! I think I've addressed all your comments. Let me know what you think.

gszadovszky · 2026-06-17T07:51:29Z

Thank you, @wgtmac. One more thing.

This PR also changes how NaN values are encoded in dictionaries. I'm not sure if it is tightly related to IEEE-754, and is a breaking change. Before, we have lost the original NaN value during the dictionary encoding, and encoded every NaNs into the same bit pattern (e.g. 0x7ff8000000000000L for doubles). With this change, we preserve the actual bit pattern, whatever it means. We can say this is a bug fix, but maybe, we should discuss this behavioral change with the community first.

If we move to that direction, we also need to fix the dictionary filter. We use a boxed HashSet there, so even if the dictionary itself preserves the original bit patterns, the filter itself does not.

wgtmac · 2026-06-17T09:42:46Z

@gszadovszky That's a good question! I think the main behavior change is that now original NaN bits are preserved in not only dictionary but also encoded values. I would regard this as a benign bug fix.

For dictionary filter and bloom filter on the read path, they are not aware of column order. Introducing IEEE754 total order is anyway a breaking change to them because raw bits of NaNs must be preserved as is.

Update: I've sent https://lists.apache.org/thread/m6j8lzc09ytyd45wt6pdcyn5qy95f0vt to discuss this.

The discussion is still going on.

wgtmac · 2026-06-24T12:47:09Z

I updated the patch so the pruning filters and record-level filters have separate semantics:

Dictionary and Bloom filters now handle NaN literals conservatively and return “might match”. So we don't need to worry about different NaN semantics and boxed NaN values in the dictionary.
Record-level filtering does not special-case NaN. It uses the column’s PrimitiveComparator, so the behavior follows the declared column order: TYPE_DEFINED_ORDER treats NaNs according to the existing type-defined comparator semantics, while IEEE_754_TOTAL_ORDER compares the raw floating-point bits. The same applies to +0/-0.

I still think that preserving raw bits of NaN values is a bug fix and benign breaking change. Let me know what you think. @gszadovszky

gszadovszky · 2026-06-24T14:38:40Z

Thanks @wgtmac for the updates. I think one issue remains with NaNs in the dictionary. We "canonicalize" the NaN values in the dictionary filter. So, if there are NaN and -NaN values in the dictionary, and the filter has a range predicate (e.g. > 0), we may drop the row group even if there should be a match based on the total ordering. We probably need to allow a "might match" explicitly if any NaN values are in the dictionary.

wgtmac · 2026-06-25T04:53:15Z

That's my oversight and thanks @gszadovszky for catching this! Now it should be fixed as well.

gszadovszky

Thank you, @wgtmac. LGTM.

wgtmac · 2026-06-25T08:18:24Z

Merged. Thanks @etseidl @shangxinli @gszadovszky for the review!

wgtmac force-pushed the PARQUET-2249 branch 2 times, most recently from c01b3f3 to 4b7e86b Compare March 6, 2026 15:27

wgtmac force-pushed the PARQUET-2249 branch 2 times, most recently from 133fb4b to 07a4d77 Compare March 14, 2026 15:43

shangxinli reviewed Mar 15, 2026

View reviewed changes

Comment thread parquet-column/src/main/java/org/apache/parquet/column/statistics/Statistics.java

etseidl reviewed Mar 17, 2026

View reviewed changes

Comment thread parquet-column/src/main/java/org/apache/parquet/column/statistics/FloatStatistics.java Outdated

etseidl mentioned this pull request Mar 20, 2026

PARQUET-2249: Introduce IEEE 754 total order & NaN-counts apache/parquet-format#514

Merged

wgtmac force-pushed the PARQUET-2249 branch from 9694db6 to 68ed638 Compare March 28, 2026 06:04

wgtmac mentioned this pull request Apr 3, 2026

Add test file for floating types with different column orders and nan_counts apache/parquet-testing#104

Merged

wgtmac force-pushed the PARQUET-2249 branch from c5dcd8c to 046239f Compare June 13, 2026 12:27

gszadovszky requested changes Jun 15, 2026

View reviewed changes

wgtmac force-pushed the PARQUET-2249 branch from f49705f to 71ed682 Compare June 17, 2026 05:01

shangxinli previously approved these changes Jun 22, 2026

View reviewed changes

wgtmac mentioned this pull request Jun 23, 2026

GH-3632: Fix record-level notIn evaluation #3631

Merged

wgtmac added 3 commits June 24, 2026 14:27

PARQUET-2249: Add IEEE-754 total order and nan count for floating types

aaeabd0

add interop test against parquet-testing

f9d4ad2

address comment

39cd40a

wgtmac force-pushed the PARQUET-2249 branch from 71ed682 to 39cd40a Compare June 24, 2026 06:31

deal with nan filter

7c77c0e

fix dictionary filter

6e93823

wgtmac force-pushed the PARQUET-2249 branch from fd9d801 to 6e93823 Compare June 25, 2026 03:35

gszadovszky approved these changes Jun 25, 2026

View reviewed changes

wgtmac merged commit 071a841 into apache:master Jun 25, 2026
3 checks passed

Uh oh!

Conversation

wgtmac commented Feb 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

shangxinli left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

etseidl commented Apr 2, 2026

Uh oh!

wgtmac commented Apr 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

etseidl commented Apr 3, 2026

Uh oh!

wgtmac commented Jun 15, 2026

Uh oh!

gszadovszky left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

wgtmac commented Jun 17, 2026

Uh oh!

gszadovszky commented Jun 17, 2026

Uh oh!

wgtmac commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

wgtmac commented Jun 24, 2026

Uh oh!

gszadovszky commented Jun 24, 2026

Uh oh!

wgtmac commented Jun 25, 2026

Uh oh!

gszadovszky left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

wgtmac commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

wgtmac commented Feb 12, 2026 •

edited

Loading

shangxinli left a comment •

edited

Loading

wgtmac commented Apr 3, 2026 •

edited

Loading

wgtmac commented Jun 17, 2026 •

edited

Loading