[fix](maxcompute) Estimate write block size from Arrow buffers, not per-row serialization#64612
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
TPC-H: Total hot run time: 28924 ms |
TPC-DS: Total hot run time: 175241 ms |
ClickBench: Total hot run time: 25.24 s |
|
/review |
There was a problem hiding this comment.
Requesting changes for one correctness issue in the MaxCompute writer.
Critical checkpoint conclusions:
- Goal/test coverage: The PR intends to replace per-row serialized size estimation with batch buffer-size estimation. The faster estimate itself is clear, but the current control flow no longer proves the write-block threshold is respected. I did not find a test covering rollover when an incoming JNI batch pushes a non-empty MaxCompute block past
mc.write_max_block_bytes. - Scope/focus: The change is localized to
MaxComputeJniWriter.java, but it removes the only intra-batch row-range splitting in the MaxCompute write path. - Concurrency/lifecycle/config/compatibility: No new concurrency or lifecycle issue found. No new config item is added. The existing
mc.write_max_block_bytescontract is affected by the new rotation timing. - Parallel paths: BE forwards each materialized block to the Java writer through
VJniFormatTransformer/VMCPartitionWriter; I did not find an upstream byte-splitting path that compensates for this removal. - Tests/style:
git diff --checkpasses for the changed file. I did not run the MaxCompute connector test suite in this review environment.
User focus: no additional user-provided focus was supplied.
Subagent conclusions: OR-1 was accepted and became the inline comment below. TSC-1 reported the same issue and was merged as a duplicate of OR-1/M-1. Convergence round 1 ended with both live subagents replying NO_NEW_VALUABLE_FINDINGS for the same ledger and proposed comment set.
| // Discard bytes while allowing WriteChannel to track payload size. | ||
| long batchBytes; | ||
| try (VectorSchemaRoot root = buildRowRangeRoot(inputTable, numCols, 0, numRows)) { | ||
| batchBytes = estimateBatchPayloadBytes(root); |
There was a problem hiding this comment.
This no longer enforces mc.write_max_block_bytes for the block being written. writeBatch only rotates when the current block is already over the limit, then estimates batchBytes, writes the whole VectorSchemaRoot, and adds the bytes afterward. With a 64MB threshold, a current block at 60MB and a 20MB incoming JNI block will be committed as an 80MB MaxCompute block; rotation happens only before the next batch. The removed code split the input row range and rotated before appending rows that would overflow the current block, and the BE side does not do any byte splitting before calling this Java writer. Please check currentBlockWrittenBytes + batchBytes before writing and split oversized incoming batches into row ranges, preserving the previous single-row oversize fallback.
…er-row serialization The old per-row estimateSingleRowPayloadBytes ZSTD-serialized a one-row batch for every row (CPU-heavy and ~25x oversized); sum FieldVector.getBufferSize() over the whole batch instead, and rotate the block lazily.
2fd10f3 to
eccadf7
Compare
|
/review |
|
run buildall |
TPC-H: Total hot run time: 29069 ms |
There was a problem hiding this comment.
Requesting changes for one remaining correctness/performance issue in the MaxCompute writer split path.
Critical checkpoint conclusions:
- Goal and tests: the PR aims to reduce MaxCompute writer block-size estimation cost while preserving
mc.write_max_block_bytessplitting. The helper tests coverfindPartialRowRange, but they do not prove the actualwriteBatchpath avoids materializing an oversized incoming block before splitting. - Scope: the code change is localized to
MaxComputeJniWriterplus one unit test, but the new fast-path estimate is placed before a bounded range is chosen. - Concurrency and lifecycle: no new concurrency was introduced; the existing open/write/close lifecycle and commit-message collection remain the relevant lifecycle path.
- Configuration and compatibility: no new config or wire/storage format is added. The existing
mc.write_max_block_bytessetting remains the behavioral contract affected here. - Parallel paths: BE sends each block to the Java JNI writer through
VJniFormatTransformer.write()without doing byte-based splitting first, so Java-sidewriteBatchmust bound the Arrow materialization itself. - Tests and CI: I did not run Maven/build tests because
thirdparty/installed/bin/protocis missing in this checkout, andfe/AGENTS.mdsays to stop build setup in that case. - Observability, transactions, persistence, data visibility: no new persistence or transaction-log path is modified. The issue can fail before MaxCompute commit/close, so additional observability is secondary to fixing the allocation order.
- Performance: the accepted inline issue is a peak allocation/copying regression for large BE blocks that need splitting.
Subagent conclusions: optimizer-rewrite reported no new optimizer/rewrite findings. tests-session-config proposed TSC-1, which was independently verified and accepted as MAIN-1. The existing block-overflow GitHub thread was treated as already-known context, not duplicated. Convergence round 1 ended with both subagents returning NO_NEW_VALUABLE_FINDINGS for the final ledger/comment set.
| private void writeRowsWithRowChecks(VectorTable inputTable, int numRows, int numCols) throws IOException { | ||
| private void writeBatch(VectorTable inputTable, int numRows, int numCols) throws IOException { | ||
| int rowStart = 0; | ||
| while (rowStart < numRows) { |
There was a problem hiding this comment.
The split path still builds an Arrow root for the entire remaining JNI block before it knows whether the block fits. rowEnd is set to numRows here and buildRowRangeRoot(inputTable, numCols, rowStart, rowEnd) allocates/fills vectors for every remaining row before estimateBatchPayloadBytes can fall through to findPartialRowRange. Since VJniFormatTransformer.write() passes the whole BE block to Java in one VectorTable, a large incoming block that should be split by mc.write_max_block_bytes still has to be copied into one full-size VectorSchemaRoot first, so this fix can still hit the same heap pressure it is trying to avoid. Please choose the bounded row range before materializing Arrow vectors, and add a writeBatch-level test/injection that fails if an oversized input first requests rowStart..numRows.
TPC-DS: Total hot run time: 172778 ms |
TPC-H: Total hot run time: 28526 ms |
TPC-DS: Total hot run time: 172100 ms |
ClickBench: Total hot run time: 25.2 s |
FE UT Coverage ReportIncrement line coverage `` 🎉 |
|
/review |
|
run buildall |
|
Codex automated review failed and did not complete. Error: You've hit your usage limit. Visit https://chatgpt.com/codex/settings/usage to purchase more credits or try again at 2:55 PM. Please inspect the workflow logs and rerun the review after the underlying issue is resolved. |
TPC-H: Total hot run time: 29053 ms |
TPC-DS: Total hot run time: 173106 ms |
ClickBench: Total hot run time: 25.28 s |
|
/review |
There was a problem hiding this comment.
Reviewed the current diff for PR 64612.
I did not find any new non-duplicate inline issue beyond the two existing MaxCompute block-size/materialization threads:
discussion_r3449574050covers enforcingmc.write_max_block_bytesagainst the current block before writing and splitting incoming batches.discussion_r3458025412covers avoiding oversized Arrow-root materialization before determining a bounded row range, plus the missingwriteBatch-level oversized-input test/injection.
Code-review checkpoint conclusions:
- Goal/test: the PR is focused on MaxCompute JNI writer block splitting and adds unit coverage for helper logic, but the existing threads still own the remaining writeBatch-level behavioral concerns.
- Scope/focus: changed surface is limited to
MaxComputeJniWriterand its unit test. - Concurrency/lifecycle: no new concurrent path or special static lifecycle was introduced; writer state remains per writer instance.
- Configuration/session propagation: no new config was added;
mc.write_max_block_bytescontinues through catalog properties into BE writer params and then the Java writer. - Compatibility/persistence/transactions: no storage format, function symbol, edit log, or transaction persistence change was introduced.
- Parallel paths: BE sends whole blocks through
VJniFormatTransformer; no C++ byte-splitting path exists, matching the existing review context. - Tests/results: new tests cover helper methods, not the full writeBatch path noted in the existing thread. I could not run the targeted Maven test locally because this runner lacks
thirdparty/installed/bin/thrift. - Observability/performance: no new observability requirement found; performance risk remains the existing oversized-materialization discussion.
User focus: no additional user-provided focus was supplied.
Subagent conclusions: optimizer-rewrite found no applicable optimizer/rewrite candidate. tests-session-config found no new non-duplicate candidate and confirmed the missing writeBatch-level oversized-input coverage is already covered by the existing thread. Convergence round 1 ended with both subagents reporting NO_NEW_VALUABLE_FINDINGS.
…er-row serialization (#64612) The old per-row estimateSingleRowPayloadBytes ZSTD-serialized a one-row batch for every row (CPU-heavy and ~25x oversized); sum FieldVector.getBufferSize() over the whole batch instead, and rotate the block lazily.
What problem does this PR solve?
Problem Summary:
The old per-row estimateSingleRowPayloadBytes ZSTD-serialized a one-row batch for every row (CPU-heavy and ~25x oversized); sum FieldVector.getBufferSize() over the whole batch instead, and rotate the block lazily.
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)