[pull] master from apache:master #31

Open
pull[bot] wants to merge 6204 commits into kkpan11:master from apache:master

pull Bot commented Mar 12, 2025

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.1)

pull Bot added the ⤵️ pull label Mar 12, 2025
Gabriel39 and others added 29 commits June 3, 2026 09:34
Problem Summary: Iceberg data location resolution skipped the legacy
object-store.path property and fell back directly to
write.folder-storage.path or table location when write.data.path was not
set. This could choose the wrong data location for tables that still
rely on object-store.path. This change keeps write.data.path as the
highest priority, then checks object-store.path before
write.folder-storage.path and the default table data directory.
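
The priority order described above can be sketched as a simple first-match lookup. This is only an illustration: the three property names come from the commit message, while the class, the helper names, and the assumption that the default data directory is `<table location>/data` are hypothetical and not taken from the Doris code.

```java
import java.util.Map;

// Hypothetical sketch of the lookup order; not the actual Doris implementation.
final class IcebergDataLocationSketch {
    static final String WRITE_DATA_PATH = "write.data.path";
    static final String OBJECT_STORE_PATH = "object-store.path";
    static final String FOLDER_STORAGE_PATH = "write.folder-storage.path";

    // Returns the first configured location in priority order,
    // falling back to the table's default data directory.
    static String resolveDataLocation(Map<String, String> props, String tableLocation) {
        String[] keysByPriority = {WRITE_DATA_PATH, OBJECT_STORE_PATH, FOLDER_STORAGE_PATH};
        for (String key : keysByPriority) {
            String value = props.get(key);
            if (value != null && !value.isEmpty()) {
                return stripTrailingSlash(value);
            }
        }
        // Assumption: the default data directory is "<table location>/data".
        return stripTrailingSlash(tableLocation) + "/data";
    }

    static String stripTrailingSlash(String path) {
        return path.endsWith("/") ? path.substring(0, path.length() - 1) : path;
    }
}
```

With only `object-store.path` and `write.folder-storage.path` set, the sketch returns the `object-store.path` value, which is the case the fix targets.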
Problem Summary: FileScanner kept passing raw IOContext pointers to
several file readers, so DelegateReader could still create a
shallow-copied IOContext on the hot scan path. That left different
IOContext instances inside the same reader stack and could also
dereference missing child stats pointers when an IOContext existed
without file reader stats. This change keeps FileScanner's IOContext in
a shared holder, passes it through CSV, text, JSON, native, Parquet,
ORC, and table-format reader variants, and makes Native/Parquet/ORC use
the shared DelegateReader API when a holder is available. Tracing/stat
updates now check the nested stats pointer before use.
…63894)

Currently, the update of partitions only depends on the visible version
and visible time. If a balance occurs, the version and time of the
partition will not be updated, which means that the updated partition
will not be retrieved from the remote FE. When executing a query, the
tablet on the BE node may no longer exist, resulting in query errors.
To avoid this problem, a checksum will be calculated for the partition
to determine whether the partition's metadata has changed.
…63809)

### What problem does this PR solve?

Problem Summary:

External catalog meta cache statistics exposed cumulative eviction
count, but did not provide a direct replacement frequency metric for
judging whether cache capacity is too small. This PR adds
`EVICTION_RATE` to `information_schema.catalog_meta_cache_statistics`,
calculated as `eviction_count / request_count` and returned as `0` when
there are no requests.
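
As a rough illustration of the metric definition, the ratio with its zero-request guard might look like the following. The class and method names are hypothetical; only the formula `eviction_count / request_count` and the `0` fallback come from the description.

```java
// Hypothetical helper mirroring the EVICTION_RATE definition in the text.
final class EvictionRateSketch {
    static double evictionRate(long evictionCount, long requestCount) {
        if (requestCount <= 0) {
            return 0.0; // no requests yet: report 0 instead of dividing by zero
        }
        return (double) evictionCount / requestCount;
    }
}
```

A rate close to 1 would suggest that nearly every request displaces an entry, which is the signal for an undersized cache.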

Hive partition metadata cache defaults were also too small for common
external catalog workloads, causing frequent evictions without explicit
tuning. This PR increases the default Hive single-partition cache
capacity from 10,000 to 100,000 and the Hive partitioned-table values
cache capacity from 1,000 to 10,000. While checking similar cache
entries, MaxCompute `partition_values` was found to cache table-level
partition value structures but use the Hive single-partition capacity;
it now follows the table-level partition values capacity.

### Release note

Add `EVICTION_RATE` to
`information_schema.catalog_meta_cache_statistics`, increase default
Hive partition meta cache capacities, and make MaxCompute
`partition_values` use the table-level partition values capacity.

### Check List (For Author)

- Test <!-- At least one of them must be included. -->
    - [ ] Regression test
    - [x] Unit Test
        - `./run-fe-ut.sh --run org.apache.doris.datasource.metacache.MetaCacheEntryTest`
        - `./run-fe-ut.sh --run org.apache.doris.datasource.hive.HiveMetaStoreCacheTest,org.apache.doris.datasource.maxcompute.MaxComputeExternalMetaCacheTest`
    - [ ] Manual test (add detailed scripts or steps below)
    - [ ] No need to test or manual test. Explain why:
        - [ ] This is a refactor/code format and no logic has been changed.
        - [ ] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [ ] Other reason <!-- Add your reason?  -->

- Behavior changed:
    - [ ] No.
    - [x] Yes. `catalog_meta_cache_statistics` includes `EVICTION_RATE`;
      default Hive partition meta cache capacities are larger; MaxCompute
      `partition_values` uses the table-level partition values capacity.

- Does this need documentation?
    - [x] No.
    - [ ] Yes. <!-- Add document PR link here. eg: apache/doris-website#1214 -->

### Check List (For Reviewer who merge this PR)

- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR
should merge into -->
The scan operator unconditionally skipped VARBINARY
column predicate and TopN runtime predicate pushdown. The commit that
introduced the guard was for external Parquet/file scan reader predicate
limitations, so applying it in the shared scan path also blocked
non-file scans. This change adds a scan-operator hook for column
predicate pushdown capability, keeps the default permissive, and makes
FileScanOperatorX reject VARBINARY column predicates.
…3718)

Problem Summary: `DataTypeVariantSerDe::write_column_to_arrow` always
cast the Arrow builder to `arrow::StringBuilder`. During Parquet OUTFILE
export, the Arrow block converter can switch utf8 columns to
`large_utf8` when a batch is large, which gives variant serialization an
`arrow::LargeStringBuilder` and crashes BE on the bad cast.

This patch handles both `arrow::StringBuilder` and
`arrow::LargeStringBuilder` for VARIANT Arrow serialization and adds a
BE UT that reproduces the LargeStringBuilder path.
…other_func_low_ndv in agg_strategy (#64022)

### Release note

None

Remove the single replica compaction (SRC) feature end-to-end across BE,
FE and regression tests.

Doc: apache/doris-website#3870

### Why remove it

The main reason is **correctness risk in peer selection**. A follower
replica had to pick a peer holding a "proper" version
(`_find_rowset_to_fetch`) and fetch its compacted result, based on
replica info that was only refreshed periodically. Because replicas
progress through versions independently and this "leader" selection ran
against a stale, time-sensitive view of the cluster, the choice of which
peer to fetch from, and which version, was racy and could select a peer
whose state no longer matched, leading to subtle inconsistencies.
### What problem does this PR solve?

Issue Number: None

Problem Summary:

Remove dead helper code from BE JSON-related implementations:

- Remove the unused `ExecuteReducer` template and its `JsonParser`/path
parsing helper chain from `function_json.cpp`.
- Remove the unused `convert_jsonb_to_rapidjson` declaration/definition
after its only live dependency was removed.
- Remove the commented-out test helper that referenced the deleted
conversion helper.
- Clean up now-unused includes and make small style cleanups around the
touched code.

This is an internal cleanup only and does not change JSON function
behavior.

### Release note

None

### Check List (For Author)

- Test: Manual test
- `ninja -C be/ut_build_ASAN
src/core/CMakeFiles/Core.dir/data_type_serde/data_type_jsonb_serde.cpp.o
src/exprs/CMakeFiles/Exprs.dir/function/function_json.cpp.o
test/CMakeFiles/doris_be_test.dir/core/column/column_variant_test.cpp.o`
    - `build-support/clang-format.sh`
    - `build-support/check-format.sh`
    - `git diff --check`
- Behavior changed: No
- Does this need documentation: No
…st (#64024)

### What problem does this PR solve?

Related PR: [63506](#63506)

Problem Summary: `test_auth_remote_ip` only needs to verify that Arrow
Flight SQL remote IP authentication allows a matched user to run `SELECT
1`. The shared `sql_impl` helper uses `PreparedStatement`, and Arrow
Flight SQL JDBC 17 can report a close-time 8-byte client allocator leak
after the prepared path has already consumed the result. This changes
the case to use `JdbcUtils.executeQueryToList`, which uses
`createStatement().executeQuery(...)`, so the test avoids the prepared
statement cleanup path without ignoring `conn.close()` exceptions.
Issue Number: #48203

Related PR: #59223

doc: apache/doris-website#3891

Problem Summary:

Support function `ARRAY_CROSS_PRODUCT`

```sql
Doris> SELECT CROSS_PRODUCT([1, 2, 3], [2, 3, 4]);
+-------------------------------------+
| CROSS_PRODUCT([1, 2, 3], [2, 3, 4]) |
+-------------------------------------+
| [-1, 2, -1]                         |
+-------------------------------------+
1 row in set (0.021 sec)

Doris> SELECT CROSS_PRODUCT([1, 2, 3], NULL);
+--------------------------------+
| CROSS_PRODUCT([1, 2, 3], NULL) |
+--------------------------------+
| NULL                           |
+--------------------------------+
1 row in set (0.009 sec)

Doris> SELECT CROSS_PRODUCT([1, NULL, 3], [1, 2, 3]);
ERROR 1105 (HY000): errCode = 2, detailMessage = (127.0.0.1)[INVALID_ARGUMENT]function array_cross_product cannot have null
Doris> SELECT CROSS_PRODUCT([1, 2, 3, 4], [1, 2, 3, 4]);
ERROR 1105 (HY000): errCode = 2, detailMessage = (127.0.0.1)[INVALID_ARGUMENT]function array_cross_product requires both input arrays to have exactly 3 elements, got 4 and 4

```
### What problem does this PR solve?

Problem Summary: Replace direct typeid_cast usage for Doris column type
checks with the column-specific check_and_get_column helper. This keeps
column downcast checks consistent across core column code, expression
evaluation, storage segment code, and related table reader tests without
changing behavior.

### Release note

None

…ure (#63565)

`(int)(a.getId() - b.getId())` overflows when BE ID delta exceeds
Integer.MAX_VALUE, breaking the Comparator contract and causing stream
load to fail with "Comparison method violates its general contract!".
Use `Long.compare` instead. Same fix applied to CloudSystemInfoService.
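
The overflow is easy to reproduce in isolation. The sketch below compares plain `Long` IDs rather than Doris `Backend` objects, so the class and field names are illustrative only.

```java
import java.util.Comparator;

// Illustrative comparators over raw IDs; the real code compares backend objects.
final class BackendIdOrderSketch {
    // Broken: the long difference is truncated to int and can change sign.
    static final Comparator<Long> BY_CAST = (a, b) -> (int) (a - b);

    // Fixed: Long.compare never overflows.
    static final Comparator<Long> BY_COMPARE = Long::compare;
}
```

For IDs `0` and `2147483648`, `BY_CAST` reports "less than" in both argument orders because the difference truncates to `Integer.MIN_VALUE` either way. That asymmetry is what the sort's contract check detects.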
Deduplicate equivalent PENDING one-shot TABLE warm up jobs by
destination cluster, normalized table set, and force flag.
Deduplicate equivalent PENDING one-shot CLUSTER warm up jobs by
source/destination cluster pair.
Reuse the oldest matching pending job and return its job id instead of
appending another pending duplicate.
Keep RUNNING jobs out of deduplication and preserve the existing
PERIODIC / EVENT_DRIVEN behavior.
Add unit tests for table/cluster deduplication, replay handling, and
regression coverage.
…ewOlapScanner (#61072)

- decouple realtime FileCache profile updates from the
local/remote-bytes branch and update once per cycle when file cache is
enabled
- reset file_cache_stats as a whole (`= {}`) after realtime reporting,
instead of resetting only bytes_read_from_local/bytes_read_from_remote
- prevent inflated FileCache profile counters caused by repeated
accumulation (e.g. LockWaitTimer, CacheGetOrSetTimer,
BytesWriteIntoCache)
… to avoid reserved-keyword parse failure (#63747)

### What problem does this PR solve?

Problem Summary:
When a routine load job uses a column name that is a SQL reserved
keyword (e.g., `group`) in a PRECEDING FILTER clause, the
Nereids-to-legacy expression translator sets the slot label as the raw
name (e.g., `group`) without quoting. When the legacy
expression SQL is later re-parsed (e.g., during routine load reparse via
`NereidsLoadUtils.parseExpressionSeq`), the unquoted
reserved keyword causes a parse failure, pausing the routine load job.

This PR quotes the slot label using `SqlUtils.getIdentSql()` so that
reserved-keyword column names are properly backtick-quoted in
the translated legacy expression SQL, preventing the parse failure.
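
To illustrate why quoting matters, here is a minimal stand-in for an identifier-quoting helper. It is not `SqlUtils.getIdentSql()` itself, whose exact behaviour is not shown in this description; the escaping of embedded backticks by doubling is an assumption based on common MySQL-style quoting.

```java
// Hypothetical stand-in for an identifier-quoting helper such as SqlUtils.getIdentSql().
final class IdentQuoteSketch {
    static String quoteIdent(String name) {
        // Assumption: embedded backticks are escaped by doubling them.
        return "`" + name.replace("`", "``") + "`";
    }

    // Builds a filter fragment that stays parseable when the column is a reserved keyword.
    static String greaterThan(String column, int value) {
        return quoteIdent(column) + " > " + value;
    }
}
```

With this, a column named `group` is emitted as `` `group` > 1 `` rather than the unparseable `group > 1`.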
### What problem does this PR solve?

Issue Number: close #63603

Problem Summary:

`MysqlProto.readLenEncodedString` reads a length-encoded integer and
passes it straight to `new byte[(int) length]` with no bound. The length
is fully attacker-controlled (a `0xFE` lead byte carries an 8-byte
value), and it is read before authentication from
`MysqlAuthPacket.readFrom` (the auth-response field at
`MysqlAuthPacket.java:93` and the connection-attributes loop at
`MysqlAuthPacket.java:110-118`). A small handshake response can
therefore request
a ~2 GiB allocation, and a length with the high bit set casts to a
negative size (`NegativeArraySizeException`).

This PR rejects a length that is negative or larger than the bytes
remaining in the buffer before allocating. A well-formed length-encoded
string's payload always fits in the remaining buffer, so valid input is
unaffected. One guard covers both reach paths.
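
The guard can be sketched as a bounds check that runs before the allocation. This is a simplified illustration that takes the already-decoded length as a parameter; the class name, method signature, and exception type are assumptions and do not reproduce `MysqlProto`.

```java
import java.nio.ByteBuffer;

// Simplified sketch: the real method also decodes the length-encoded integer itself.
final class LenEncodedStringSketch {
    static byte[] readBytes(ByteBuffer buf, long declaredLength) {
        // Reject negative lengths (high bit set) and lengths larger than the remaining payload.
        if (declaredLength < 0 || declaredLength > buf.remaining()) {
            throw new IllegalArgumentException(
                    "invalid length " + declaredLength + ", remaining " + buf.remaining());
        }
        byte[] out = new byte[(int) declaredLength]; // safe: bounded by buf.remaining()
        buf.get(out);
        return out;
    }
}
```

Because the check compares against `buf.remaining()`, the allocation size is capped by the size of the packet actually received.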
…on't enable fqdn mode in fe.conf because of using dns resolve firstly but not ip directly (#62139)

### What problem does this PR solve?

Fix slow `SHOW FRONTENDS` when FQDN mode is not enabled in `fe.conf`.
The slowness came from attempting DNS resolution first instead of using
the IP directly.
### What problem does this PR solve?

`COM_RESET_CONNECTION` was accepted by Doris, but its behavior was not
compatible with MySQL. The previous implementation cleared the current
catalog/database state and returned OK after only a partial reset. This
could make pooled clients, such as C# MySqlConnector with
`ConnectionReset=True`, fail later unqualified SQL with `Current
database is not set`. Other session-scoped state, including user
variables and prepared statements, also needed to be reset consistently.

### What is changed?

- Preserve the current catalog/database state across
`COM_RESET_CONNECTION` so pooled connections can continue using the
selected database.
- Reset session variables, user variables, prepared statements, running
query state, insert result, command state, and returned row count.
- Roll back transaction state during reset and return an error if
rollback fails.
- Drop temporary tables during reset and return an error if cleanup
fails.
- Return OK with the autocommit server status when reset succeeds.
- Return the MySQL-compatible unknown prepared statement error when
executing a statement cleared by reset.
- Extend regression and FE unit coverage for reset behavior, error
handling, and current database preservation.
Related PR: #63145

Problem Summary: This re-submits the OSS mTLS framework work from #63145
under my account and rebases it onto the latest apache/doris master. The
change ports the public mTLS scaffolding, configuration, protocol
startup split, certificate-auth contracts, and TLS validation tests
while excluding enterprise module directories.

After the rebase, the previous FE UT failures were fixed: the
ALTER/CREATE USER TLS unit tests now use Mockito static mocking instead
of external JMockit parameter injection, and the MetaServiceProxy
success-path test now stubs the mock client as using the latest channel
configuration so the proxy does not replace it before executing the
request.

---------

Co-authored-by: Siyang Tang <tangsiyang@selectdb.com>
…e writes (#62880)

### What problem does this PR solve?
Related PR: #62578

1. PR #62578 moved MaxCompute write block ID allocation from BE-local
counters to
  Instead of calling FE through the BE JNI C++ bridge:

  MaxCompute connector Java -> BE JNI C++ -> FE
  the MaxCompute connector now requests FE directly through thrift:
  MaxCompute connector Java -> FE

A new MaxComputeFeClient is added under the MaxCompute connector to
handle FE
  methods.

  2. Removes the hardcoded `MAX_BLOCK_COUNT` variable from
  `fe/fe-core/src/main/java/org/apache/doris/datasource/maxcompute/MCTransaction.java`
  and moves it to the FE config `max_compute_write_max_block_count`.
  The default value is still 20000, so the existing behavior is preserved.
### What problem does this PR solve?

Problem Summary:

Routine load lag was refreshed mainly when task scheduling needed to
recheck latest offsets after consuming the cached end offset. If
producers continued appending data while the job was running, the cached
latest offsets could become stale, so the reported routine load lag was
not real-time enough.

This PR refreshes routine load lag cache during `RoutineLoadScheduler`
rounds. The metric path still only reads in-memory state and does not
call Kafka directly.

For routine load jobs, the latest offset cache is refreshed for current
progress partitions. Concurrent updates from job scheduling and task
scheduling are handled with monotonic atomic max updates, so latest
offsets do not regress. Kafka metadata requests also use snapshots of
broker/topic/converted properties.
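
A "monotonic atomic max update" can be expressed with `AtomicLong.accumulateAndGet`. The sketch below shows the idea for a single partition; the class name and the `-1` initial value are illustrative assumptions, not the actual routine load code.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative per-partition cache entry; the real cache is keyed by Kafka partition.
final class LatestOffsetSketch {
    private final AtomicLong latest = new AtomicLong(-1L);

    // Stores max(current, observed) atomically, so a stale observation never lowers the value.
    long advanceTo(long observed) {
        return latest.accumulateAndGet(observed, Math::max);
    }

    long get() {
        return latest.get();
    }
}
```

If job scheduling and task scheduling both report offsets concurrently, the order in which they arrive does not matter because the stored value only moves forward.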
… refactoring #62306 and legacy issues (#62821)

### What problem does this PR solve?

Related PR: #62306

Problem Summary:
This PR fixes some issues caused by the refactoring #62306 and legacy
issues:

1. For Iceberg/Paimon systems, it's necessary to pass metadata partition
values for each split. Simply relying on information from files to
obtain partition values is unreliable, especially for tables migrated
from Hive.

2. Condition cache conflicts with CountReader and Lazy RF; see comments
in `be/src/exec/scan/file_scanner.cpp` for details.

3. PR #62306 omitted handling of Iceberg name_mapping.

### Release note

None

### What problem does this PR solve?

Related PR: #63469

Problem Summary:

`#63469` truncates segment key bounds before storing segment statistics,
but the current implementation first copies the full `KeyBoundsPB` and
then calls `resize()` on the protobuf string fields.

For very long keys, `resize()` reduces the visible string size but may
keep the original large string capacity. After the truncated
`SegmentStatistics` is moved into `_segid_statistics_map`, the rowset
writer can still retain buffers sized for the original full key bounds.

This PR changes the write path to build the stored `SegmentStatistics`
with freshly assigned truncated key bound strings, avoiding the
full-copy-then-resize pattern. The segcompaction segment stats path is
updated in the same way.
… format (#63570)

V3 layout:
|data1..dataN|varuint_len1..varuint_lenN|data_block_size(u32)|num_elems(u32)|

Benchmark (median of 10 reps): page pre-decode is ~1.0–3.6x faster than
V2 (largest for short values), and the contiguous layout compresses
~1–11% smaller after ZSTD.
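
To make the layout concrete, here is a small round-trip sketch of a page shaped like the V3 diagram. Several details are assumptions that the description does not confirm: the varuint is taken to be a LEB128-style 7-bit encoding, the two `u32` trailer fields are taken to be little-endian, and the class and method names are invented for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical encoder/decoder for |data...|varuint lens...|data_block_size|num_elems|.
final class PlainPageV3Sketch {
    static byte[] encode(List<byte[]> values) {
        ByteArrayOutputStream data = new ByteArrayOutputStream();
        ByteArrayOutputStream lens = new ByteArrayOutputStream();
        for (byte[] v : values) {
            data.write(v, 0, v.length);
            int len = v.length;
            while ((len & ~0x7F) != 0) { // assumed varuint: 7 bits per byte, low group first
                lens.write((len & 0x7F) | 0x80);
                len >>>= 7;
            }
            lens.write(len);
        }
        ByteBuffer page = ByteBuffer.allocate(data.size() + lens.size() + 8)
                .order(ByteOrder.LITTLE_ENDIAN); // assumed byte order for the trailer
        page.put(data.toByteArray()).put(lens.toByteArray());
        page.putInt(data.size()).putInt(values.size());
        return page.array();
    }

    static List<byte[]> decode(byte[] pageBytes) {
        ByteBuffer page = ByteBuffer.wrap(pageBytes).order(ByteOrder.LITTLE_ENDIAN);
        int dataSize = page.getInt(pageBytes.length - 8);
        int numElems = page.getInt(pageBytes.length - 4);
        int lenPos = dataSize; // the length section starts right after the data block
        int dataPos = 0;
        List<byte[]> out = new ArrayList<>(numElems);
        for (int i = 0; i < numElems; i++) {
            int len = 0;
            int shift = 0;
            byte b;
            do {
                b = pageBytes[lenPos++];
                len |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            out.add(Arrays.copyOfRange(pageBytes, dataPos, dataPos + len));
            dataPos += len;
        }
        return out;
    }
}
```

Keeping all value bytes contiguous and the lengths in a separate run is what the benchmark note credits for the faster pre-decode and the better ZSTD ratio.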
`create_texpr_literal_node<TYPE_VARBINARY>` treated the input pointer as
`std::string*`, but Doris `Field` stores `TYPE_VARBINARY` values as
`StringView`. When TopN predicate conversion builds a VARBINARY literal
from a `Field`, the helper reinterprets a `StringView*` as a
`std::string*`, which can make `std::string` assignment read a bogus
size and request a huge allocation under ASAN.

This PR reads VARBINARY literal input as `StringView`, copies the exact
byte range into the thrift literal, and adds VARBINARY coverage for
`create_texpr_node_from(Field, TYPE_VARBINARY, ...)` and `VLiteral`
round trip. It also wires the `const void*` helper for `TYPE_VARBINARY`.
…e candidate (#64062)

### What problem does this PR solve?

Problem Summary:

Add `tools/release-tools/`, a set of helper scripts for a Release
Manager (RM) to cut an Apache Doris **source** release candidate in
three steps:

- `01-check-env.sh` — check / prepare the GPG signing environment and
ASF credentials.
- `02-package-sign-upload.sh` — `git archive` the tag, GPG-sign,
generate sha512, upload to the dev SVN.
- `03-vote-mail.sh` — generate the `[VOTE]` email draft.
- `release.env` — shared config (version, paths, signing key, SVN URLs,
email); edit per release.
- `README.md` — usage.

The scripts are reusable across releases (everything version-specific
lives in `release.env`). Branch prep, issue cleanup, patch merges and
tag creation are out of scope.
morrySnow and others added 30 commits June 23, 2026 10:26
`test_show_create_table_nereids` duplicates `test_show_create_table`, and
the outfile of `test_show_create_table` is unused.
#64562)

### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary:

`regression-test/suites/load_p2/tvf/test_s3_tvf.groovy` configured
several S3 TVF attributes with both a virtual-host style URI and
`use_path_style=true`:

```text
uri = "s3://${bucket}.${endpoint}/..."
use_path_style = "true"
```

These two settings conflict. Aliyun OSS rejects path-style access for
this bucket with `HTTP 403 SecondLevelDomainForbidden` and the message
`Please use virtual hosted style to access`. The regression case could
pass on the previous endpoint, but failed after the P2 environment
switched to the Aliyun internal Beijing endpoint where OSS enforces
virtual-host style access.

This PR fixes the regression case by removing the six active
`addProperty("use_path_style", "true")` settings whose URI is already in
virtual-host form, so the SDK sends requests in the addressing style
required by OSS. The remaining S3 TVF attributes in this file do not set
`use_path_style` and keep their previous behavior.
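
The difference between the two addressing styles comes down to where the bucket name appears in the URL. The helper below is a hypothetical illustration and is not part of the test suite or the S3 SDK.

```java
// Hypothetical URL builders contrasting the two S3 addressing styles.
final class S3AddressingSketch {
    // Virtual-host style: the bucket is part of the host name.
    static String virtualHostUrl(String bucket, String endpoint, String key) {
        return "https://" + bucket + "." + endpoint + "/" + key;
    }

    // Path style: the bucket is the first path segment.
    static String pathStyleUrl(String bucket, String endpoint, String key) {
        return "https://" + endpoint + "/" + bucket + "/" + key;
    }
}
```

A URI that already carries the bucket in the host, combined with `use_path_style=true`, asks for both forms at once, which is the conflict the PR removes.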

This PR also improves the failure path in the test. Previously,
`assertTrue(attribute.expectFiled)` threw immediately and the later
`logger.info("error: ", ex)` line was skipped, so the failure only
showed a bare assertion line. The exception is now logged before the
assertion, and the assertion message includes the loop index, table
name, property map, and original error message.

Manual verification used the same OSS bucket, key prefix, and endpoint
while only changing the addressing style:

```text
Path-style request:
https://oss-cn-beijing.aliyuncs.com/doris-regression-bj/?prefix=regression/load/data/kd16=abcdefg/&max-keys=2
Result: HTTP 403 SecondLevelDomainForbidden

Virtual-host request:
https://doris-regression-bj.oss-cn-beijing.aliyuncs.com/?prefix=regression/load/data/kd16=abcdefg/&max-keys=2
Result: HTTP 200 OK
```

The same behavior was reproduced through Doris S3 TVF: the query fails
with `use_path_style=true` and succeeds after the property is removed.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Improve BE unit coverage for low-covered `be/src/core` paths. The change
removes unused core helpers (`ArenaWithFreeLists`, `nested_utils`, and
an unused `LargeIntValue` hash helper), then adds focused tests for
int128 utilities, large integer stream conversion, column filter helper
behavior, comparison specializations, and several data type serde
branches including Nothing, Bitmap, HLL, QuantileState, Variant, Map,
and Struct.

### Release note

None
### What problem does this PR solve?
The original description of `sum0` is hard to understand, so this PR rewrites the comments.
…63366)

Problem Summary:

Move local exchange (LE) planning from BE's `_plan_local_exchange`
(pipeline build time) to a new FE-side planner. The FE planner mirrors
BE semantics, brings several correctness fixes, and is gated by a
session variable so the legacy BE path stays available as a fallback.

**Core design**

- New `AddLocalExchange` pass runs after `DistributePlanner`, walking
each fragment's plan tree bottom-up via the polymorphic
`PlanNode.enforceAndDeriveLocalExchange()`. Each node declares what
distribution it requires of its children; the framework inserts
`LocalExchangeNode` where needed.
- `LocalExchangeNode` represents intra-fragment data redistribution and
supports PASSTHROUGH, GLOBAL/LOCAL/BUCKET HASH_SHUFFLE, BROADCAST,
PASS_TO_ONE, ADAPTIVE_PASSTHROUGH, LOCAL_MERGE_SORT, NOOP.
- Per-BE instance semantics: `maxPerBeInstances` (max pipeline instances
assigned to any single BE) is used instead of global instance count to
match BE's `_num_instances` check. Planning is a no-op when
`maxPerBeInstances == 1`.
- Serial → non-serial fan-out: when a serial operator feeds a non-serial
parent without an intermediate LE, the framework inserts a PASSTHROUGH
LE to restore N-task parallelism, matching BE's
`required_data_distribution()` rule.
- Requirement-based exchange type resolution via
`LocalExchangeTypeRequire`: `RequireHash` adapts to any hash flavour,
`RequireSpecific` preserves the exact requested type.

**AggregationNode correctness fixes**

PR #62438 introduced a semantic split for
`required_data_distribution=HASH` (correctness-required vs
performance-only). BE's `!_needs_finalize &&
!enable_local_exchange_before_agg → base` early-return conflates both
intents in `AggSinkOperatorX` and `DistinctStreamingAggOperatorX`,
wrongly catching FIRST_MERGE (correctness) / non-streaming dedup
(correctness) and producing PASSTHROUGH-over-serial-child → wrong
aggregation results. The FE planner adds the missing `!isMerge()` /
`useStreamingPreagg=true` guards so FIRST_MERGE and non-streaming dedup
always emit HASH, regardless of the flag. Also adds
`requiresShuffleForCorrectness()` (mirrors BE's
`is_shuffled_operator()`) so SetOperationNode propagates the "downstream
depends on hash" flag correctly through chains.

**Session variables**

- `enable_local_shuffle_planner` (default true) — use FE planner; when
false, BE plans LE itself via the legacy path.
- `enable_local_shuffle` — master switch.
- `enable_local_exchange_before_agg` — mirrors #62438.

**Architectural notes**

This PR puts the FE planner in the driver's seat for LE insertion but
intentionally keeps BE-side machinery as a fallback:

1. `is_serial_operator` is still computed on both sides — any future
change to BE's per-operator C++ override must be mirrored in FE.
2. Legacy BE planner
(`pipeline_fragment_context.cpp::_plan_local_exchange`) is preserved and
gated by `runtime_state.h::plan_local_shuffle()`; the two paths are
mutually exclusive.
3. `_propagate_local_exchange_num_tasks` is kept as a runtime safety net
for paired-pipeline num_tasks mismatches.

**Thrift enum rename**

The intra-fragment exchange enum is renamed `ExchangeType` →
`TLocalPartitionType` (and `HASH_SHUFFLE` →
`GLOBAL_EXECUTION_HASH_SHUFFLE`) for clarity; the BE operator headers
are updated mechanically. This accounts for the otherwise-mechanical BE
`.h` churn.

### Release note

Add session variable `enable_local_shuffle_planner` (default true) to
control whether local exchange nodes are planned in FE (new path) or in
BE (legacy `_plan_local_exchange`). The two paths are mutually
exclusive; the legacy path remains intact behind this flag.

Co-authored-by: Gabriel <liwenqiang@selectdb.com>
### What problem does this PR solve?

Some function implementations cloned nullable null maps, array offsets,
or pass-through columns even though the result only needs to share
immutable column data. This change reuses those COW subcolumns directly
in non-mutating paths and keeps explicit clones for paths that modify
result data.


### Release note

None
…ion (#64593)

Optimize `percentile_reservoir` aggregation performance by reducing
per-row aggregate function overhead and using faster internal sorting
for reservoir samples.

before:
```sql
Doris> select percentile_reservoir(FUniqID, 0.9999) from hits_100m;
+---------------------------------------+
| percentile_reservoir(FUniqID, 0.9999) |
+---------------------------------------+
|                 9.222511254540202e+18 |
+---------------------------------------+
1 row in set (1.292 sec)
```

now:
```sql
Doris> select percentile_reservoir(FUniqID, 0.9999) from hits_100m;
+---------------------------------------+
| percentile_reservoir(FUniqID, 0.9999) |
+---------------------------------------+
|                 9.222511254540202e+18 |
+---------------------------------------+
1 row in set (0.537 sec)
```
### What problem does this PR solve?

Problem Summary: Consecutive TopN nodes were merged only when the child
order key list was a prefix of the parent order key list. When the
parent order key list was shorter and was instead a prefix of the child
list, the rule kept both TopN nodes even though the child ordering can
serve as a deterministic tie-breaker for the parent ordering. This
change allows that prefix direction, keeps the longer order key list in
the merged TopN, and adjusts LogicalTopN.withOrderKeys typing so callers
preserve their child type.
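
The two-directional prefix check can be sketched over plain lists of order keys. This is an illustration only: the class and method names are hypothetical, order keys are modelled as generic list elements, and the handling of limits and offsets in the real rule is left out.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of the order-key compatibility check; not the Nereids rule itself.
final class TopNMergeSketch {
    static <T> boolean isPrefix(List<T> shorter, List<T> longer) {
        return shorter.size() <= longer.size()
                && longer.subList(0, shorter.size()).equals(shorter);
    }

    // Returns the order keys of the merged TopN, or empty if the two nodes cannot be merged.
    static <T> Optional<List<T>> mergedOrderKeys(List<T> parentKeys, List<T> childKeys) {
        if (isPrefix(childKeys, parentKeys)) {
            return Optional.of(parentKeys); // previously supported direction
        }
        if (isPrefix(parentKeys, childKeys)) {
            return Optional.of(childKeys);  // newly allowed: keep the longer child list
        }
        return Optional.empty();
    }
}
```

In both mergeable cases the longer list is kept, so the extra keys act as tie-breakers for the shorter ordering.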
Problem Summary: concat_ws has a BE execution path for a single array
argument. When the array column row itself is NULL, the executor still
walked the nested array data and could return values from nested storage
instead of treating the NULL array row as empty input. Also, if the
optimizer rewrite is disabled, multiple array arguments can reach this
BE array path and were silently executed using only the first array
argument. This change keeps concat_ws return nullability unchanged,
skips nested data for NULL array rows, and rejects array-form concat_ws
calls unless the executor receives exactly separator plus one array
argument.

### Release note

Fix wrong concat_ws results for nullable array inputs and return an
error for unsupported multiple-array execution without optimizer
rewrite.
…4668)

## Problem

`test_sql_block_rule_status` fails intermittently on the community P0
pipeline (observed ~2/8 builds, clustered under high CI load across
several unrelated PRs). The failure is the exact-value assertion on the
`BLOCKS` column:

```
assertEquals("1", statusRows[0][9].toString())   // expected 1, actual 2
```

## Root cause

`BLOCKS` is read from `SqlBlockRule.getBlockCount()`, a **process-wide,
monotonically increasing** `LongCounterMetric`. The rule under test is
created with `global=true`, so its counter is shared cluster-wide and is
**not isolated** to this test's single matching query.

On a quiet FE a single matching query deterministically yields `BLOCKS
== 1`, but any additional matching evaluation of the same statement
under concurrent CI load (e.g. a transient JDBC/network statement
re-delivery) bumps the shared counter past 1. The defect is the test
asserting an exact value on a shared monotonic counter — not the
counting logic itself. This is a pre-existing flake, not introduced by
any specific PR.

## Fix

Assert the meaningful invariant — *the rule fired at least once* —
instead of an exact, racy count:

```groovy
assertTrue(Integer.parseInt(statusRows[0][9].toString()) >= 1,
        "BLOCKS should be >= 1 but was ${statusRows[0][9]}")
```
## Summary
- add Variant NestedGroup to the build feature list output in build.sh

## Validation
- bash -n build.sh
- git diff --check
…4221)

### What problem does this PR solve?

`vtablet_writer` and `vtablet_writer_v2` used fixed 10ms polling loops
while waiting for downstream node channels / load streams to finish
close or reach quorum success. When downstream recovery is slow,
upstream close wait may repeatedly scan unfinished channels and consume
unnecessary CPU.

This PR changes close wait to an event-driven wakeup model:

- `vtablet_writer`:
  - Adds a close wait condition variable and version counter in
    `IndexChannel`.
  - `VNodeChannel` notifies close wait when the last add-block RPC
    finishes or when the channel is cancelled.
  - `IndexChannel::close_wait()` waits on the notification instead of
    polling every 10ms.

- `vtablet_writer_v2`:
  - Adds close wait notification helpers in `LoadStreamStub`.
  - Stream close and cancel paths notify close wait.
  - `VTabletWriterV2::_close_wait()` waits on stream close events instead
    of polling every 10ms.

The existing quorum success logic and max wait timeout behavior are
preserved. A bounded fallback wait is kept so timeout and cancellation
state can still be refreshed even if no downstream event arrives.
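
The event-driven wait with a bounded fallback might look roughly like the sketch below. It is written in Python for brevity and is not the BE code: `CloseWaiter`, `notify`, `wait_until`, and the `fallback_s` default are hypothetical names and values.

```python
import threading
import time
from typing import Callable


class CloseWaiter:
    """Hypothetical stand-in for the close-wait state described above."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._version = 0  # bumped on every close or cancel event

    def notify(self) -> None:
        """Called by a downstream channel/stream when it finishes or is cancelled."""
        with self._cond:
            self._version += 1
            self._cond.notify_all()

    def wait_until(self, done: Callable[[], bool], max_wait_s: float,
                   fallback_s: float = 0.5) -> bool:
        """Block until done() is true; return False if max_wait_s elapses first."""
        deadline = time.monotonic() + max_wait_s
        with self._cond:
            while not done():
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False
                seen = self._version
                # Sleep until an event bumps the version, or until the bounded
                # fallback expires so timeout/cancel state is re-checked.
                self._cond.wait_for(lambda: self._version != seen,
                                    timeout=min(fallback_s, remaining))
            return True
```

The version counter lets the waiter distinguish a real event from a spurious or fallback wakeup, while the fallback timeout keeps the maximum-wait deadline enforceable even if no event ever arrives.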
… reuse the cdc reader (#64423)

### What problem does this PR solve?

Problem Summary:

For from-to (MySQL/PG CDC) streaming jobs, once a job enters the
incremental (binlog) phase, two issues hurt throughput:

- On the **FE** side, every polling round (default `max_interval` = 10s)
re-selects a BE via global round-robin, so the task drifts across BEs
with no job→BE affinity.
- On the **cdc_client** side, although per-job reader ownership and a
per-job fixed replication slot already exist, the live reader is not
actually reused: the stream reader is closed and rebuilt on every round.

As a result every round rebuilds the reader. For PG this means
reconnecting the replication slot and re-locating the WAL position (~15s
each round), which together with large-transaction buffering is a major
cause of idle / low-throughput stalls in the incremental phase.
Stream load does not support `compress_type=zstd` in the shared load
format parser. Async group commit also checks only legacy compressed CSV
format enum values when estimating compressed input size, so
`compress_type` based compressed input is not handled consistently by
stream load and HTTP stream load.

This PR adds ZSTD parsing in `LoadUtil::parse_format`, adds a shared
`LoadUtil::is_compressed_load` helper for `compress_type` and legacy
compressed CSV format types, and uses it in stream load and HTTP stream
group commit paths. This PR also adds BE UT and regression coverage for
ZSTD CSV/JSON stream load and group commit stream/HTTP stream load.
### What problem does this PR solve?

Issue Number: None

Related PR: #57133

Problem Summary:

`BaseTabletsChannel::_write_block_data` can run concurrently with
`incremental_open` for the same tablets channel. `_tablet_writers` is an
`std::unordered_map` protected by `_tablet_writers_lock` when writers
are inserted, but the tablet load rowset info lookup read the map
without holding the lock.

A concurrent `emplace` may rehash `_tablet_writers`, so the unlocked
lookup can race with bucket reallocation. This patch protects the lookup
with `_tablet_writers_lock` and avoids using unordered_map iterators
after the lock is released. The actual writer operations still run
outside `_tablet_writers_lock`, so the lock remains scoped to the map
access.
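
The lock scoping described above can be illustrated with a small sketch. It is in Python, so it cannot reproduce the `std::unordered_map` rehash hazard itself; it only shows the pattern of holding the lock for the lookup and doing the write outside it. The `TabletsChannel` class and its methods are simplified, hypothetical stand-ins.

```python
import threading
from typing import Dict, List, Optional


class TabletsChannel:
    """Hypothetical, simplified model of a channel that owns tablet writers."""

    def __init__(self) -> None:
        self._tablet_writers_lock = threading.Lock()
        self._tablet_writers: Dict[int, List[str]] = {}

    def incremental_open(self, tablet_id: int) -> None:
        # Insertions mutate the map, so they take the lock.
        with self._tablet_writers_lock:
            self._tablet_writers.setdefault(tablet_id, [])

    def write(self, tablet_id: int, row: str) -> bool:
        # Hold the lock only for the lookup and copy the writer reference out;
        # no handle into the map itself survives past this block.
        with self._tablet_writers_lock:
            writer: Optional[List[str]] = self._tablet_writers.get(tablet_id)
        if writer is None:
            return False
        # The actual write runs outside the map lock.
        writer.append(row)
        return True
```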
### What problem does this PR solve?


Add an explicit block check to reject null column or type pointers at
operator sink/get_block boundaries, while keeping the existing type
compatibility check unchanged.


### Release note

None
This PR exposes `variant_enable_nested_group` as a public VARIANT
property and wires the related configuration through parser/type
serialization.

Main changes:
- Allow `variant_enable_nested_group` in VARIANT predefined fields.
- Disable doc mode and sparse-column related options when NestedGroup is
enabled.
- Serialize `variant_enable_nested_group` in `VariantType#toSql`.
- Add `variant_nested_group_max_depth` config and make the default
NestedGroup write provider explicitly return not-supported status when
the write path is unavailable.
- Update FE/BE tests for parser behavior, type serialization, and
disabled NestedGroup write-path handling.
### What problem does this PR solve?

Routine load submit failures can renew a task directly from the
scheduler after the task has begun a transaction. That path mutates the
job's `routineLoadTaskInfoList` without holding the job write lock,
racing with scheduler idle-slot counting that reads the same list. This
PR protects the submit-failure renew path with the job write lock,
matching the existing timeout and transaction-status renew paths, and
adds unit coverage for the locking behavior.
… ARN (#64766)

### What problem does this PR solve?

S3 storage vault creation only treated role ARN as the
credential-provider path. When users configured
`s3.credentials_provider_type` without `s3.role_arn`, FE did not persist
the provider type into `ObjectStoreInfoPB`, and Cloud meta-service still
required AK/SK for the vault. The recycler also only read credential
provider type inside the role ARN branch.

This change allows S3 storage vaults to use an explicit credentials
provider type without role ARN. FE now writes `cred_provider_type` when
`s3.credentials_provider_type` or `AWS_CREDENTIALS_PROVIDER_TYPE` is
set, Cloud meta-service accepts credential-provider-based S3 vaults
without AK/SK, and the recycler reads the provider type independently
from role ARN.
### What problem does this PR solve?

Before this change, S3-compatible glob listing derived the object-store
`ListObjects` prefix by stopping at the first glob metacharacter. For a
path like:


`s3://bucket/asin_trend/sale/month/date=2025-{0[3-9],1[0-2]}-01/mp_id=8/0/0/436/*`

the old behavior listed the broad prefix:

`asin_trend/sale/month/date=2025-`

and then filtered all returned object keys in FE. If many unrelated
objects existed under `date=2025-*`, for example other dates, `mp_id`s,
or deeper paths, S3 TVF planning could spend a long time listing and
filtering files before query execution started.

After this change, Doris expands safely enumerable glob fragments before
issuing object-store list requests. The same path is now listed through
narrower prefixes such as:

`asin_trend/sale/month/date=2025-03-01/mp_id=8/0/0/436/`
...
`asin_trend/sale/month/date=2025-12-01/mp_id=8/0/0/436/`

Doris still applies the full glob regex after listing, so result
correctness is unchanged. The optimization only reduces the remote
listing scope. Expansion is limited to bounded brace alternations and
positive character classes, with a hard cap to avoid generating too many
prefixes. Existing pagination behavior through `startAfter` and
`maxFile` is preserved.
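
A rough sketch of this kind of prefix expansion is shown below. It is written in Python and is not the FE implementation: the function names, the default cap of 64, and the choice to stop at the broader prefix when the cap would be exceeded are all assumptions, and escape characters are not handled.

```python
from typing import List, Tuple


def _matching_brace(pattern: str, start: int) -> int:
    """Index of the '}' closing the '{' at `start`, or -1 if unbalanced."""
    depth = 0
    for i in range(start, len(pattern)):
        if pattern[i] == "{":
            depth += 1
        elif pattern[i] == "}":
            depth -= 1
            if depth == 0:
                return i
    return -1


def _split_top_level(body: str) -> List[str]:
    """Split a brace body on commas outside nested braces and classes."""
    parts, depth, in_class, current = [], 0, False, []
    for ch in body:
        if in_class:
            in_class = ch != "]"
        elif ch == "[":
            in_class = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        if ch == "," and depth == 0 and not in_class:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def _class_chars(body: str) -> List[str]:
    """Enumerate a positive character class body such as '3-9' or 'abc'."""
    chars, i = [], 0
    while i < len(body):
        if i + 2 < len(body) and body[i + 1] == "-":
            chars.extend(chr(c) for c in range(ord(body[i]), ord(body[i + 2]) + 1))
            i += 3
        else:
            chars.append(body[i])
            i += 1
    return chars


def _expand(pattern: str, cap: int) -> Tuple[List[str], bool]:
    """Return (prefixes, fully_expanded) for `pattern`."""
    prefixes, i = [""], 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "{":
            end = _matching_brace(pattern, i)
            if end < 0:
                return prefixes, False
            options: List[str] = []
            for alt in _split_top_level(pattern[i + 1:end]):
                expanded, complete = _expand(alt, cap)
                if not complete:  # alternative contains an unbounded wildcard
                    return prefixes, False
                options.extend(expanded)
        elif ch == "[":
            end = pattern.find("]", i + 1)
            if end < 0 or pattern[i + 1] in "!^":  # negated classes are unbounded
                return prefixes, False
            options = _class_chars(pattern[i + 1:end])
        elif ch in "*?":
            return prefixes, False
        else:
            options, end = [ch], i
        if not options or len(prefixes) * len(options) > cap:
            return prefixes, False  # hard cap: fall back to the broader prefix
        prefixes = [p + o for p in prefixes for o in options]
        i = end + 1
    return prefixes, True


def list_prefixes(glob: str, cap: int = 64) -> List[str]:
    """Literal prefixes to pass to list requests; the full glob still filters results."""
    return _expand(glob, cap)[0]
```

For the path in the example above, `list_prefixes` yields ten prefixes, one per month from `date=2025-03-01` through `date=2025-12-01`, each ending in `mp_id=8/0/0/436/`.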
## Summary
- add a BE CMake option for Variant NestedGroup extension modules
- let the Storage target replace the default provider with external
extension sources when the option is enabled
- let BE unit tests include matching external extension test sources
when the option is enabled
- keep the change limited to existing build files shared by the public
and private trees
…er-row serialization (#64612)

The old per-row estimateSingleRowPayloadBytes ZSTD-serialized a one-row
batch for every row, which was CPU-heavy and produced estimates roughly
25x too large. This change instead sums FieldVector.getBufferSize() over
the whole batch and rotates the block lazily.
### What problem does this PR solve?

Problem Summary: FE metrics exposed current connection counts but did
not expose the configured maximum total connection count or each user's
max_user_connections value. This change adds doris_fe_connection_max
from qe_max_connection and doris_fe_user_connection_max with a user
label. User max connection metrics are initialized from all user
properties when MetricRepo starts and are synchronized when users are
created, dropped, or when user properties are updated or replayed.

### Release note

Add FE metrics doris_fe_connection_max and doris_fe_user_connection_max.
… test (#64756)

### What problem does this PR solve?

Issue Number: #64464

Related PR: N/A

Problem Summary:

When querying a SQL Server JDBC catalog, a predicate on a `bit` column
such as `WHERE bit_value = '1'` is folded to a boolean literal during
analysis. The pushed-down predicate must be rendered as an integer (`=
1` / `= 0`) for SQL Server, never the `TRUE` / `FALSE` keyword: SQL
Server has no boolean literal and reports `SQLServerException: Invalid
column name 'TRUE'` (see #64464).

On current master this is already handled correctly. The JDBC pushdown
path was refactored to the connector SPI (`PluginDrivenScanNode` ->
`ExprToConnectorExpressionConverter` -> `JdbcQueryBuilder`), and
`JdbcQueryBuilder.formatBooleanLiteral()` renders booleans per dialect
(`SQLSERVER` / `ORACLE` / `OCEANBASE_ORACLE` / `DB2` -> `1`/`0`, others
-> `TRUE`/`FALSE`). `JdbcQueryBuilderTest` already unit-tests this.
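
The per-dialect rendering can be pictured with a small sketch. It is in Python and only mirrors the dialect list quoted above; `format_boolean_literal` here is an illustrative function, not the actual `JdbcQueryBuilder` method.

```python
# Dialects that, per the description above, render booleans as 1/0.
INTEGER_BOOLEAN_DIALECTS = {"SQLSERVER", "ORACLE", "OCEANBASE_ORACLE", "DB2"}


def format_boolean_literal(dialect: str, value: bool) -> str:
    """Render a boolean literal for a pushed-down predicate."""
    if dialect.upper() in INTEGER_BOOLEAN_DIALECTS:
        return "1" if value else "0"
    return "TRUE" if value else "FALSE"
```

With this shape, a folded predicate on a SQL Server `bit` column is pushed down as `= 1` or `= 0` and never as the `TRUE` / `FALSE` keyword.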

What was missing is the **end-to-end** regression test that the issue
triage explicitly asked for. This PR adds it to the SQL Server docker
JDBC suite (`test_sqlserver_jdbc_catalog.groovy`), covering `bit_value =
'1'`, `bit_value = '0'` and `bit_value in ('1', '0')`: it asserts via
`explain` that the pushed remote SQL renders `[bit_value] = 1` /
`[bit_value] = 0`, and executes the queries end-to-end (which throw
`Invalid column name 'TRUE'` on the buggy path).

Note: `branch-4.0` still uses the old `JdbcScanNode` /
`ExprToSqlVisitor` path, which renders the dialect-agnostic
`TRUE`/`FALSE` and is what triggers the bug reported in #64464. That
branch needs a separate code fix; this regression test alone would fail
there and is not sufficient on its own.
#64705)

### What problem does this PR solve?

Issue Number: N/A

Related PR: N/A

Problem Summary:
This PR avoids blocking external meta cache invalidation on slow miss
loads in FE. Previously, `MetaCacheEntry` relied on Caffeine's
synchronous loading path for cache misses. When an external metadata
loader became slow, operations that invalidate the same cache, such as
`REFRESH CATALOG` and the corresponding replay path, could wait on the
slow load and block the replay-related invalidation flow.

Implementation summary:
- Keep the existing `LoadingCache` to preserve current hit-path behavior
and `refreshAfterWrite` support.
- Add a manual miss-load path behind a new FE config switch, using
`getIfPresent()` instead of synchronous `LoadingCache.get()` for misses.
- Deduplicate concurrent miss loads with striped locks inside
`MetaCacheEntry`.
- Add an entry-level `invalidateGeneration` counter. Each invalidate
increments the generation before clearing cache state.
- Record the generation before a manual miss load, check it once before
`put()`, and check it again after `put()`. If invalidation happens
during the race window, the just-loaded value is removed so stale data
is not kept in cache.
- Keep null miss-load results uncached so the manual path does not
attempt to put null into Caffeine.
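
The generation check around a manual miss load could be sketched as follows. This is a simplified Python model, not `MetaCacheEntry`: the class and method names are hypothetical, a plain dict stands in for the Caffeine cache, and the striped-lock deduplication of concurrent loads is omitted.

```python
import threading
from typing import Callable, Dict, Optional


class GenerationGuardedCache:
    """Hypothetical model of a miss-load path guarded by an invalidate generation."""

    def __init__(self, loader: Callable[[str], Optional[str]]) -> None:
        self._loader = loader
        self._data: Dict[str, str] = {}
        self._gen_lock = threading.Lock()
        self._generation = 0

    def invalidate_all(self) -> None:
        # Bump the generation before clearing, so in-flight loads can notice.
        with self._gen_lock:
            self._generation += 1
        self._data.clear()

    def get(self, key: str) -> Optional[str]:
        cached = self._data.get(key)
        if cached is not None:
            return cached
        generation = self._generation      # record before the (slow) load
        value = self._loader(key)          # never blocks invalidate_all()
        if value is None:
            return None                    # null results stay uncached
        if self._generation != generation:
            return value                   # invalidated during load: skip put
        self._data[key] = value
        if self._generation != generation:
            self._data.pop(key, None)      # invalidated during put: remove
        return value
```

The caller still receives the value it loaded; the two checks only decide whether that value is allowed to stay in the cache.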

Configuration:
- Add FE config `enable_external_meta_cache_manual_miss_load`, default
`false`.
- When it is `false`, `MetaCacheEntry` keeps the original synchronous
Caffeine miss-load behavior.
- When it is `true`, `MetaCacheEntry` uses the manual miss-load path
plus `invalidateGeneration` protection.

Scope and limitations:
- This change applies to `MetaCacheEntry` used by external metadata
cache paths in FE. It does not cover the legacy `MetaCache`.
- `LegacyMetaCacheFactory` is intentionally not refactored in this PR. A
follow-up PR will rework that path with `MetaCache`, and the legacy
factory changes are left to that dedicated refactor.
- The protection is designed for manual miss loads. It does not make
Caffeine's asynchronous `refreshAfterWrite` reload generation-aware.
- As a result, `refreshAfterWrite` is still preserved, but an async
refresh result may still write back after an invalidate. That is an
intentional trade-off in this version.
- The new regression case is valuable as a reference and for suitable
environments, but it may be skipped in standard CI because it depends on
JDBC regression setup, FE debug points, and an external MySQL/JDBC
environment.

### Release note

None

### Check List (For Author)

- Test
    - [ ] Regression test
    - [x] Unit Test
    - [x] Manual test (add detailed scripts or steps below)
    - [ ] No need to test or manual test. Explain why:
- [ ] This is a refactor/code format and no logic has been changed.
        - [ ] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [ ] Other reason

Manual test:
1. Reproduced the blocking path with `REFRESH CATALOG` against a JDBC
external catalog and a debug point that sleeps in
`PluginDrivenExternalTable.initSchema`.
2. Repeated the baseline scenario 5 times with
`enable_external_meta_cache_manual_miss_load=false` and observed
`REFRESH CATALOG` blocked for about 14s while `DESC` stayed slow.
3. Repeated the optimized scenario 5 times with
`enable_external_meta_cache_manual_miss_load=true` and observed `REFRESH
CATALOG` return within about 1s while `DESC` remained slow.
4. Added a regression case as a manual-test reference because its
execution depends on JDBC regression environment and FE debug-point
availability.

Unit test:
- `FE_UT_PARALLEL=1 ./run-fe-ut.sh --run
org.apache.doris.datasource.metacache.MetaCacheEntryTest`

- Behavior changed:
    - [x] Yes.

Behavior change:
- `REFRESH CATALOG` and the corresponding FE invalidation path are no
longer blocked by slow external metadata miss loads in this
`MetaCacheEntry` implementation.

- Does this need documentation?
    - [x] No.
    - [ ] Yes.

### Check List (For Reviewer who merge this PR)

- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label
Related PR: #63191

Problem Summary:

Arrow 17 defaults to `C++17` when CMAKE_CXX_STANDARD is not specified,
while Doris BE is built with `C++20`. This can make header-defined
inline/template code from Arrow Flight and its dependencies be compiled
under different C++ standard modes in the same final binary.

In particular, Arrow Status-related inline paths may generate different
implementations across C++17 and C++20, such as different initialization
strategies for function-local static std::string objects:

code:
```cpp
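#include <string>

// Function-local static: how it is initialized depends on the C++ standard mode.
const std::string& get_empty_string() {
    static const std::string s = "";
    return s;
}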
```

C++17 lazy initialization:
```asm
get_empty_string[abi:cxx11]():
        push    rbp
        mov     rbp, rsp
        sub     rsp, 64
        cmp     byte ptr [rip + guard variable for get_empty_string[abi:cxx11]()::s[abi:cxx11]], 0
        jne     .LBB0_4
        lea     rdi, [rip + guard variable for get_empty_string[abi:cxx11]()::s[abi:cxx11]]
        call    __cxa_guard_acquire@PLT
        cmp     eax, 0
        je      .LBB0_4
        lea     rdx, [rbp - 33]
        mov     qword ptr [rbp - 32], rdx
        mov     rax, qword ptr [rbp - 32]
        mov     qword ptr [rbp - 8], rax
        lea     rdi, [rip + get_empty_string[abi:cxx11]()::s[abi:cxx11]]
        lea     rsi, [rip + .L.str]
        call    std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>::basic_string<std::allocator<char>>(char const*, std::allocator<char> const&)
        jmp     .LBB0_3
.LBB0_3:
        lea     rax, [rbp - 33]
        mov     qword ptr [rbp - 24], rax
        mov     rdi, qword ptr [rbp - 24]
        call    std::__new_allocator<char>::~__new_allocator() [base object destructor]
        lea     rdi, [rip + std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>::~basic_string() [base object destructor]]
        lea     rsi, [rip + get_empty_string[abi:cxx11]()::s[abi:cxx11]]
        lea     rdx, [rip + __dso_handle]
        call    __cxa_atexit@PLT
        lea     rdi, [rip + guard variable for get_empty_string[abi:cxx11]()::s[abi:cxx11]]
        call    __cxa_guard_release@PLT
```

C++20 constant initialization:
```asm
get_empty_string[abi:cxx11]():
        push    rbp
        mov     rbp, rsp
        lea     rax, [rip + get_empty_string[abi:cxx11]()::s[abi:cxx11]]
        pop     rbp
        ret

get_empty_string[abi:cxx11]()::s[abi:cxx11]:
        .quad   get_empty_string[abi:cxx11]()::s[abi:cxx11]+16
        .quad   0
        .zero   16
```

Mixing those definitions through `weak/COMDAT` symbols is not a
supported build model and can surface as runtime crashes in Flight
error/status handling paths.
### What problem does this PR solve?

ColumnElementView exposed ptr_at() only to adapt predicate set lookups
that accepted raw void pointers. For string columns this required a
mutable temporary StringRef inside the view, making the element access
API harder to reason about.

This change removes ColumnElementView::ptr_at() and the string staging
field, updates in-list predicate evaluation to use
get_element()/get_data() with HybridSetBase::find(data, size), and
replaces the predicate selector macro with a small templated helper that
accepts named lambdas.


### Release note

None

### Check List (For Author)

- Test
    - [ ] Regression test
    - [ ] Unit Test
    - [ ] Manual test (add detailed scripts or steps below)
    - [ ] No need to test or manual test. Explain why:
- [ ] This is a refactor/code format and no logic has been changed.
        - [ ] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [ ] Other reason

- Behavior changed:
    - [ ] No.
    - [ ] Yes.

- Does this need documentation?
    - [ ] No.
    - [ ] Yes.

### Check List (For Reviewer who merge this PR)

- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label
### What problem does this PR solve?



`from_base64` and `from_base64_binary` call the base64 decoder after
sizing the output buffer from `len / 4 * 3`. For invalid input whose
length is not a multiple of four, this can pass an undersized
destination buffer into the decoder before the function marks the row as
invalid.

Root cause: the functions only handled decoder failure after invoking
the decoder, but did not reject impossible base64 lengths first.

This patch returns `NULL` for inputs with invalid base64 length before
decoding, keeping the existing invalid-input SQL behavior while avoiding
unsafe decoder calls.
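
To see why the length check matters: with integer arithmetic, a 7-character input sizes the buffer at `7 / 4 * 3 = 3` bytes, although 7 base64 characters carry 42 bits, which is enough for 5 decoded bytes. The guard can be sketched as below. This is a Python illustration using the standard `base64` module, not the BE function; returning `None` stands in for SQL `NULL`.

```python
import base64
import binascii
from typing import Optional


def from_base64(text: str) -> Optional[bytes]:
    """Decode base64, returning None (SQL NULL) for invalid input."""
    # Reject impossible lengths before the decoder ever runs.
    if len(text) % 4 != 0:
        return None
    try:
        return base64.b64decode(text, validate=True)
    except (binascii.Error, ValueError):
        # Invalid characters or padding still map to NULL, as before.
        return None
```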


### Release note

None