[perf](s3) push down expandable S3 glob prefixes #64684

Merged
sollhui merged 3 commits into apache:master from sollhui:opt_glob_list on Jun 24, 2026

@sollhui

@sollhui sollhui commented Jun 22, 2026

Contributor

What problem does this PR solve?

Before this change, S3-compatible glob listing derived the object-store ListObjects prefix by stopping at the first glob metacharacter. For a path like:

s3://bucket/asin_trend/sale/month/date=2025-{0[3-9],1[0-2]}-01/mp_id=8/0/0/436/*

the old behavior listed the broad prefix:

asin_trend/sale/month/date=2025-

and then filtered every returned object key in the FE. If many unrelated objects existed under date=2025-* (other dates, other mp_id values, or deeper paths), S3 TVF planning could spend a long time listing and filtering files before query execution started.

After this change, Doris expands safely enumerable glob fragments before issuing object-store list requests. The same path is now listed through narrower prefixes such as:

asin_trend/sale/month/date=2025-03-01/mp_id=8/0/0/436/
...
asin_trend/sale/month/date=2025-12-01/mp_id=8/0/0/436/

Doris still applies the full glob regex after listing, so result correctness is unchanged. The optimization only reduces the remote listing scope. Expansion is limited to bounded brace alternations and positive character classes, with a hard cap to avoid generating too many prefixes. Existing pagination behavior through startAfter and maxFile is preserved.
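
For readers who want to see the idea in code, here is a minimal Java sketch of the expansion described above. It is not the code from this PR: the class name GlobPrefixSketch, the cap MAX_PREFIXES = 64, and the choice to keep the prefixes built so far when the cap is hit are all assumptions made for illustration. Nested braces and escape characters are not handled; they simply end the expansion early, which is conservative.

```java
import java.util.ArrayList;
import java.util.List;

final class GlobPrefixSketch {
    static final int MAX_PREFIXES = 64; // hypothetical cap, not the value Doris uses

    /** Expands bounded glob fragments of an object-key pattern into list prefixes. */
    static List<String> expand(String glob) {
        return expand(glob, true);
    }

    // allowPartial=true (top level): stop at the first unsafe fragment, keep what was built.
    // allowPartial=false (inside a brace arm): give up and return null instead.
    private static List<String> expand(String glob, boolean allowPartial) {
        List<String> prefixes = List.of("");
        int i = 0;
        while (i < glob.length()) {
            char c = glob.charAt(i);
            List<String> choices = null; // null means "not safely enumerable"
            int next = i + 1;
            if (c == '{' || c == '[') {
                int close = glob.indexOf(c == '{' ? '}' : ']', i + 1);
                if (close > i) {
                    String body = glob.substring(i + 1, close);
                    choices = c == '{' ? expandArms(body) : expandClass(body);
                    next = close + 1;
                }
            } else if (c != '*' && c != '?') {
                choices = List.of(String.valueOf(c)); // plain literal character
            }
            if (choices == null || prefixes.size() * choices.size() > MAX_PREFIXES) {
                return allowPartial ? prefixes : null;
            }
            List<String> grown = new ArrayList<>();
            for (String p : prefixes) {
                for (String choice : choices) {
                    grown.add(p + choice);
                }
            }
            prefixes = grown;
            i = next;
        }
        return prefixes;
    }

    // Every arm must be fully enumerable; one wildcard-bearing arm rejects the whole brace.
    private static List<String> expandArms(String body) {
        List<String> out = new ArrayList<>();
        for (String arm : body.split(",", -1)) {
            List<String> expanded = expand(arm, false);
            if (expanded == null) {
                return null;
            }
            out.addAll(expanded);
        }
        return out;
    }

    // Positive classes only: single characters and ranges such as 3-9.
    private static List<String> expandClass(String body) {
        if (body.isEmpty() || body.charAt(0) == '!' || body.charAt(0) == '^') {
            return null; // negated or empty class is not enumerable
        }
        List<String> out = new ArrayList<>();
        for (int j = 0; j < body.length(); j++) {
            char lo = body.charAt(j);
            char hi = lo;
            if (j + 2 < body.length() && body.charAt(j + 1) == '-') {
                hi = body.charAt(j + 2);
                j += 2;
            }
            if (hi < lo || hi - lo >= MAX_PREFIXES
                    || Character.isSurrogate(lo) || Character.isSurrogate(hi)) {
                return null;
            }
            for (int ch = lo; ch <= hi; ch++) {
                out.add(String.valueOf((char) ch));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String glob = "asin_trend/sale/month/date=2025-{0[3-9],1[0-2]}-01/mp_id=8/0/0/436/*";
        expand(glob).forEach(System.out::println);
    }
}
```

Run against the key pattern from the example, the sketch yields ten prefixes, from date=2025-03-01 through date=2025-12-01, each ending in mp_id=8/0/0/436/. The trailing * only ends the prefix; matching it is left to the full glob regex applied after listing.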

@hello-stephen

Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@sollhui

sollhui commented Jun 22, 2026

Contributor Author

run buildall

@sollhui

sollhui commented Jun 22, 2026

Contributor Author

/review

@github-actions github-actions Bot left a comment

Contributor

Review result: requesting changes. I found one correctness issue in the new S3-compatible glob-prefix expansion.

Critical checkpoint conclusions:

  • Goal/test: the PR aims to reduce object-store listing scope while preserving glob result correctness. The new tests cover the intended bounded date/character-class case and next-cursor discovery across expanded prefixes, but they do not cover wildcard-bearing brace alternatives.
  • Scope: the code change is focused, but it is in the shared S3-compatible filesystem base used by S3/OSS/COS/OBS.
  • Concurrency/lifecycle/config/compatibility: no new concurrency, special lifecycle, configuration item, or serialization/protocol compatibility concern found.
  • Parallel paths: HDFS/Azure implementations are separate and unchanged; the changed shared base covers the S3-compatible providers consistently.
  • Error handling/observability/performance: existing IOException wrapping and logging behavior are unchanged. The performance optimization is valid for safely enumerable prefixes, but one unsafe expansion path needs correction.
  • Data correctness: one case can skip objects that match the glob before regex filtering, so this review requests changes.
  • Tests: I did not run FE UT locally because thirdparty/installed/bin/protoc is missing, and the standard run-fe-ut.sh path invokes generated-source.sh.

Existing review context: .code-review.Uw9GFW/pr_review_threads.md and pr_review_comments.json showed no existing inline comments or replies.
User focus: no additional user-provided review focus was present.

Subagent conclusions:

  • optimizer-rewrite reported OR-1, the same wildcard-bearing brace expansion bug; it was merged into M-1 and became inline comment C-1.
  • tests-session-config reported TSC-1 as the missing regression-test facet of M-1; it was merged as a duplicate/coverage note, not a separate inline issue.
  • Convergence round 1 ended with both live subagents replying NO_NEW_VALUABLE_FINDINGS for the same current ledger/comment set.

### What problem does this PR solve?

Issue Number: None

Related PR: apache#64684

Problem Summary: The S3 glob prefix pushdown can expand bounded brace alternatives before listing objects. Before this change, when a brace arm contained an unbounded wildcard such as data/{foo*,bar*}/part.parquet, the recursive expansion returned the partial arm prefixes data/foo and data/bar even though those prefixes were not complete enumerable alternatives. The caller then appended the outer suffix and listed data/foo/part.parquet and data/bar/part.parquet, which could miss valid objects such as data/foobar/part.parquet. Now wildcard characters inside a non-partial recursive expansion abort the expansion, so the outer call falls back to the conservative common prefix data/. This preserves correctness while still allowing fully bounded alternatives to be pushed down.
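
The missed-object case can be reproduced with a small in-memory simulation. This is only a sketch: the key list stands in for a bucket, the class name BraceArmFallbackSketch is hypothetical, and the regex is hand-written under the assumption that * does not match across /.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class BraceArmFallbackSketch {
    // In-memory stand-in for the bucket contents (hypothetical keys).
    static final List<String> KEYS = List.of(
            "data/bar/part.parquet",
            "data/foo/part.parquet",
            "data/foobar/part.parquet",
            "data/other/part.parquet");

    // Hand-written regex for data/{foo*,bar*}/part.parquet, assuming '*' does not cross '/'.
    static final Pattern GLOB = Pattern.compile("data/(foo[^/]*|bar[^/]*)/part\\.parquet");

    /** Simulates "list each prefix, then apply the full glob regex". */
    static List<String> listThenFilter(List<String> prefixes) {
        return KEYS.stream()
                .filter(key -> prefixes.stream().anyMatch(key::startsWith))
                .filter(key -> GLOB.matcher(key).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Unsafe: partial arm prefixes with the outer suffix appended.
        System.out.println(listThenFilter(
                List.of("data/foo/part.parquet", "data/bar/part.parquet")));
        // Safe: the conservative common prefix.
        System.out.println(listThenFilter(List.of("data/")));
    }
}
```

With the unsafe prefixes, data/foobar/part.parquet is never listed, so the regex filter cannot recover it. Listing data/ returns it, and the regex still drops data/other/part.parquet.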

### Release note

Fix S3 glob listing correctness for brace alternatives whose arms contain wildcard characters.

### Check List (For Author)

- Test: Unit Test
    - mvn -f fe/pom.xml -pl fe-filesystem/fe-filesystem-s3 -am -Dtest=S3FileSystemTest -DfailIfNoTests=false test
- Behavior changed: Yes. S3 glob prefix pushdown now falls back to the conservative common prefix instead of listing incorrect suffixed partial prefixes when a brace arm contains unbounded wildcards.
- Does this need documentation: No
@sollhui

sollhui commented Jun 22, 2026

Contributor Author

/review

@sollhui

sollhui commented Jun 22, 2026

Contributor Author

run buildall

@hello-stephen

Contributor
TPC-H: Total hot run time: 28886 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit c8cd3f30caa3f4e1679f02e715a25b7a08f4c2f0, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17628	4091	4028	4028
q2	2032	306	194	194
q3	10298	1448	841	841
q4	4686	480	343	343
q5	7522	851	577	577
q6	190	166	139	139
q7	762	844	631	631
q8	9331	1547	1613	1547
q9	5758	4489	4526	4489
q10	6760	1782	1542	1542
q11	431	267	245	245
q12	626	424	292	292
q13	18120	3347	2716	2716
q14	275	259	244	244
q15	q16	783	772	705	705
q17	995	877	1024	877
q18	6765	5603	5452	5452
q19	1343	1270	1073	1073
q20	492	402	264	264
q21	6038	2698	2386	2386
q22	440	366	301	301
Total cold run time: 101275 ms
Total hot run time: 28886 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4316	4249	4237	4237
q2	330	354	224	224
q3	4609	4917	4422	4422
q4	2062	2158	1363	1363
q5	4414	4303	4320	4303
q6	231	172	125	125
q7	1729	1623	1471	1471
q8	2876	2234	2153	2153
q9	8207	8316	8000	8000
q10	4835	4787	4311	4311
q11	569	406	368	368
q12	785	742	556	556
q13	3328	3544	2878	2878
q14	282	310	293	293
q15	q16	729	720	649	649
q17	1344	1329	1323	1323
q18	8060	7323	7280	7280
q19	1163	1166	1122	1122
q20	2225	2207	1947	1947
q21	5253	4547	4355	4355
q22	503	454	393	393
Total cold run time: 57850 ms
Total hot run time: 51773 ms

@hello-stephen

Contributor
TPC-DS: Total hot run time: 172490 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit c8cd3f30caa3f4e1679f02e715a25b7a08f4c2f0, data reload: false

query5	4306	623	497	497
query6	443	187	180	180
query7	4821	561	304	304
query8	379	206	198	198
query9	8790	4031	4019	4019
query10	444	308	260	260
query11	5872	2364	2136	2136
query12	156	103	101	101
query13	1243	602	425	425
query14	6333	5411	5048	5048
query14_1	4370	4389	4373	4373
query15	218	197	181	181
query16	988	477	448	448
query17	954	707	586	586
query18	2435	476	349	349
query19	210	194	148	148
query20	111	109	103	103
query21	237	143	119	119
query22	13643	13601	13328	13328
query23	17308	16491	16129	16129
query23_1	16243	16285	16299	16285
query24	7566	1797	1322	1322
query24_1	1321	1306	1341	1306
query25	583	469	406	406
query26	1337	333	171	171
query27	2638	546	345	345
query28	4459	2037	2046	2037
query29	1114	624	482	482
query30	316	228	199	199
query31	1116	1068	960	960
query32	113	62	62	62
query33	526	339	261	261
query34	1243	1131	650	650
query35	743	771	668	668
query36	1375	1377	1219	1219
query37	157	101	88	88
query38	1913	1733	1655	1655
query39	911	915	897	897
query39_1	866	908	867	867
query40	237	117	98	98
query41	68	63	64	63
query42	86	86	85	85
query43	317	327	278	278
query44	1409	765	782	765
query45	190	186	175	175
query46	1057	1208	756	756
query47	2354	2347	2266	2266
query48	414	393	292	292
query49	619	462	341	341
query50	990	355	267	267
query51	4358	4236	4231	4231
query52	83	81	69	69
query53	250	261	188	188
query54	280	206	206	206
query55	72	69	66	66
query56	225	220	208	208
query57	1419	1415	1327	1327
query58	241	209	214	209
query59	1546	1595	1414	1414
query60	288	242	221	221
query61	156	148	150	148
query62	695	645	580	580
query63	236	189	199	189
query64	2509	751	588	588
query65	4843	4774	4807	4774
query66	1781	481	335	335
query67	29786	29167	29655	29167
query68	3298	1507	1000	1000
query69	411	307	268	268
query70	1055	959	897	897
query71	285	230	212	212
query72	2930	2641	2315	2315
query73	853	780	425	425
query74	5121	4918	4795	4795
query75	2622	2581	2227	2227
query76	2331	1199	763	763
query77	365	383	269	269
query78	12380	12519	11853	11853
query79	1425	1166	821	821
query80	1270	463	385	385
query81	517	279	241	241
query82	610	159	125	125
query83	358	276	240	240
query84	317	149	112	112
query85	902	535	415	415
query86	446	318	298	298
query87	1855	1833	1762	1762
query88	3696	2792	2782	2782
query89	427	374	333	333
query90	1929	187	177	177
query91	169	163	136	136
query92	62	63	56	56
query93	1477	1523	968	968
query94	742	358	311	311
query95	707	462	348	348
query96	1082	777	341	341
query97	2694	2703	2550	2550
query98	219	210	198	198
query99	1154	1149	1023	1023
Total cold run time: 258481 ms
Total hot run time: 172490 ms

@github-actions github-actions Bot left a comment

Contributor

Reviewed the S3-compatible glob prefix expansion changes. I do not have additional inline findings beyond the existing thread about brace arms containing unbounded wildcards; the current head addresses that case by aborting non-partial brace-arm expansion on */? and falling back to the conservative prefix.

Checkpoint conclusions:

  • Goal and coverage: the PR narrows object-store listing for safely enumerable glob prefixes while still filtering with the full glob regex. The added S3 tests cover the date-prefix expansion, fallback for wildcard brace arms, and finding the next match across expanded prefixes after a page limit.
  • Scope: the change is focused on the S3-compatible filesystem helper/listing path plus S3 unit test exposure.
  • Correctness: multi-prefix scans are sorted and compacted before listing, keep startAfter as the global lower bound for each first page, and continue into later prefixes after hitting a limit until the next matching key is found.
  • Parallel paths: S3/OSS/COS/OBS share S3CompatibleFileSystem; Azure/HDFS/broker keep separate glob-list implementations and are not required to receive this optimization.
  • Config, compatibility, concurrency, lifecycle, persistence: no new config/session variables, protocol/storage-format changes, concurrency surfaces, or lifecycle/persistence changes were introduced.
  • Testing/style: test coverage is appropriate for the changed helper and listing behavior. No additional regression issue was found.
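
The multi-prefix behavior in the Correctness bullet can be sketched roughly as follows. This is an illustration under assumptions, not the S3CompatibleFileSystem code: a TreeSet stands in for the object store, the names MultiPrefixListingSketch, Page, and nextMatch are hypothetical, and keys are ASCII so Java string order agrees with S3 byte order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.function.Predicate;

final class MultiPrefixListingSketch {
    /** One page of matches plus the first match beyond the page (null if none). */
    static final class Page {
        final List<String> files;
        final String nextMatch;

        Page(List<String> files, String nextMatch) {
            this.files = files;
            this.nextMatch = nextMatch;
        }
    }

    /** Sorts prefixes, drops duplicates, and drops any prefix covered by a shorter one. */
    static List<String> compact(List<String> prefixes) {
        List<String> kept = new ArrayList<>();
        for (String p : new TreeSet<>(prefixes)) {
            if (kept.isEmpty() || !p.startsWith(kept.get(kept.size() - 1))) {
                kept.add(p);
            }
        }
        return kept;
    }

    static Page list(TreeSet<String> bucket, List<String> prefixes, String startAfter,
                     int maxFile, Predicate<String> glob) {
        List<String> files = new ArrayList<>();
        for (String prefix : compact(prefixes)) {
            for (String key : bucket.tailSet(prefix, true)) {
                if (!key.startsWith(prefix)) {
                    break; // left this prefix's key range
                }
                // startAfter is a global lower bound; the glob filter runs after listing.
                if (key.compareTo(startAfter) <= 0 || !glob.test(key)) {
                    continue;
                }
                if (files.size() == maxFile) {
                    return new Page(files, key); // page full: report the next match
                }
                files.add(key);
            }
        }
        return new Page(files, null);
    }

    public static void main(String[] args) {
        TreeSet<String> bucket = new TreeSet<>(
                List.of("d=03/a", "d=03/b", "d=04/zzz", "d=10/a", "d=99/a"));
        Page page = list(bucket, List.of("d=10/", "d=03/"), "", 2, k -> !k.endsWith("zzz"));
        System.out.println(page.files + " next=" + page.nextMatch);
    }
}
```

Because the compacted prefixes are disjoint and sorted, visiting them in order yields keys in global order, so a single startAfter cursor stays meaningful across prefixes. In the usage above, the page fills inside d=03/ and the scan continues into d=10/ to find the next match.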

Subagent conclusions: optimizer-rewrite and tests-session-config both converged on NO_NEW_VALUABLE_FINDINGS; their only candidates were duplicate/dismissed notes, so the final inline comment set is empty.

@hello-stephen

Contributor
ClickBench: Total hot run time: 25.35 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit c8cd3f30caa3f4e1679f02e715a25b7a08f4c2f0, data reload: false

query1	0.00	0.00	0.00
query2	0.10	0.06	0.05
query3	0.26	0.14	0.14
query4	1.62	0.14	0.14
query5	0.25	0.22	0.22
query6	1.21	1.07	1.04
query7	0.04	0.01	0.01
query8	0.06	0.04	0.04
query9	0.39	0.36	0.34
query10	0.55	0.55	0.54
query11	0.19	0.15	0.14
query12	0.19	0.15	0.15
query13	0.48	0.49	0.47
query14	1.01	1.02	1.00
query15	0.60	0.60	0.59
query16	0.32	0.31	0.31
query17	1.13	1.14	1.12
query18	0.24	0.21	0.21
query19	1.96	1.99	1.95
query20	0.01	0.01	0.01
query21	15.43	0.24	0.13
query22	4.85	0.05	0.05
query23	16.17	0.30	0.12
query24	3.14	0.42	0.33
query25	0.12	0.05	0.05
query26	0.73	0.20	0.16
query27	0.05	0.03	0.05
query28	3.57	0.95	0.53
query29	12.51	4.34	3.48
query30	0.27	0.16	0.15
query31	2.77	0.59	0.31
query32	3.22	0.61	0.49
query33	3.16	3.23	3.24
query34	15.67	4.20	3.52
query35	3.48	3.53	3.52
query36	0.55	0.43	0.40
query37	0.09	0.07	0.06
query38	0.06	0.04	0.04
query39	0.04	0.03	0.03
query40	0.18	0.16	0.15
query41	0.09	0.03	0.03
query42	0.04	0.03	0.03
query43	0.04	0.04	0.04
Total cold run time: 96.84 s
Total hot run time: 25.35 s

@hello-stephen

Contributor
TPC-H: Total hot run time: 28918 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 8a60d3f5b7e3900e09c7452de95d1783e5976be0, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17602	4076	4046	4046
q2	2034	306	184	184
q3	10304	1396	812	812
q4	4673	465	332	332
q5	7518	842	562	562
q6	179	162	134	134
q7	791	862	623	623
q8	9337	1562	1617	1562
q9	5789	4473	4529	4473
q10	6796	1795	1549	1549
q11	441	279	246	246
q12	624	430	288	288
q13	18100	3448	2744	2744
q14	270	259	235	235
q15	q16	789	787	714	714
q17	981	958	900	900
q18	6853	5668	5485	5485
q19	1301	1362	1099	1099
q20	501	392	266	266
q21	5928	2643	2368	2368
q22	434	361	296	296
Total cold run time: 101245 ms
Total hot run time: 28918 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4350	4280	4270	4270
q2	337	360	223	223
q3	4555	4917	4348	4348
q4	2071	2136	1361	1361
q5	4405	4290	4304	4290
q6	236	175	129	129
q7	1705	1639	1602	1602
q8	2809	2174	2177	2174
q9	8115	8304	8051	8051
q10	4784	4746	4300	4300
q11	577	418	407	407
q12	738	752	532	532
q13	3188	3641	2970	2970
q14	307	300	269	269
q15	q16	731	728	650	650
q17	1355	1304	1322	1304
q18	7950	7144	7185	7144
q19	1195	1167	1156	1156
q20	2265	2204	1943	1943
q21	5203	4526	4417	4417
q22	528	456	402	402
Total cold run time: 57404 ms
Total hot run time: 51942 ms

@hello-stephen

Contributor
TPC-DS: Total hot run time: 172474 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 8a60d3f5b7e3900e09c7452de95d1783e5976be0, data reload: false

query5	4334	626	475	475
query6	433	185	171	171
query7	4851	530	308	308
query8	367	215	192	192
query9	8727	4092	4074	4074
query10	463	309	252	252
query11	5934	2310	2123	2123
query12	162	102	99	99
query13	1308	626	409	409
query14	6293	5406	5073	5073
query14_1	4387	4357	4373	4357
query15	208	197	176	176
query16	1010	455	454	454
query17	956	713	576	576
query18	2510	512	350	350
query19	206	190	149	149
query20	112	119	109	109
query21	214	147	116	116
query22	13601	13591	13434	13434
query23	17498	16434	16231	16231
query23_1	16347	16248	16194	16194
query24	7574	1780	1305	1305
query24_1	1316	1331	1314	1314
query25	566	470	398	398
query26	1305	320	176	176
query27	2667	542	347	347
query28	4435	2109	2067	2067
query29	1066	609	474	474
query30	306	243	198	198
query31	1119	1068	954	954
query32	102	60	58	58
query33	515	306	243	243
query34	1176	1160	653	653
query35	740	784	671	671
query36	1347	1396	1241	1241
query37	157	108	90	90
query38	1890	1714	1671	1671
query39	919	918	902	902
query39_1	895	892	856	856
query40	210	120	97	97
query41	64	62	61	61
query42	88	88	88	88
query43	315	322	290	290
query44	1412	787	777	777
query45	196	185	173	173
query46	1105	1177	740	740
query47	2397	2357	2222	2222
query48	406	408	291	291
query49	601	454	351	351
query50	1062	341	266	266
query51	4391	4248	4236	4236
query52	81	80	69	69
query53	246	261	190	190
query54	268	207	197	197
query55	74	70	66	66
query56	234	211	226	211
query57	1428	1408	1348	1348
query58	241	203	207	203
query59	1591	1660	1487	1487
query60	281	248	259	248
query61	148	148	149	148
query62	688	643	586	586
query63	226	188	200	188
query64	2525	765	590	590
query65	4830	4787	4765	4765
query66	1783	475	337	337
query67	30046	29807	29686	29686
query68	3062	1498	940	940
query69	431	310	270	270
query70	1062	945	985	945
query71	292	236	209	209
query72	2874	2621	2307	2307
query73	808	790	423	423
query74	5110	4946	4803	4803
query75	2594	2587	2241	2241
query76	2306	1155	778	778
query77	353	382	274	274
query78	12394	12659	11859	11859
query79	1342	1123	761	761
query80	568	514	379	379
query81	456	282	238	238
query82	362	152	119	119
query83	356	279	245	245
query84	305	144	107	107
query85	884	530	421	421
query86	375	308	267	267
query87	1831	1827	1765	1765
query88	3685	2795	2775	2775
query89	441	373	328	328
query90	1860	187	178	178
query91	169	155	136	136
query92	61	58	52	52
query93	1442	1485	834	834
query94	540	345	301	301
query95	743	375	341	341
query96	1023	821	361	361
query97	2681	2683	2547	2547
query98	221	202	201	201
query99	1199	1174	1042	1042
Total cold run time: 257223 ms
Total hot run time: 172474 ms

@hello-stephen

Contributor
ClickBench: Total hot run time: 25.04 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 8a60d3f5b7e3900e09c7452de95d1783e5976be0, data reload: false

query1	0.00	0.00	0.01
query2	0.09	0.05	0.05
query3	0.25	0.14	0.13
query4	1.61	0.14	0.14
query5	0.23	0.22	0.22
query6	1.26	1.07	1.06
query7	0.04	0.01	0.01
query8	0.06	0.03	0.04
query9	0.37	0.35	0.30
query10	0.54	0.55	0.55
query11	0.19	0.14	0.14
query12	0.18	0.15	0.14
query13	0.47	0.47	0.48
query14	1.01	1.03	0.99
query15	0.62	0.58	0.62
query16	0.32	0.32	0.31
query17	1.11	1.15	1.08
query18	0.22	0.21	0.21
query19	2.05	1.88	1.97
query20	0.01	0.02	0.01
query21	15.43	0.21	0.13
query22	4.86	0.05	0.04
query23	16.14	0.31	0.12
query24	2.95	0.42	0.31
query25	0.10	0.05	0.04
query26	0.72	0.22	0.16
query27	0.03	0.04	0.03
query28	3.54	0.95	0.54
query29	12.50	4.26	3.43
query30	0.27	0.14	0.16
query31	2.77	0.59	0.32
query32	3.22	0.59	0.48
query33	3.20	3.27	3.17
query34	15.67	4.19	3.51
query35	3.45	3.52	3.55
query36	0.54	0.44	0.43
query37	0.09	0.07	0.06
query38	0.04	0.04	0.04
query39	0.03	0.03	0.03
query40	0.18	0.16	0.14
query41	0.09	0.03	0.03
query42	0.04	0.03	0.03
query43	0.05	0.04	0.04
Total cold run time: 96.54 s
Total hot run time: 25.04 s

@liaoxin01

Contributor

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

@liaoxin01 liaoxin01 left a comment

Contributor

I found three additional correctness/compatibility issues in the current glob-prefix expansion: Unicode code-point handling, cross-prefix ordering, and S3 directory-bucket restrictions.

### What problem does this PR solve?

Issue Number: None

Related PR: apache#64684

Problem Summary: The S3 glob prefix pushdown expanded character classes and sorted expanded prefixes using Java UTF-16 char/String semantics. Before this change, a character class containing a supplementary Unicode code point such as an emoji could be expanded into unpaired surrogate prefixes, so the object listing could miss keys that the final glob regex would match. Expanded prefixes were also compacted in Java String order, which can differ from S3 UTF-8 binary listing order and make cross-prefix pagination unstable for supplementary characters. In addition, S3 directory bucket endpoints can reject expanded prefixes that do not end in '/', and their listing model is not compatible with the multi-prefix optimization. Now character classes containing surrogate code units fall back to the conservative static prefix, expanded prefixes are ordered by unsigned UTF-8 bytes, and S3 Express directory bucket endpoints use the slash-terminated static prefix instead of expanded multi-prefix listing.
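
The ordering difference is easy to demonstrate. The sketch below is illustrative only: Utf8OrderSketch, UTF8_BYTES, and containsSurrogate are hypothetical names, and the comparator uses the standard Arrays.compareUnsigned (Java 9+) rather than whatever helper the PR actually adds.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;

final class Utf8OrderSketch {
    /** Orders strings by their unsigned UTF-8 bytes, the order S3 uses for listing. */
    static final Comparator<String> UTF8_BYTES = (a, b) -> Arrays.compareUnsigned(
            a.getBytes(StandardCharsets.UTF_8), b.getBytes(StandardCharsets.UTF_8));

    /** True if the text holds any UTF-16 surrogate code unit (a supplementary character). */
    static boolean containsSurrogate(String text) {
        return text.chars().anyMatch(c -> Character.isSurrogate((char) c));
    }

    public static void main(String[] args) {
        String bmp = "\uFF5E";         // U+FF5E, a single UTF-16 code unit
        String emoji = "\uD83D\uDE00"; // U+1F600, a surrogate pair in UTF-16

        // Java String order compares code units: 0xFF5E sorts after 0xD83D.
        System.out.println(bmp.compareTo(emoji) > 0);
        // UTF-8 byte order: EF BD 9E sorts before F0 9F 98 80.
        System.out.println(UTF8_BYTES.compare(bmp, emoji) < 0);
        // A class such as [<emoji>] would be split into unpaired surrogates if expanded per char.
        System.out.println(containsSurrogate(emoji));
    }
}
```

The two comparisons disagree for this pair, which is why prefixes sorted with String.compareTo can be visited in a different order than the store returns keys.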

### Release note

Fix S3 glob listing correctness for supplementary Unicode characters and avoid expanded glob prefix listing on S3 Express directory bucket endpoints.

### Check List (For Author)

- Test: Unit Test
    - mvn -f fe/pom.xml -pl fe-filesystem/fe-filesystem-s3 -am -Dtest=S3FileSystemTest -DfailIfNoTests=false test
- Behavior changed: Yes. Unsafe S3 glob prefix expansion now falls back to conservative listing, expanded prefix order follows S3 UTF-8 binary order, and S3 Express directory bucket endpoints do not use expanded multi-prefix listing.
- Does this need documentation: No
@sollhui

sollhui commented Jun 23, 2026

Contributor Author

/review

@github-actions github-actions Bot left a comment

Contributor

Reviewed the current head against the changed S3 filesystem files and the existing inline review context. I did not find a new, non-duplicate issue to submit. The existing inline threads cover the previously substantiated risks around brace arms with unbounded wildcards, supplementary character classes, UTF-8 prefix ordering, and directory-bucket listing semantics.

Checkpoint conclusions:

  • Goal/test coverage: The PR improves S3-compatible glob listing by expanding bounded prefixes and adds focused S3 unit coverage for the reported edge cases. I could not run the targeted FE unit test because ./run-fe-ut.sh --run org.apache.doris.filesystem.s3.S3FileSystemTest fails during generated-source setup: thirdparty/installed/bin/protoc is missing in this runner.
  • Scope: The implementation is localized to S3-compatible glob prefix expansion, S3 directory-bucket fallback, and matching unit tests.
  • Concurrency/lifecycle: No new shared mutable state, threads, locks, or lifecycle-sensitive static initialization beyond immutable constants/comparators.
  • Configuration/compatibility: No user-facing config item is added. Production S3 provider construction passes typed S3FileSystemProperties, so the directory-bucket endpoint check is available on the normal SPI path. Non-directory S3-compatible providers retain the conservative fallback when expansion is unsafe or too large.
  • Parallel paths: S3, OSS, COS, and OBS share the S3-compatible base path; their object-list adapters honor continuation/start-after style options. AWS S3 directory-bucket behavior remains the already-raised existing thread rather than a new comment.
  • Tests/style: New unit tests cover the fixed edge cases, and git diff --check passed. Targeted FE UT execution was blocked by missing thirdparty dependencies.
  • Observability/transactions/persistence/FE-BE protocol: Not applicable to this filesystem-only change.

User focus: No additional user-provided review focus was supplied.

Subagent conclusions: optimizer-rewrite reported NO_NEW_VALUABLE_FINDINGS; tests-session-config reported NO_NEW_VALUABLE_FINDINGS and identified only the duplicate directory-bucket StartAfter concern already covered by discussion_r3453830420. Convergence round 1 ended with both live subagents reporting NO_NEW_VALUABLE_FINDINGS for the same ledger/comment set.

@sollhui

sollhui commented Jun 23, 2026

Contributor Author

run buildall

@hello-stephen

Contributor

FE UT Coverage Report

Increment line coverage `` 🎉
Increment coverage report
Complete coverage report

@hello-stephen

Contributor
TPC-H: Total hot run time: 29171 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 5c8e3ab366456f4bc02a629801137455c0900bae, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17683	4029	3987	3987
q2	2019	308	186	186
q3	10323	1491	822	822
q4	4678	470	333	333
q5	7534	870	573	573
q6	181	168	133	133
q7	772	843	623	623
q8	9405	1639	1642	1639
q9	6527	4572	4526	4526
q10	6793	1829	1532	1532
q11	437	272	244	244
q12	634	426	295	295
q13	18152	3417	2771	2771
q14	265	258	237	237
q15	q16	784	775	715	715
q17	1035	1003	1003	1003
q18	7159	5902	5541	5541
q19	1166	1231	1085	1085
q20	494	393	262	262
q21	5570	2552	2362	2362
q22	436	359	302	302
Total cold run time: 102047 ms
Total hot run time: 29171 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4350	4257	4250	4250
q2	325	352	224	224
q3	4578	5029	4474	4474
q4	2069	2206	1383	1383
q5	4442	4321	4333	4321
q6	224	175	126	126
q7	1727	1922	1797	1797
q8	2482	2205	2102	2102
q9	7982	7963	7932	7932
q10	4878	4770	4278	4278
q11	567	404	368	368
q12	897	768	536	536
q13	3274	3616	2972	2972
q14	290	306	272	272
q15	q16	703	733	664	664
q17	1362	1323	1331	1323
q18	8167	7497	6942	6942
q19	1131	1078	1076	1076
q20	2224	2209	1943	1943
q21	5268	4537	4450	4450
q22	516	454	400	400
Total cold run time: 57456 ms
Total hot run time: 51833 ms

@hello-stephen

Contributor
TPC-DS: Total hot run time: 172828 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 5c8e3ab366456f4bc02a629801137455c0900bae, data reload: false

query5	4301	640	470	470
query6	427	185	172	172
query7	4850	574	295	295
query8	360	212	198	198
query9	8742	4006	4013	4006
query10	429	320	252	252
query11	5837	2325	2140	2140
query12	153	127	94	94
query13	1243	605	438	438
query14	6355	5318	4981	4981
query14_1	4323	4324	4322	4322
query15	203	198	174	174
query16	1014	440	430	430
query17	913	692	556	556
query18	2415	468	335	335
query19	202	176	135	135
query20	110	108	105	105
query21	209	135	122	122
query22	13609	13648	13327	13327
query23	17434	16640	16269	16269
query23_1	16381	16295	16321	16295
query24	7511	1792	1319	1319
query24_1	1282	1315	1300	1300
query25	523	451	360	360
query26	1315	298	161	161
query27	2763	576	347	347
query28	4521	2043	2011	2011
query29	1053	604	475	475
query30	317	240	199	199
query31	1127	1083	973	973
query32	110	61	55	55
query33	515	307	246	246
query34	1191	1126	638	638
query35	760	792	683	683
query36	1408	1395	1275	1275
query37	155	102	93	93
query38	1893	1715	1679	1679
query39	962	946	892	892
query39_1	878	883	873	873
query40	231	126	105	105
query41	72	67	67	67
query42	90	90	87	87
query43	318	322	281	281
query44	1418	773	777	773
query45	200	184	181	181
query46	1082	1189	766	766
query47	2423	2374	2272	2272
query48	446	422	304	304
query49	637	483	366	366
query50	990	362	272	272
query51	4297	4292	4205	4205
query52	84	82	70	70
query53	256	261	194	194
query54	278	225	212	212
query55	74	69	66	66
query56	252	231	242	231
query57	1465	1459	1345	1345
query58	251	223	218	218
query59	1597	1637	1458	1458
query60	300	251	234	234
query61	180	171	168	168
query62	719	660	565	565
query63	229	193	209	193
query64	2591	801	647	647
query65	4877	4821	4807	4807
query66	1794	480	353	353
query67	29798	29617	29629	29617
query68	3074	1459	921	921
query69	429	308	288	288
query70	1087	1007	959	959
query71	297	235	215	215
query72	2943	2639	2346	2346
query73	843	747	445	445
query74	5129	4924	4803	4803
query75	2632	2596	2241	2241
query76	2308	1172	821	821
query77	355	369	278	278
query78	12366	12695	11924	11924
query79	1370	1185	764	764
query80	1257	480	387	387
query81	511	281	236	236
query82	586	157	118	118
query83	327	270	241	241
query84	310	144	115	115
query85	907	511	405	405
query86	427	293	257	257
query87	1830	1813	1782	1782
query88	3700	2811	2752	2752
query89	427	372	336	336
query90	1861	176	173	173
query91	167	158	134	134
query92	65	61	57	57
query93	1538	1463	831	831
query94	769	343	310	310
query95	678	381	440	381
query96	1143	778	364	364
query97	2769	2719	2612	2612
query98	222	211	197	197
query99	1210	1160	1038	1038
Total cold run time: 258808 ms
Total hot run time: 172828 ms

@hello-stephen

Contributor
ClickBench: Total hot run time: 25.18 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 5c8e3ab366456f4bc02a629801137455c0900bae, data reload: false

query1	0.00	0.00	0.00
query2	0.10	0.05	0.05
query3	0.26	0.13	0.13
query4	1.61	0.14	0.14
query5	0.26	0.23	0.22
query6	1.25	1.08	1.02
query7	0.03	0.01	0.00
query8	0.06	0.04	0.04
query9	0.38	0.31	0.32
query10	0.56	0.54	0.58
query11	0.20	0.14	0.14
query12	0.19	0.14	0.14
query13	0.46	0.47	0.47
query14	1.01	1.00	1.02
query15	0.62	0.61	0.59
query16	0.30	0.33	0.31
query17	1.06	1.07	1.09
query18	0.22	0.20	0.21
query19	2.09	1.99	1.96
query20	0.02	0.01	0.02
query21	15.44	0.22	0.14
query22	4.87	0.04	0.05
query23	16.14	0.31	0.11
query24	2.88	0.40	0.31
query25	0.13	0.05	0.04
query26	0.73	0.20	0.15
query27	0.05	0.03	0.04
query28	3.52	0.94	0.54
query29	12.50	4.37	3.48
query30	0.27	0.15	0.15
query31	2.78	0.60	0.31
query32	3.22	0.61	0.49
query33	3.17	3.26	3.23
query34	15.57	4.25	3.51
query35	3.52	3.57	3.52
query36	0.55	0.43	0.42
query37	0.09	0.07	0.07
query38	0.04	0.04	0.03
query39	0.04	0.04	0.03
query40	0.18	0.15	0.14
query41	0.08	0.04	0.03
query42	0.04	0.03	0.03
query43	0.04	0.04	0.04
Total cold run time: 96.53 s
Total hot run time: 25.18 s

@hello-stephen

Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/561) 🎉
Increment coverage report
Complete coverage report

@hello-stephen

Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/569) 🎉
Increment coverage report
Complete coverage report

@liaoxin01 liaoxin01 left a comment

Contributor

LGTM

@github-actions github-actions Bot added the approved label (Indicates a PR has been approved by one committer.) Jun 23, 2026
@github-actions

Contributor

PR approved by at least one committer and no changes requested.

@github-actions

Contributor

PR approved by anyone and no changes requested.

@hello-stephen

Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/569) 🎉
Increment coverage report
Complete coverage report

@sollhui sollhui merged commit bacfb0f into apache:master Jun 24, 2026
37 checks passed
@sollhui sollhui deleted the opt_glob_list branch June 24, 2026 06:34
4 participants