[fix](job) lock routine load task renew on submit failure#64731

Merged
sollhui merged 1 commit into apache:master from sollhui:fix_rl_task_null on Jun 24, 2026

Conversation

@sollhui

@sollhui sollhui commented Jun 23, 2026

Contributor

What problem does this PR solve?

Routine load submit failures can renew a task directly from the scheduler after the task has begun a transaction. That path mutates the job's routineLoadTaskInfoList without holding the job write lock, racing with scheduler idle-slot counting that reads the same list. This PR protects the submit-failure renew path with the job write lock, matching the existing timeout and transaction-status renew paths, and adds unit coverage for the locking behavior.
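
The locking pattern described above can be sketched as follows. This is a minimal illustration, not the Doris implementation: `SketchJob`, `countTasks`, and `renewTask` are hypothetical names, and the task list is reduced to a list of string ids. The point is only that the writer (renew) and the reader (idle-slot counting) go through the same read-write lock.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for a routine load job; not the Doris class.
class SketchJob {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);
    private final List<String> taskIds = new ArrayList<>();

    void addTask(String id) {
        lock.writeLock().lock();
        try {
            taskIds.add(id);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Reader side: something like idle-slot counting reads the list under the read lock.
    int countTasks() {
        lock.readLock().lock();
        try {
            return taskIds.size();
        } finally {
            lock.readLock().unlock();
        }
    }

    // Writer side: the failed task is swapped for a replacement while the write lock is held,
    // so a concurrent reader never observes the list between remove and add.
    String renewTask(String oldId) {
        lock.writeLock().lock();
        try {
            taskIds.remove(oldId);
            String newId = oldId + "-renewed";
            taskIds.add(newId);
            return newId;
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

Without the write lock in `renewTask`, a reader could iterate the `ArrayList` while it is being modified, which is the kind of race the description refers to.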

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen

Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@sollhui

sollhui commented Jun 23, 2026

Contributor Author

/review

@sollhui

sollhui commented Jun 23, 2026

Contributor Author

run buildall

@sollhui sollhui force-pushed the fix_rl_task_null branch from 8cd389b to ae2d5f7 Compare June 23, 2026 09:33
@sollhui

sollhui commented Jun 23, 2026

Contributor Author

run buildall

@sollhui

sollhui commented Jun 23, 2026

Contributor Author

/review

@hello-stephen

Contributor
TPC-H: Total hot run time: 29139 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit ae2d5f7f8a0adba6b59afd51f363cf64d2e7ea80, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17700	4054	4063	4054
q2	2059	313	184	184
q3	10275	1440	825	825
q4	4681	469	346	346
q5	7602	877	573	573
q6	184	172	140	140
q7	777	828	626	626
q8	9332	1591	1586	1586
q9	5571	4517	4487	4487
q10	6742	1798	1570	1570
q11	444	273	245	245
q12	627	410	297	297
q13	18091	3338	2733	2733
q14	270	266	253	253
q15	q16	787	788	709	709
q17	911	928	961	928
q18	7171	5920	5487	5487
q19	1334	1269	1026	1026
q20	489	390	269	269
q21	5836	2608	2497	2497
q22	423	354	304	304
Total cold run time: 101306 ms
Total hot run time: 29139 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4325	4248	4282	4248
q2	320	347	220	220
q3	4619	4966	4408	4408
q4	2061	2152	1377	1377
q5	4440	4295	4311	4295
q6	241	179	131	131
q7	1723	1614	1794	1614
q8	2688	2226	2221	2221
q9	8345	8404	8211	8211
q10	4787	4760	4288	4288
q11	579	421	409	409
q12	771	780	561	561
q13	3268	3516	2899	2899
q14	298	306	290	290
q15	q16	714	739	625	625
q17	1343	1313	1342	1313
q18	8027	7414	7328	7328
q19	1185	1105	1100	1100
q20	2236	2259	1991	1991
q21	5254	4581	4434	4434
q22	510	447	402	402
Total cold run time: 57734 ms
Total hot run time: 52365 ms

@hello-stephen

Contributor
TPC-DS: Total hot run time: 173022 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit ae2d5f7f8a0adba6b59afd51f363cf64d2e7ea80, data reload: false

query5	4345	635	479	479
query6	446	197	182	182
query7	4815	558	300	300
query8	378	218	205	205
query9	8777	4135	4100	4100
query10	478	317	264	264
query11	5952	2347	2144	2144
query12	157	104	98	98
query13	1269	607	433	433
query14	6395	5390	5056	5056
query14_1	4411	4397	4383	4383
query15	209	196	178	178
query16	988	463	450	450
query17	951	728	576	576
query18	2472	497	348	348
query19	202	194	149	149
query20	114	111	110	110
query21	218	147	123	123
query22	13707	13595	13421	13421
query23	17422	16527	16164	16164
query23_1	16396	16339	16304	16304
query24	7516	1781	1298	1298
query24_1	1336	1312	1328	1312
query25	573	460	397	397
query26	1301	332	170	170
query27	2640	538	349	349
query28	4455	2051	2009	2009
query29	1072	598	469	469
query30	304	234	197	197
query31	1105	1072	968	968
query32	105	59	57	57
query33	516	320	249	249
query34	1198	1115	634	634
query35	754	792	674	674
query36	1357	1391	1254	1254
query37	152	109	88	88
query38	1881	1722	1658	1658
query39	932	932	893	893
query39_1	896	877	885	877
query40	214	119	99	99
query41	64	60	61	60
query42	88	86	85	85
query43	321	323	278	278
query44	1430	767	776	767
query45	197	184	174	174
query46	1095	1233	764	764
query47	2415	2341	2226	2226
query48	401	419	286	286
query49	626	446	343	343
query50	1004	352	265	265
query51	4476	4418	4389	4389
query52	79	80	68	68
query53	251	268	190	190
query54	257	224	208	208
query55	72	69	66	66
query56	224	217	204	204
query57	1419	1423	1309	1309
query58	257	216	210	210
query59	1587	1655	1409	1409
query60	276	241	222	222
query61	159	147	150	147
query62	696	648	588	588
query63	232	193	190	190
query64	2587	825	627	627
query65	4903	4796	4792	4792
query66	1782	471	342	342
query67	29740	28990	29637	28990
query68	3286	1575	1003	1003
query69	411	295	261	261
query70	1069	950	971	950
query71	299	272	210	210
query72	3170	2601	2334	2334
query73	882	765	472	472
query74	5106	4946	4788	4788
query75	2643	2610	2261	2261
query76	2352	1186	772	772
query77	381	367	293	293
query78	12571	12521	11904	11904
query79	1390	1267	753	753
query80	576	487	379	379
query81	453	286	246	246
query82	568	155	121	121
query83	329	274	238	238
query84	305	148	114	114
query85	841	522	414	414
query86	380	311	290	290
query87	1825	1840	1763	1763
query88	3667	2798	2774	2774
query89	425	382	324	324
query90	1943	185	178	178
query91	173	158	137	137
query92	63	62	53	53
query93	1541	1522	945	945
query94	528	358	311	311
query95	691	468	350	350
query96	1081	847	351	351
query97	2716	2702	2595	2595
query98	213	209	199	199
query99	1191	1143	1019	1019
Total cold run time: 258565 ms
Total hot run time: 173022 ms

@github-actions github-actions Bot left a comment

Contributor

I found one concurrency issue in the submit-failure renew path.

Critical checkpoints:

  • Goal/test: the PR aims to protect routine-load task renew after submit failure with the job write lock. The new unit test proves the lock is held during unprotectRenewTask, but it does not cover concurrent job state transitions that remove the task before this lock is acquired.
  • Scope: the code change is narrow and follows the existing lock style for renew paths.
  • Concurrency/lifecycle: still problematic. A pause can clear routineLoadTaskInfoList between setOtherMsg() and the new locked renew, after which the renew adds a replacement task to a paused job.
  • Config/compatibility/persistence/observability: no new config, protocol, storage-format, or persistence changes were introduced.
  • Parallel paths: Kafka and Kinesis renew implementations both unconditionally add the replacement after remove(oldTask), so the state/member check needs to happen before calling them or inside a shared renew helper.
  • Test coverage: unit coverage was added for the lock contract, but coverage is missing for the concurrent pause/removal case.
  • User focus: no additional user-provided review focus.
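
The interleaving described under "Concurrency/lifecycle" can be replayed deterministically in a single thread. The sketch below is an illustration under stated assumptions: `PauseGapSketch`, `pause`, and `renewUnconditionally` are hypothetical names, and each step is individually synchronized to mirror the situation where every operation is locked on its own but nothing holds the lock across the gap.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the check-then-act gap; not the Doris classes.
class PauseGapSketch {
    enum State { RUNNING, PAUSED }

    State state = State.RUNNING;
    final List<String> tasks = new ArrayList<>();

    // A pause clears the task list, as the review comment describes.
    synchronized void pause() {
        state = State.PAUSED;
        tasks.clear();
    }

    // Mirrors a renew that removes the old task and always adds a replacement,
    // without looking at the job state or at whether the old task was still present.
    synchronized void renewUnconditionally(String oldTask) {
        tasks.remove(oldTask);
        tasks.add(oldTask + "-renewed");
    }

    // Replays the problematic order of events on one thread.
    static PauseGapSketch replay() {
        PauseGapSketch job = new PauseGapSketch();
        job.tasks.add("t1");
        // Scheduler: submit of t1 fails and the error is recorded (no lock held afterwards).
        job.pause();                    // Another thread pauses the job inside the gap.
        job.renewUnconditionally("t1"); // Scheduler resumes and renews the stale task.
        return job;
    }
}
```

After `replay()`, the job is `PAUSED` yet its task list contains `t1-renewed`, which is the outcome the review flags: locking the renew alone is not enough unless the state and membership checks happen under the same lock.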

Subagent conclusions: OPT-1 and TSC-1..TSC-4 were dismissed because they target files outside GitHub's authoritative PR 64731 patch. MAIN-1 is the only accepted inline comment. Convergence round 1 ended with both live subagents reporting NO_NEW_VALUABLE_FINDINGS for this ledger/comment set.

@github-actions

Contributor

Codex automated review failed and did not complete.

Error: Codex completed, but no new pull request review was submitted for the current head SHA.
Workflow run: https://github.com/apache/doris/actions/runs/28013234298

Please inspect the workflow logs and rerun the review after the underlying issue is resolved.

@hello-stephen

Contributor
ClickBench: Total hot run time: 25.21 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit ae2d5f7f8a0adba6b59afd51f363cf64d2e7ea80, data reload: false

query1	0.01	0.00	0.01
query2	0.10	0.05	0.05
query3	0.25	0.14	0.14
query4	1.61	0.14	0.14
query5	0.24	0.22	0.23
query6	1.26	1.07	1.08
query7	0.04	0.01	0.01
query8	0.06	0.04	0.04
query9	0.38	0.31	0.33
query10	0.58	0.55	0.53
query11	0.18	0.14	0.14
query12	0.22	0.15	0.14
query13	0.46	0.48	0.47
query14	1.00	1.01	1.03
query15	0.61	0.61	0.61
query16	0.31	0.32	0.31
query17	1.12	1.08	1.10
query18	0.24	0.21	0.21
query19	2.04	1.95	1.93
query20	0.02	0.01	0.02
query21	15.44	0.21	0.14
query22	4.96	0.06	0.05
query23	16.17	0.31	0.12
query24	2.92	0.44	0.31
query25	0.11	0.06	0.03
query26	0.72	0.21	0.15
query27	0.04	0.04	0.03
query28	3.51	0.91	0.57
query29	12.47	4.28	3.43
query30	0.27	0.14	0.14
query31	2.77	0.62	0.31
query32	3.22	0.60	0.49
query33	3.20	3.17	3.25
query34	15.56	4.16	3.54
query35	3.54	3.53	3.58
query36	0.54	0.42	0.41
query37	0.09	0.07	0.06
query38	0.04	0.04	0.03
query39	0.04	0.03	0.02
query40	0.18	0.16	0.17
query41	0.08	0.03	0.03
query42	0.04	0.03	0.03
query43	0.04	0.04	0.04
Total cold run time: 96.68 s
Total hot run time: 25.21 s

@github-actions

Contributor

Codex automated review failed and did not complete.

Error: Codex completed, but no new pull request review was submitted for the current head SHA.
Workflow run: https://github.com/apache/doris/actions/runs/28016690716

Please inspect the workflow logs and rerun the review after the underlying issue is resolved.

### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Routine load submit failures can renew a task directly from the scheduler after the task has begun a transaction. That path previously renewed the task without holding the job write lock, racing with scheduler idle-slot counting that reads the same task list. It could also race with a state transition that paused the job and cleared the task list between recording the submit error and renewing the task, allowing a replacement task to be queued for a non-running job. This change handles the submit-failure error update, resource-pressure delay, job state check, task membership check, and task renew decision while holding the job write lock. The scheduler only queues a replacement task when the job is still RUNNING and the failed task still belongs to the job.
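
A rough sketch of the guarded decision described in the summary, assuming a simplified job model. `GuardedRenewSketch` and `handleSubmitFailure` are hypothetical names, tasks are plain string ids, and the resource-pressure delay is left out; only the ordering (record error, check state, check membership, renew) under one write lock is taken from the description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical model; not the Doris RoutineLoadJob or scheduler.
class GuardedRenewSketch {
    enum State { RUNNING, PAUSED }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);
    private final List<String> tasks = new ArrayList<>();
    private State state = State.RUNNING;
    private String otherMsg = "";

    void addTask(String id) {
        lock.writeLock().lock();
        try {
            tasks.add(id);
        } finally {
            lock.writeLock().unlock();
        }
    }

    void pause() {
        lock.writeLock().lock();
        try {
            state = State.PAUSED;
            tasks.clear();
        } finally {
            lock.writeLock().unlock();
        }
    }

    int taskCount() {
        lock.readLock().lock();
        try {
            return tasks.size();
        } finally {
            lock.readLock().unlock();
        }
    }

    // Returns true only when a replacement task was queued.
    boolean handleSubmitFailure(String failedTask, String errorMsg) {
        lock.writeLock().lock();
        try {
            otherMsg = errorMsg;                 // Record the submit error.
            if (state != State.RUNNING) {
                return false;                    // Job was paused or stopped meanwhile.
            }
            if (!tasks.contains(failedTask)) {
                return false;                    // Task was removed by a state transition.
            }
            tasks.remove(failedTask);
            tasks.add(failedTask + "-renewed");  // Queue the replacement.
            return true;
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

Because every step runs inside the same critical section, a concurrent `pause()` either completes before the checks, in which case the renew is skipped, or waits until the replacement is in place and then clears it.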

### Release note

Fix routine load scheduler task renew race after submit failure.

### Check List (For Author)

- Test: Unit Test
    - ./run-fe-ut.sh --run org.apache.doris.load.routineload.RoutineLoadTaskSchedulerTest
    - git diff --check
- Behavior changed: Yes. Submit-failure renew now skips tasks that were removed by a state transition or belong to a non-running job.
- Does this need documentation: No
@sollhui sollhui force-pushed the fix_rl_task_null branch from ae2d5f7 to c12ecb8 Compare June 23, 2026 11:50
@sollhui

sollhui commented Jun 23, 2026

Contributor Author

/review

@sollhui

sollhui commented Jun 23, 2026

Contributor Author

run buildall

@hello-stephen

Contributor
TPC-H: Total hot run time: 28476 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit c12ecb8798511d6aca8dbac0c46bb244c2d41284, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17851	3983	3979	3979
q2	2063	310	184	184
q3	10258	1389	851	851
q4	4676	464	336	336
q5	7647	851	566	566
q6	197	168	134	134
q7	767	820	613	613
q8	10018	1323	1502	1323
q9	5991	4510	4527	4510
q10	6808	1802	1497	1497
q11	441	277	245	245
q12	666	424	286	286
q13	18188	3381	2781	2781
q14	290	257	241	241
q15	q16	781	763	704	704
q17	4158	1079	719	719
q18	6883	5715	5463	5463
q19	1765	1415	1063	1063
q20	500	397	268	268
q21	5789	2591	2415	2415
q22	434	349	298	298
Total cold run time: 106171 ms
Total hot run time: 28476 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4393	4241	4271	4241
q2	318	340	226	226
q3	4536	4943	4399	4399
q4	2044	2156	1358	1358
q5	4413	4267	4317	4267
q6	227	177	124	124
q7	2310	1873	1656	1656
q8	2578	2122	2154	2122
q9	8198	8309	8162	8162
q10	4843	4734	4499	4499
q11	561	415	376	376
q12	745	737	522	522
q13	3238	3579	3103	3103
q14	293	294	269	269
q15	q16	718	744	623	623
q17	1368	1363	1344	1344
q18	7807	7316	6840	6840
q19	1154	1088	1087	1087
q20	2219	2233	1937	1937
q21	5187	4523	4361	4361
q22	525	451	418	418
Total cold run time: 57675 ms
Total hot run time: 51934 ms

@github-actions github-actions Bot left a comment

Contributor

Review completed for PR #64731 at head c12ecb8798511d6aca8dbac0c46bb244c2d41284.

I did not find any new inline issues beyond the existing RoutineLoadTaskScheduler.java:277 thread. The current head addresses that thread's race by holding the job write lock while recording the submit error, checking non-resource-pressure failures, verifying the job is still RUNNING, verifying the failed task still belongs to the job, and only then calling unprotectRenewTask.

Critical checkpoint conclusions: the goal is clear and narrow; the change is focused; the affected path is concurrent and now uses the same job write lock protecting routineLoadTaskInfoList; no new configuration, persistence format, or FE/BE protocol is introduced; both Kafka and Kinesis renew implementations are covered through the common scheduler path; and the added unit tests cover locked renew, removed-task skip, and paused-job skip. I could not run local FE UT because thirdparty/installed and thirdparty/installed/bin/protoc are missing in this checkout; GitHub CheckStyle is green and FE UT is still pending externally.
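
One way a "locked renew" test could assert the lock contract is to have the renew step record whether the calling thread holds the write lock. The sketch below is an assumption about the shape of such a check, not the test added in this PR: `LockContractSketch`, `Job`, and `renewWithLock` are hypothetical, and the real test may use mocks instead. `ReentrantReadWriteLock.isWriteLockedByCurrentThread()` and `isWriteLocked()` are standard JDK methods.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-ins; names do not come from the Doris code base.
class LockContractSketch {
    static final class Job {
        final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);
        final AtomicBoolean lockHeldDuringRenew = new AtomicBoolean(false);

        // Stand-in for an "unprotected" renew step: it takes no lock itself
        // and records whether the calling thread holds the write lock.
        void unprotectRenew() {
            lockHeldDuringRenew.set(lock.isWriteLockedByCurrentThread());
        }
    }

    // Code under test: the caller is responsible for locking around the renew.
    static void renewWithLock(Job job) {
        job.lock.writeLock().lock();
        try {
            job.unprotectRenew();
        } finally {
            job.lock.writeLock().unlock();
        }
    }

    // True when the renew ran under the write lock and the lock was released afterwards.
    static boolean lockContractHolds() {
        Job job = new Job();
        renewWithLock(job);
        return job.lockHeldDuringRenew.get() && !job.lock.isWriteLocked();
    }
}
```

Calling `unprotectRenew()` directly, without `renewWithLock`, leaves `lockHeldDuringRenew` false, so the same probe would catch a regression that drops the lock.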

User focus: no additional user-provided focus points were supplied.

Subagent conclusions: optimizer-rewrite reported no candidates, and tests-session-config reported no candidates. No subagent candidate became an inline comment, and convergence round 1 ended with both live subagents replying NO_NEW_VALUABLE_FINDINGS for the final ledger/comment set.

@hello-stephen

Contributor
TPC-DS: Total hot run time: 171854 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit c12ecb8798511d6aca8dbac0c46bb244c2d41284, data reload: false

query5	4337	646	469	469
query6	448	186	167	167
query7	4866	558	305	305
query8	378	219	193	193
query9	8727	3991	3979	3979
query10	436	310	255	255
query11	5940	2317	2139	2139
query12	173	100	105	100
query13	1288	616	419	419
query14	6394	5357	5004	5004
query14_1	4358	4345	4337	4337
query15	205	194	174	174
query16	1051	451	405	405
query17	1105	667	557	557
query18	2676	463	336	336
query19	197	177	138	138
query20	111	116	106	106
query21	217	136	120	120
query22	13589	13629	13403	13403
query23	17339	16486	16103	16103
query23_1	16182	16305	16169	16169
query24	7703	1801	1318	1318
query24_1	1330	1334	1322	1322
query25	561	468	401	401
query26	1292	330	167	167
query27	2716	597	357	357
query28	4504	2010	2009	2009
query29	1087	635	503	503
query30	313	238	199	199
query31	1117	1081	988	988
query32	102	62	61	61
query33	541	328	266	266
query34	1325	1133	655	655
query35	764	796	688	688
query36	1359	1459	1253	1253
query37	153	107	92	92
query38	1920	1736	1652	1652
query39	933	907	888	888
query39_1	885	857	873	857
query40	241	127	101	101
query41	71	68	68	68
query42	88	89	94	89
query43	321	328	285	285
query44	1460	769	771	769
query45	200	186	181	181
query46	1057	1255	742	742
query47	2403	2363	2220	2220
query48	405	410	298	298
query49	657	484	361	361
query50	1108	376	266	266
query51	4413	4468	4348	4348
query52	82	84	72	72
query53	251	263	192	192
query54	313	229	216	216
query55	74	70	68	68
query56	247	245	225	225
query57	1446	1417	1308	1308
query58	254	215	215	215
query59	1669	1735	1421	1421
query60	284	238	231	231
query61	148	139	163	139
query62	694	644	578	578
query63	230	188	187	187
query64	2481	735	605	605
query65	4806	4741	4767	4741
query66	1726	481	339	339
query67	29754	29699	29561	29561
query68	3155	1518	902	902
query69	410	290	253	253
query70	1122	1010	968	968
query71	330	226	209	209
query72	2982	2604	2291	2291
query73	838	750	462	462
query74	5120	4948	4742	4742
query75	2628	2599	2222	2222
query76	2336	1170	805	805
query77	356	376	282	282
query78	12322	12616	11843	11843
query79	1501	1185	804	804
query80	1262	457	385	385
query81	518	287	241	241
query82	688	156	116	116
query83	314	279	242	242
query84	262	147	112	112
query85	953	558	409	409
query86	546	290	287	287
query87	1843	1828	1762	1762
query88	3896	2767	2747	2747
query89	459	371	329	329
query90	1917	177	185	177
query91	171	160	132	132
query92	63	62	56	56
query93	1654	1493	892	892
query94	919	339	310	310
query95	691	455	335	335
query96	1081	768	359	359
query97	2686	2673	2567	2567
query98	212	223	196	196
query99	1210	1154	1044	1044
Total cold run time: 260499 ms
Total hot run time: 171854 ms

@hello-stephen

Contributor
ClickBench: Total hot run time: 25.08 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit c12ecb8798511d6aca8dbac0c46bb244c2d41284, data reload: false

query1	0.01	0.01	0.01
query2	0.10	0.05	0.05
query3	0.29	0.14	0.14
query4	1.61	0.14	0.13
query5	0.23	0.22	0.22
query6	1.25	1.13	1.10
query7	0.04	0.01	0.01
query8	0.09	0.03	0.04
query9	0.38	0.30	0.31
query10	0.54	0.54	0.54
query11	0.20	0.14	0.13
query12	0.19	0.15	0.13
query13	0.48	0.47	0.48
query14	1.02	1.02	1.00
query15	0.62	0.59	0.60
query16	0.33	0.31	0.33
query17	1.12	1.09	1.07
query18	0.22	0.22	0.21
query19	1.99	1.88	1.96
query20	0.01	0.02	0.01
query21	15.45	0.22	0.14
query22	4.88	0.06	0.06
query23	16.05	0.32	0.11
query24	3.02	0.41	0.35
query25	0.17	0.07	0.04
query26	0.77	0.21	0.14
query27	0.05	0.04	0.04
query28	3.54	0.90	0.52
query29	12.52	4.31	3.44
query30	0.28	0.15	0.15
query31	2.77	0.58	0.31
query32	3.23	0.60	0.48
query33	3.26	3.17	3.29
query34	15.54	4.22	3.50
query35	3.53	3.51	3.55
query36	0.56	0.42	0.42
query37	0.08	0.07	0.06
query38	0.08	0.04	0.04
query39	0.04	0.03	0.03
query40	0.17	0.16	0.15
query41	0.08	0.03	0.03
query42	0.04	0.03	0.03
query43	0.04	0.03	0.04
Total cold run time: 96.87 s
Total hot run time: 25.08 s

@liaoxin01 liaoxin01 left a comment

Contributor

LGTM

@github-actions github-actions Bot added the "approved" label (Indicates a PR has been approved by one committer.) Jun 23, 2026
@github-actions

Contributor

PR approved by at least one committer and no changes requested.

@hello-stephen

Contributor

FE Regression Coverage Report

Increment line coverage 50.00% (5/10) 🎉
Increment coverage report
Complete coverage report

@github-actions

Contributor

PR approved by anyone and no changes requested.

@sollhui sollhui merged commit e5f3bad into apache:master Jun 24, 2026
36 checks passed
@sollhui sollhui deleted the fix_rl_task_null branch June 24, 2026 02:58

3 participants