[fix](job) lock routine load task renew on submit failure#64731
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
/review |
|
run buildall |
8cd389b to
ae2d5f7
Compare
|
run buildall |
|
/review |
TPC-H: Total hot run time: 29139 ms |
TPC-DS: Total hot run time: 173022 ms |
There was a problem hiding this comment.
I found one concurrency issue in the submit-failure renew path.
Critical checkpoints:
- Goal/test: the PR aims to protect routine-load task renew after submit failure with the job write lock. The new unit test proves the lock is held during
unprotectRenewTask, but it does not cover concurrent job state transitions that remove the task before this lock is acquired. - Scope: the code change is narrow and follows the existing lock style for renew paths.
- Concurrency/lifecycle: still problematic. A pause can clear
routineLoadTaskInfoListbetweensetOtherMsg()and the new locked renew, after which the renew adds a replacement task to a paused job. - Config/compatibility/persistence/observability: no new config, protocol, storage-format, or persistence changes were introduced.
- Parallel paths: Kafka and Kinesis renew implementations both unconditionally add the replacement after
remove(oldTask), so the state/member check needs to happen before calling them or inside a shared renew helper. - Test coverage: unit coverage was added for the lock contract, but coverage is missing for the concurrent pause/removal case.
- User focus: no additional user-provided review focus.
Subagent conclusions: OPT-1 and TSC-1..TSC-4 were dismissed because they target files outside GitHub's authoritative PR 64731 patch. MAIN-1 is the only accepted inline comment. Convergence round 1 ended with both live subagents reporting NO_NEW_VALUABLE_FINDINGS for this ledger/comment set.
|
Codex automated review failed and did not complete. Error: Codex completed, but no new pull request review was submitted for the current head SHA. Please inspect the workflow logs and rerun the review after the underlying issue is resolved. |
ClickBench: Total hot run time: 25.21 s |
|
Codex automated review failed and did not complete. Error: Codex completed, but no new pull request review was submitted for the current head SHA. Please inspect the workflow logs and rerun the review after the underlying issue is resolved. |
### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: Routine load submit failures can renew a task directly from the scheduler after the task has begun a transaction. That path previously renewed the task without holding the job write lock, racing with scheduler idle-slot counting that reads the same task list. It could also race with a state transition that paused the job and cleared the task list between recording the submit error and renewing the task, allowing a replacement task to be queued for a non-running job. This change handles the submit-failure error update, resource-pressure delay, job state check, task membership check, and task renew decision while holding the job write lock. The scheduler only queues a replacement task when the job is still RUNNING and the failed task still belongs to the job.
### Release note
Fix routine load scheduler task renew race after submit failure.
### Check List (For Author)
- Test: Unit Test
- ./run-fe-ut.sh --run org.apache.doris.load.routineload.RoutineLoadTaskSchedulerTest
- git diff --check
- Behavior changed: Yes. Submit-failure renew now skips tasks that were removed by a state transition or belong to a non-running job.
- Does this need documentation: No
ae2d5f7 to
c12ecb8
Compare
|
/review |
|
run buildall |
TPC-H: Total hot run time: 28476 ms |
There was a problem hiding this comment.
Review completed for PR #64731 at head c12ecb8798511d6aca8dbac0c46bb244c2d41284.
I did not find any new inline issues beyond the existing RoutineLoadTaskScheduler.java:277 thread. The current head addresses that thread's race by holding the job write lock while recording the submit error, checking non-resource-pressure failures, verifying the job is still RUNNING, verifying the failed task still belongs to the job, and only then calling unprotectRenewTask.
Critical checkpoint conclusions: the goal is clear and narrow; the change is focused; the affected path is concurrent and now uses the same job write lock protecting routineLoadTaskInfoList; no new configuration, persistence format, or FE/BE protocol is introduced; both Kafka and Kinesis renew implementations are covered through the common scheduler path; and the added unit tests cover locked renew, removed-task skip, and paused-job skip. I could not run local FE UT because thirdparty/installed and thirdparty/installed/bin/protoc are missing in this checkout; GitHub CheckStyle is green and FE UT is still pending externally.
User focus: no additional user-provided focus points were supplied.
Subagent conclusions: optimizer-rewrite reported no candidates, and tests-session-config reported no candidates. No subagent candidate became an inline comment, and convergence round 1 ended with both live subagents replying NO_NEW_VALUABLE_FINDINGS for the final ledger/comment set.
TPC-DS: Total hot run time: 171854 ms |
ClickBench: Total hot run time: 25.08 s |
|
PR approved by at least one committer and no changes requested. |
FE Regression Coverage ReportIncrement line coverage |
|
PR approved by anyone and no changes requested. |
What problem does this PR solve?
Routine load submit failures can renew a task directly from the scheduler after the task has begun a transaction. That path mutates the job's
routineLoadTaskInfoListwithout holding the job write lock, racing with scheduler idle-slot counting that reads the same list. This PR protects the submit-failure renew path with the job write lock, matching the existing timeout and transaction-status renew paths, and adds unit coverage for the locking behavior.Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)