fix(qwen3_5): add linear_attn entries to base_model_tp_plan #46847

Open
muhamedfazalps wants to merge 2 commits into huggingface:main from muhamedfazalps:fix/qwen3-5-tp-plan

Conversation

@muhamedfazalps

Fixes #46846

Qwen3.5 uses linear_attention in ~75% of its decoder layers. base_model_tp_plan had no linear_attn entries, so those weights stayed unsharded, causing OOM at TP>1.

Fix: add colwise_gather_output entries for in_proj_qkv, in_proj_z, in_proj_b, in_proj_a, and out_proj.

colwise_gather_output shards the weights across ranks and all-gathers the activations before the depthwise Conv1d, which needs the full channel dimension.
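As a rough illustration of the entries described above, the following sketch builds the five mappings and checks which module names they would cover. The key pattern `layers.*.linear_attn.<projection>` is an assumption modelled on how tensor-parallel plans are usually keyed, and `plan_style_for` is a hypothetical helper written for this example, not a library function; only the projection names and the `colwise_gather_output` style come from this PR.

```python
from fnmatch import fnmatchcase

# Projection names as listed in the PR description.
LINEAR_ATTN_PROJECTIONS = ["in_proj_qkv", "in_proj_z", "in_proj_b", "in_proj_a", "out_proj"]

# ASSUMPTION: keys follow the "layers.*.<submodule>.<projection>" glob pattern.
linear_attn_tp_plan = {
    f"layers.*.linear_attn.{name}": "colwise_gather_output"
    for name in LINEAR_ATTN_PROJECTIONS
}


def plan_style_for(module_name, plan):
    """Hypothetical helper: return the style of the first pattern matching module_name, else None."""
    for pattern, style in plan.items():
        if fnmatchcase(module_name, pattern):
            return style
    return None


# A linear-attention projection is covered; a self_attn projection is not matched by these entries.
covered = plan_style_for("layers.3.linear_attn.in_proj_qkv", linear_attn_tp_plan)
uncovered = plan_style_for("layers.3.self_attn.q_proj", linear_attn_tp_plan)
```

A module whose name matches no pattern gets no sharding style from these entries, which is the situation the linear_attn weights were in before this change.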

Fixes huggingface#46846

Qwen3.5 is a hybrid model where ~75% of decoder layers use
linear_attention (Qwen3_5GatedDeltaNet). The base_model_tp_plan
only covered self_attn and mlp layers, leaving linear_attn weights
unsharded. This caused OOM at TP>1 and a RuntimeError during generate.

Added colwise_gather_output entries for all linear_attn projections:
- in_proj_qkv, in_proj_z, in_proj_b, in_proj_a, out_proj

colwise_gather_output shards weights across ranks (fixing OOM) and
all-gathers activations before the depthwise Conv1d, which requires
the full channel dimension.
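The shard-then-gather behaviour described in this commit message can be simulated in a single process. The sketch below is plain Python with no distributed runtime; the helper names (`matmul`, `shard_columns`, `all_gather_columns`) and the toy matrices are hypothetical and chosen only for illustration. It splits a weight matrix along its output (column) dimension, lets each simulated rank compute its slice, and concatenates the slices so that the next operation sees every channel.

```python
def matmul(x, w):
    """Multiply x ([batch][in]) by w ([in][out]) and return [batch][out]."""
    return [
        [sum(xi * w[k][j] for k, xi in enumerate(row)) for j in range(len(w[0]))]
        for row in x
    ]


def shard_columns(w, world_size):
    """Split the output (column) dimension of w evenly across simulated ranks."""
    out_features = len(w[0])
    assert out_features % world_size == 0, "output dim must divide evenly"
    step = out_features // world_size
    return [[row[r * step:(r + 1) * step] for row in w] for r in range(world_size)]


def all_gather_columns(partials):
    """Concatenate per-rank outputs along the channel dimension."""
    return [sum((p[i] for p in partials), []) for i in range(len(partials[0]))]


x = [[1.0, 2.0], [3.0, 4.0]]                       # toy activations
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]   # toy weight, 4 output channels

shards = shard_columns(w, world_size=2)        # each rank stores half of the columns
partials = [matmul(x, s) for s in shards]      # each rank computes 2 of the 4 channels
gathered = all_gather_columns(partials)        # full channel dimension restored
```

Each simulated rank holds only half of the weight, which is where the memory saving comes from, while the gathered result matches the unsharded product, so a channel-wise operation such as the depthwise Conv1d receives the same input it would without sharding.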
…lar file

The check_modular_conversion CI step compares the generated configuration
(from modular_qwen3_5.py) with configuration_qwen3_5.py. The modular file
includes linear_attn entries in base_model_tp_plan but the configuration
file was missing them, causing the consistency check to fail.
@github-actions

Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: qwen3_5

@github-actions

Contributor

CI Dashboard: View test results in Grafana

@Rocketknight1

Member

cc @vasqu since you were replying in the original issue!

Development

Successfully merging this pull request may close these issues.

[Qwen3.5] Missing linear_attn entries in base_model_tp_plan causes OOM and shape error at TP>1

2 participants