feat: support vllm inference for GGUF models#2020

Draft
zhehli688 wants to merge 1 commit into kaito-project:main from zhehli688:gguf

Conversation

@zhehli688

@zhehli688 zhehli688 commented May 4, 2026

Collaborator

Reason for Change:

GGUF (GPT-Generated Unified Format) is a binary file format for storing large language models (LLMs), introduced by the llama.cpp project and designed for efficient local inference. It currently has experimental support in vLLM: https://docs.vllm.ai/en/stable/features/quantization/gguf/. Caveat: there is an ongoing discussion about whether to keep GGUF support in vLLM: vllm-project/vllm#39583.

This PR enables GGUF model deployment in Kaito through the non-catalog path (model catalog support for GGUF models will be added in another PR). Changes include:

  • Added preset options hfConfigPath and tokenizer and passed them to vLLM when available. hfConfigPath supports a user-provided path to the HF model config and is used when a GGUF model is missing config.json. tokenizer supports a user-provided path to the base model's tokenizer, as suggested by vLLM (https://docs.vllm.ai/en/stable/features/quantization/gguf/).
  • Allowed quantization type of GGUF models to be provided as model name suffix, e.g. Qwen/Qwen3-4B-GGUF:Q4_K_M. This is consistent with how model name and quantization type should be passed to vLLM.
  • Refactored GetModelByName to preserve the original case of the model repo ID for GGUF models (i.e. Qwen/Qwen3-4B-GGUF:Q4_K_M instead of qwen/qwen3-4b-gguf:q4_k_m) and avoid model resolution failures (this may be a bug in vLLM, given that GGUF support is experimental).
  • Updated _resolve_processor() in benchmark_entrypoint.py to scan default HF cache as well. This is needed when the original HF repo doesn't contain tokenizer.json (e.g. unsloth/Qwen3-8B-GGUF) and the model needs to use a tokenizer from another repo. Since benchmark_entrypoint.py is part of Kaito's base image, we need to bump base image version to 0.3.1.

Requirements

  • added unit tests and e2e tests (if applicable).
  • tested deployment of models Qwen/Qwen3-4B-GGUF:Q4_K_M (missing config.json) and unsloth/Qwen3-8B-GGUF:Q4_K_M:
Screenshot 2026-05-01 at 4 37 41 PM

Issue Fixed:

Fixes #2002

Notes for Reviewers:

@kaito-pr-agent

kaito-pr-agent Bot commented May 4, 2026

Contributor

Title

Support vLLM inference for GGUF models with config and tokenizer options


Description

  • Enable GGUF model deployment in KAITO via vLLM

  • Add hfConfigPath and tokenizer preset options for GGUF

  • Support quantization type suffix in model name

  • Update CRDs and controllers to propagate new options


Changes walkthrough 📝

Relevant files
Enhancement
11 files
generator.go | Added GGUF parsing and generation logic | +102/-7
vllm_model.go | Updated model retrieval functions with new options | +22/-14
workspace_types.go | Added HFConfigPath and Tokenizer fields to PresetOptions | +10/-0
interfaces.go | Added HFConfigPath and Tokenizer to ModelProfile | +5/-0
estimator.go | Updated estimator to pass new options | +1/-1
helper.go | Updated helper to populate new options | +4/-2
workspace_controller.go | Updated controller to pass new options | +2/-2
workspace_validation.go | Updated validation to pass new options | +8/-2
inference_config_validation.go | Updated validation to pass new options | +1/-1
node_claim.go | Updated resource logic to pass new options | +1/-1
interface.go | Updated inference command build logic | +8/-4
Tests
4 files
generator_test.go | Added unit tests for GGUF support | +173/-12
preset_vllm_test.go | Added e2e test for GGUF workspace creation | +40/-0
preset_test.go | Added GGUF model constant | +1/-0
vllm_model_test.go | Updated model tests with new function signatures | +10/-10
Documentation
1 files
kaito_workspace_qwen_3_4b_gguf.yaml | Added example YAML for GGUF workspace | +15/-0
Configuration changes
4 files
kaito.sh_workspaces.yaml | Updated CRD schema for new preset options | +24/-0
kaito.sh_workspaces.yaml | Updated CRD schema for new preset options | +24/-0
kaito.sh_inferencesets.yaml | Updated CRD schema for new preset options | +12/-0
kaito.sh_inferencesets.yaml | Updated CRD schema for new preset options | +12/-0

  • @kaito-pr-agent

    kaito-pr-agent Bot commented May 4, 2026

    Contributor

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
    🧪 PR contains tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Case Sensitivity Inconsistency

    The PR preserves original case for GGUF models but lowercases non-GGUF models in NewGenerator(). This creates inconsistent behavior where GGUF model registry keys differ from non-GGUF models. Verify that GetModelByName() lookup logic handles both cases correctly and that existing non-GGUF models continue to work with their lowercase registry keys.

    // Lowercase non-GGUF repo IDs for consistent registry keys and prefix matching.
    // Preserve original case for GGUF models to avoid model resolvement issues on vLLM.
    if !isGGUF {
    	modelRepo = strings.ToLower(modelRepo)
    }
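
The concern above can be illustrated with a small sketch that separates the two roles a model name plays: the string handed to vLLM and the key used for registry lookup. normalizeRepo mirrors the quoted condition; registryKey is a hypothetical helper showing one way to keep lookups case-insensitive for both kinds of model. Neither is claimed to be the PR's exact code.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeRepo mirrors the quoted rule: lowercase non-GGUF repo IDs,
// keep GGUF repo IDs exactly as the user typed them.
func normalizeRepo(modelRepo string, isGGUF bool) string {
	if !isGGUF {
		return strings.ToLower(modelRepo)
	}
	return modelRepo
}

// registryKey is a hypothetical helper: the lookup key is always lowercase,
// regardless of whether the repo ID passed to vLLM keeps its original case.
func registryKey(modelRepo string) string {
	return strings.ToLower(modelRepo)
}

func main() {
	fmt.Println(normalizeRepo("Microsoft/Phi-4", false))
	fmt.Println(normalizeRepo("Qwen/Qwen3-4B-GGUF", true))
	fmt.Println(registryKey("Qwen/Qwen3-4B-GGUF"))
}
```

With this split, a GGUF model would be found under a lowercase key while vLLM still receives the case-preserved repo ID, which is the property the review asks to verify.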

    Function Signature Changes

    GetModelByName() and GetModelByNameWithToken() now have additional parameters (hfConfigPath, tokenizer). While tests appear updated, verify all production call sites in workspace_controller.go, workspace_validation.go, and inference_config_validation.go are passing these parameters correctly. Missing parameters could cause GGUF models to fail config loading.

    func GetModelByNameWithToken(ctx context.Context, modelName, token, hfConfigPath, tokenizer string) (model.Model, error) {
    	lookupKey := strings.ToLower(modelName)
    	// Redirect legacy preset names (e.g. "phi-4") to their full HuggingFace
    	// model ID (e.g. "microsoft/phi-4").
    	if hfName, ok := plugin.LegacyBuiltinToCatalog[lookupKey]; ok {
    		modelName = hfName
    		lookupKey = strings.ToLower(hfName)
    	}
    	if m := plugin.KaitoModelRegister.MustGet(lookupKey); m != nil {
    		return m, nil
    	}
    	if strings.Contains(modelName, "/") {
    		return generateHuggingFaceModel(modelName, token, hfConfigPath, tokenizer)
    	}
    	return nil, fmt.Errorf("model is not registered: %s", modelName)
    }
    
    // GetModelByName returns a vLLM-compatible model for the given modelName.
    // If the modelName contains a "/", it fetches an access token from the
    // Kubernetes Secret identified by secretName and secretNamespace,
    // then generates a preset for the corresponding HuggingFace model.
    // Prefer GetModelByNameWithToken when the token has already been resolved by the caller.
    func GetModelByName(ctx context.Context, modelName, secretName, secretNamespace, hfConfigPath, tokenizer string, kubeClient client.Client) (model.Model, error) {
    	lookupKey := strings.ToLower(modelName)
    	// Redirect legacy preset names (e.g. "phi-4") to their full HuggingFace
    	// model ID (e.g. "microsoft/phi-4").
    	if hfName, ok := plugin.LegacyBuiltinToCatalog[lookupKey]; ok {
    		modelName = hfName
    		lookupKey = strings.ToLower(hfName)
    	}
    	if m := plugin.KaitoModelRegister.MustGet(lookupKey); m != nil {
    		return m, nil
    	}
    	if !strings.Contains(modelName, "/") {
    		return nil, fmt.Errorf("model is not registered: %s", modelName)
    	}
    	klog.InfoS("Generating VLLM model preset for HuggingFace model", "model", modelName, "secretName", secretName, "secretNamespace", secretNamespace)
    	token, err := GetHFTokenFromSecret(ctx, kubeClient, secretName, secretNamespace)
    	if err != nil {
    		// only log the error here since token may not be required for public models
    		klog.ErrorS(err, "failed to get huggingface token from secret", "secretName", secretName, "secretNamespace", secretNamespace)
    	}
    	return generateHuggingFaceModel(modelName, token, hfConfigPath, tokenizer)

    GGUF Model Name Parsing Edge Cases

    The colon parsing logic (lines 223-230) assumes quant type comes after the last colon. Verify this handles edge cases like model repos with colons in other contexts, or quant types that might contain special characters. Also confirm the condition strings.Contains(modelRepo[:idx], "/") correctly identifies repo_id:quant format vs other URL schemes.

    if idx := strings.LastIndex(modelRepo, ":"); idx > 0 {
    	// Only split if the colon is after a "/" (i.e., it's repo_id:quant, not a scheme)
    	if strings.Contains(modelRepo[:idx], "/") {
    		quantType = modelRepo[idx+1:]
    		modelRepo = modelRepo[:idx]
    		isGGUF = true
    	}
    }
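
A standalone sketch makes the edge cases easy to probe. splitQuantSuffix is a hypothetical wrapper around the same two conditions as the excerpt (last colon, with a "/" somewhere before it); any additional checks the real generator performs outside the excerpt are not modelled here.

```go
package main

import (
	"fmt"
	"strings"
)

// splitQuantSuffix follows the quoted rule: split at the last colon, but only
// when the part before it contains a "/" (so it looks like repo_id:quant).
func splitQuantSuffix(name string) (repo, quant string, isGGUF bool) {
	if idx := strings.LastIndex(name, ":"); idx > 0 && strings.Contains(name[:idx], "/") {
		return name[:idx], name[idx+1:], true
	}
	return name, "", false
}

func main() {
	inputs := []string{
		"Qwen/Qwen3-4B-GGUF:Q4_K_M", // the intended repo_id:quant form
		"microsoft/phi-4",           // no colon: left untouched
		"phi-4:latest",              // colon but no "/" before it: left untouched
		"org/repo:",                 // trailing colon: split with an empty quant type
		"https://host:8080/x",       // URL with a port: the last colon follows a "/"
	}
	for _, in := range inputs {
		repo, quant, isGGUF := splitQuantSuffix(in)
		fmt.Printf("%q -> repo=%q quant=%q gguf=%v\n", in, repo, quant, isGGUF)
	}
}
```

Under these two conditions alone, a trailing colon is treated as GGUF with an empty quant type, and a URL containing a port is split at the port. If such inputs can reach this code path, an explicit check for a non-empty quant type may be worth considering.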

    @kaito-pr-agent

    kaito-pr-agent Bot commented May 4, 2026

    Contributor

    PR Code Suggestions ✨

    Explore these optional code suggestions:

    Category | Suggestion | Impact
    Possible issue
    Fix sharded GGUF model support by returning all matching files

    The current implementation returns only the first matching GGUF file, which will
    fail for sharded models where multiple files match the quant type. Modify the
    function to collect and return all matching files to ensure complete model loading
    and accurate size calculation.

    presets/workspace/generator/generator.go [396-405]

     func (g *Generator) selectGGUFFile(ggufs []FileInfo) []FileInfo {
         quantUpper := strings.ToUpper(g.QuantType)
    +    var selected []FileInfo
         for _, f := range ggufs {
             fileUpper := strings.ToUpper(f.Path)
             if strings.Contains(fileUpper, quantUpper) {
    -            return []FileInfo{f}
    +            selected = append(selected, f)
             }
         }
    -    return nil
    +    if len(selected) == 0 {
    +        return nil
    +    }
    +    return selected
     }
    Suggestion importance[1-10]: 8

    __

    Why: The current implementation returns only the first matching GGUF file, which causes incorrect model size calculation and potential download failures for sharded GGUF models. The suggestion correctly modifies selectGGUFFile to collect all matching files, ensuring consistency with how other model formats (e.g., safetensors) are handled in selectWeightFiles.

    Medium
    General
    Add regex pattern validation for HuggingFace repo ID format

    Add a pattern constraint to the hfConfigPath and tokenizer fields in the CRD schema
    to ensure they follow the HuggingFace repo ID format (e.g., org/model). This
    prevents invalid values from being accepted at the API level.

    config/crd/bases/kaito.sh_workspaces.yaml [499-504]

     hfConfigPath:
       description: |-
         HFConfigPath specifies an alternate HuggingFace repo ID to fetch config.json from.
         This is needed for GGUF model repos that do not include a config.json file.
         When set, --hf-config-path is passed to vLLM at runtime. Example: "Qwen/Qwen3-4B".
    +  pattern: "^[^/]+/[^/]+$"
       type: string
    Suggestion importance[1-10]: 6

    __

    Why: Adding a pattern constraint improves input validation for CRD fields, preventing invalid repo IDs. However, it is an enhancement rather than a critical fix, and validation logic is already handled in the Go code.

    Low
    Add regex pattern validation to InferenceSet CRD fields

    Apply the same pattern constraint to the hfConfigPath and tokenizer fields in the
    InferenceSet CRD schema to maintain consistency with the Workspace CRD and ensure
    input validation.

    config/crd/bases/kaito.sh_inferencesets.yaml [190-195]

     hfConfigPath:
       description: |-
         HFConfigPath specifies an alternate HuggingFace repo ID to fetch config.json from.
         This is needed for GGUF model repos that do not include a config.json file.
         When set, --hf-config-path is passed to vLLM at runtime. Example: "Qwen/Qwen3-4B".
    +  pattern: "^[^/]+/[^/]+$"
       type: string
    Suggestion importance[1-10]: 6

    __

    Why: Consistency with the Workspace CRD validation is good practice to ensure input quality across both resources, though it is not strictly required for the PR's core functionality.

    Low
    Add trailing newline to example YAML file

    Ensure the example YAML file ends with a newline character. This is a standard
    convention for text files and prevents issues with some tools that expect a trailing
    newline.

    examples/inference/kaito_workspace_qwen_3_4b_gguf.yaml [10-15]

    +inference:
    +  preset:
    +    name: "Qwen/Qwen3-4B-GGUF:Q4_K_M"
    +    presetOptions:
    +      hfConfigPath: "Qwen/Qwen3-4B"
    +      tokenizer: "Qwen/Qwen3-4B"
     
    -
    Suggestion importance[1-10]: 2

    __

    Why: Ensuring a trailing newline is a standard text file convention, but it has minimal impact on functionality or correctness compared to other changes in the PR.

    Low
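
To see what the suggested pattern would accept, the sketch below evaluates it with Go's regexp package as a stand-in for CRD schema validation. The sample values are illustrative only.

```go
package main

import (
	"fmt"
	"regexp"
)

// repoIDPattern is the pattern proposed in the suggestions above.
var repoIDPattern = regexp.MustCompile(`^[^/]+/[^/]+$`)

func main() {
	samples := []string{
		"Qwen/Qwen3-4B",        // org/model: accepted
		"Qwen3-4B",             // no org segment: rejected
		"org/model/extra",      // more than two segments: rejected
		"/models/qwen3/config", // absolute local path: rejected
		"org/",                 // empty model segment: rejected
	}
	for _, s := range samples {
		fmt.Printf("%q -> %v\n", s, repoIDPattern.MatchString(s))
	}
}
```

The pattern admits exactly two non-empty segments separated by a single slash. If hfConfigPath or tokenizer are also meant to accept local filesystem paths, as the PR description's wording ("user-provided path") might suggest, this constraint would reject them.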

    @andyzhangx andyzhangx requested a review from Copilot May 10, 2026 02:04

    Copilot AI left a comment

    Contributor

    Pull request overview

    This PR adds support for deploying GGUF-quantized models with the vLLM runtime via the non-catalog (HuggingFace repo ID) preset path, including optional user-provided tokenizer/config sources to handle GGUF repos that lack tokenizer.json or config.json.

    Changes:

    • Extend preset options and model generation to support repo_id:quant_type GGUF model identifiers, plus optional hfConfigPath and tokenizer overrides passed through to vLLM.
    • Update benchmark processor resolution to also scan the default HuggingFace cache (requires base image tag bump to 0.3.1).
    • Add/adjust unit and e2e tests plus an example Workspace manifest for GGUF.

    Reviewed changes

    Copilot reviewed 23 out of 23 changed files in this pull request and generated 4 comments.

    File | Description
    test/e2e/preset_vllm_test.go | Adds an e2e scenario for a GGUF model preset and sets tokenizer preset option.
    test/e2e/preset_test.go | Adds GGUF preset constant and strips :quant suffix when deriving served model name in tests.
    presets/workspace/models/vllm_model.go | Threads hfConfigPath/tokenizer through model resolution and generator options; normalizes registry key casing.
    presets/workspace/models/vllm_model_test.go | Updates tests for new GetModelByName* signatures and generator invocation.
    presets/workspace/models/supported_models.yaml | Bumps base image tag to 0.3.1 and documents tag history note.
    presets/workspace/inference/vllm/benchmark_entrypoint.py | Updates processor resolution to scan both /workspace/weights and default HF cache.
    presets/workspace/generator/generator.go | Adds GGUF file detection, quant suffix parsing, GGUF vLLM params, and hfConfigPath config fallback; introduces options struct API.
    presets/workspace/generator/generator_test.go | Adds GGUF-focused unit tests and updates expected normalized repo casing.
    pkg/workspace/resource/node_claim.go | Passes hfConfigPath/tokenizer into preset lookup when sizing OS disk.
    pkg/workspace/estimator/nodesestimator/estimator.go | Passes hfConfigPath/tokenizer into model lookup for node estimation.
    pkg/workspace/estimator/interfaces.go | Extends ModelProfile to include HFConfigPath and Tokenizer.
    pkg/workspace/controllers/workspace_controller.go | Passes hfConfigPath/tokenizer through model resolution for tuning/inference reconciliation.
    pkg/utils/workspace/helper.go | Includes HFConfigPath/Tokenizer in node estimation requests derived from Workspace.
    pkg/model/interface.go | Avoids overwriting generator-provided --model (needed for GGUF repo_id:quant_type).
    examples/inference/kaito_workspace_qwen_3_4b_gguf.yaml | Adds an example Workspace manifest using GGUF + hfConfigPath + tokenizer.
    config/crd/bases/kaito.sh_workspaces.yaml | Adds CRD schema for presetOptions.hfConfigPath and presetOptions.tokenizer.
    config/crd/bases/kaito.sh_inferencesets.yaml | Adds CRD schema for presetOptions.hfConfigPath and presetOptions.tokenizer for InferenceSets.
    charts/kaito/workspace/templates/supported-models-configmap.yaml | Keeps Helm-generated supported models config in sync with base tag 0.3.1.
    charts/kaito/workspace/templates/kaito.sh_workspaces.yaml | Mirrors CRD schema additions in Helm chart template.
    charts/kaito/workspace/templates/kaito.sh_inferencesets.yaml | Mirrors CRD schema additions in Helm chart template.
    api/v1beta1/workspace_validation.go | Threads hfConfigPath/tokenizer into preset lookup during validation/resource checks.
    api/v1beta1/workspace_types.go | Adds PresetOptions.HFConfigPath and PresetOptions.Tokenizer to the API type.
    api/v1beta1/inference_config_validation.go | Threads hfConfigPath/tokenizer into preset lookup for max-model-len validation.


    Comment thread api/v1beta1/workspace_validation.go Outdated
    Comment on lines 497 to 502
    if inference.Preset != nil {
    	hfConfigPath = inference.Preset.PresetOptions.HFConfigPath
    	tokenizer = inference.Preset.PresetOptions.Tokenizer
    }
    modelPreset, err := models.GetModelByName(context.TODO(), presetName, secretName, wsNamespace, hfConfigPath, tokenizer, k8sclient.Client) // InferenceSpec has been validated so the name is valid.
    if err != nil {
    Comment on lines 98 to 102
     if w.Inference != nil && w.Inference.Preset != nil {
     	presetName := strings.ToLower(string(w.Inference.Preset.Name))
     	if plugin.IsValidPreset(presetName) {
    -		modelPreset, err := models.GetModelByName(ctx, presetName, w.Inference.Preset.PresetOptions.ModelAccessSecret, w.Namespace, k8sclient.Client)
    +		modelPreset, err := models.GetModelByName(ctx, presetName, w.Inference.Preset.PresetOptions.ModelAccessSecret, w.Namespace, w.Inference.Preset.PresetOptions.HFConfigPath, w.Inference.Preset.PresetOptions.Tokenizer, k8sclient.Client)
     		if err != nil {
    Comment on lines +184 to +201
    # Case 2: DAR or HF cache - ask huggingface_hub for the repo_id.
    all_repos = []
    default_hf_cache = Path.home() / ".cache" / "huggingface" / "hub"
    for cache_dir in [weights, default_hf_cache]:
        if not cache_dir.exists():
            continue
        try:
            cache_info = scan_cache_dir(cache_dir=str(cache_dir))
            all_repos.extend(cache_info.repos)
        except Exception:
            pass

    if all_repos:
        repos = sorted(all_repos, key=lambda r: r.repo_id)
        # repo_id is the HuggingFace model identifier (e.g. "microsoft/Phi-3-mini-4k-instruct"),
        # not a local path. guidellm/vLLM accept it as the --processor value and resolve the
        # tokenizer from the HF Hub (or local cache if HF_HUB_OFFLINE is set).
        return repos[0].repo_id
    Comment on lines +398 to +408
    // selectGGUFFile finds the GGUF file matching the requested quant type.
    // It matches files containing the quant type string (case-insensitive).
    func (g *Generator) selectGGUFFile(ggufs []FileInfo) []FileInfo {
    	quantUpper := strings.ToUpper(g.QuantType)
    	for _, f := range ggufs {
    		fileUpper := strings.ToUpper(f.Path)
    		if strings.Contains(fileUpper, quantUpper) {
    			return []FileInfo{f}
    		}
    	}
    	return nil
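
For comparison with the first-match behaviour in this excerpt, here is a runnable sketch of a collect-all variant along the lines of the bot's suggestion. The FileInfo type is reduced to a Path plus an assumed Size field, selectGGUFFiles is a hypothetical free function rather than the PR's method, and the file names are invented for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// FileInfo is a simplified stand-in; Size is an assumed field for illustration.
type FileInfo struct {
	Path string
	Size int64
}

// selectGGUFFiles returns every file whose path contains the quant type
// (case-insensitive), so all shards of a split model are kept.
func selectGGUFFiles(ggufs []FileInfo, quantType string) []FileInfo {
	quantUpper := strings.ToUpper(quantType)
	var selected []FileInfo
	for _, f := range ggufs {
		if strings.Contains(strings.ToUpper(f.Path), quantUpper) {
			selected = append(selected, f)
		}
	}
	return selected // nil when nothing matches
}

func main() {
	files := []FileInfo{
		{Path: "model-Q4_K_M-00001-of-00002.gguf", Size: 10},
		{Path: "model-Q4_K_M-00002-of-00002.gguf", Size: 12},
		{Path: "model-Q8_0.gguf", Size: 40},
	}
	var total int64
	for _, f := range selectGGUFFiles(files, "q4_k_m") {
		total += f.Size
	}
	fmt.Println("matched size:", total)
}
```

Because the match is a plain substring test, a short quant type such as Q4_K would also match files named Q4_K_M or Q4_K_S; whether that matters depends on which quant strings users are expected to pass.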
    Comment thread api/v1beta1/workspace_types.go Outdated
    // This is needed for GGUF model repos that do not include a config.json file.
    // When set, --hf-config-path is passed to vLLM at runtime. Example: "Qwen/Qwen3-4B".
    // +optional
    HFConfigPath string `json:"hfConfigPath,omitempty"`

    Collaborator

    Do you have an example of this field? In L80, the example "Qwen/Qwen3-4B" is the HF model card ID, which is the same as ModelName.

    Comment thread api/v1beta1/workspace_types.go Outdated
    // This is useful for GGUF models whose embedded tokenizer may not load correctly.
    // When set, --tokenizer is passed to vLLM at runtime. Example: "Qwen/Qwen3-4B".
    // +optional
    Tokenizer string `json:"tokenizer,omitempty"`

    Collaborator

    Do you have an example of this field? Is the example "Qwen/Qwen3-4B" wrong here?

    @zhehli688 zhehli688 marked this pull request as draft May 13, 2026 16:54

    Development

    Successfully merging this pull request may close these issues.

    Support quantized models (GGUF and AWQ formats)

    3 participants