feat: support vllm inference for GGUF models #2020
Title: Support vLLM inference for GGUF models with config and tokenizer options
Pull request overview
This PR adds support for deploying GGUF-quantized models with the vLLM runtime via the non-catalog (HuggingFace repo ID) preset path, including optional user-provided tokenizer/config sources to handle GGUF repos that lack tokenizer.json or config.json.
Changes:
- Extend preset options and model generation to support `repo_id:quant_type` GGUF model identifiers, plus optional `hfConfigPath` and `tokenizer` overrides passed through to vLLM.
- Update benchmark processor resolution to also scan the default HuggingFace cache (requires base image tag bump to `0.3.1`).
- Add/adjust unit and e2e tests plus an example Workspace manifest for GGUF.
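As a rough illustration of the `repo_id:quant_type` identifier form mentioned above, the following minimal sketch splits such a string at its last colon. The helper name `splitQuantSuffix` and the repo `example-org/Qwen3-4B-GGUF` are hypothetical and are not taken from the PR; the actual parsing in `generator.go` may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// splitQuantSuffix is a hypothetical helper (not the PR's function).
// It splits "org/repo:QUANT" at the last colon into the repo ID and the
// quant type. Identifiers without a colon are returned unchanged with an
// empty quant type.
func splitQuantSuffix(name string) (repoID, quantType string) {
	idx := strings.LastIndex(name, ":")
	if idx < 0 {
		return name, ""
	}
	return name[:idx], name[idx+1:]
}

func main() {
	for _, name := range []string{
		"example-org/Qwen3-4B-GGUF:Q4_K_M", // hypothetical GGUF identifier
		"Qwen/Qwen3-4B",                    // plain repo ID, no quant suffix
	} {
		repo, quant := splitQuantSuffix(name)
		fmt.Printf("repo=%q quant=%q\n", repo, quant)
	}
}
```

The same split would also give the suffix-free name that the e2e tests derive for the served model, assuming they strip everything from the colon onward.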
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| test/e2e/preset_vllm_test.go | Adds an e2e scenario for a GGUF model preset and sets tokenizer preset option. |
| test/e2e/preset_test.go | Adds GGUF preset constant and strips :quant suffix when deriving served model name in tests. |
| presets/workspace/models/vllm_model.go | Threads hfConfigPath/tokenizer through model resolution and generator options; normalizes registry key casing. |
| presets/workspace/models/vllm_model_test.go | Updates tests for new GetModelByName* signatures and generator invocation. |
| presets/workspace/models/supported_models.yaml | Bumps base image tag to 0.3.1 and documents tag history note. |
| presets/workspace/inference/vllm/benchmark_entrypoint.py | Updates processor resolution to scan both /workspace/weights and default HF cache. |
| presets/workspace/generator/generator.go | Adds GGUF file detection, quant suffix parsing, GGUF vLLM params, and hfConfigPath config fallback; introduces options struct API. |
| presets/workspace/generator/generator_test.go | Adds GGUF-focused unit tests and updates expected normalized repo casing. |
| pkg/workspace/resource/node_claim.go | Passes hfConfigPath/tokenizer into preset lookup when sizing OS disk. |
| pkg/workspace/estimator/nodesestimator/estimator.go | Passes hfConfigPath/tokenizer into model lookup for node estimation. |
| pkg/workspace/estimator/interfaces.go | Extends ModelProfile to include HFConfigPath and Tokenizer. |
| pkg/workspace/controllers/workspace_controller.go | Passes hfConfigPath/tokenizer through model resolution for tuning/inference reconciliation. |
| pkg/utils/workspace/helper.go | Includes HFConfigPath/Tokenizer in node estimation requests derived from Workspace. |
| pkg/model/interface.go | Avoids overwriting generator-provided --model (needed for GGUF repo_id:quant_type). |
| examples/inference/kaito_workspace_qwen_3_4b_gguf.yaml | Adds an example Workspace manifest using GGUF + hfConfigPath + tokenizer. |
| config/crd/bases/kaito.sh_workspaces.yaml | Adds CRD schema for presetOptions.hfConfigPath and presetOptions.tokenizer. |
| config/crd/bases/kaito.sh_inferencesets.yaml | Adds CRD schema for presetOptions.hfConfigPath and presetOptions.tokenizer for InferenceSets. |
| charts/kaito/workspace/templates/supported-models-configmap.yaml | Keeps Helm-generated supported models config in sync with base tag 0.3.1. |
| charts/kaito/workspace/templates/kaito.sh_workspaces.yaml | Mirrors CRD schema additions in Helm chart template. |
| charts/kaito/workspace/templates/kaito.sh_inferencesets.yaml | Mirrors CRD schema additions in Helm chart template. |
| api/v1beta1/workspace_validation.go | Threads hfConfigPath/tokenizer into preset lookup during validation/resource checks. |
| api/v1beta1/workspace_types.go | Adds PresetOptions.HFConfigPath and PresetOptions.Tokenizer to the API type. |
| api/v1beta1/inference_config_validation.go | Threads hfConfigPath/tokenizer into preset lookup for max-model-len validation. |
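
The `pkg/model/interface.go` row describes leaving a generator-provided `--model` value in place. One way to picture that behavior is a set-if-absent write on the parameter map. This is only a sketch: the helper `setIfAbsent`, the map-based parameter shape, and the default value `/workspace/weights` used here are assumptions for illustration, not the PR's code.

```go
package main

import "fmt"

// setIfAbsent is a hypothetical helper: it stores value under key only when
// no value is present yet, and reports whether it wrote anything.
func setIfAbsent(params map[string]string, key, value string) bool {
	if _, ok := params[key]; ok {
		return false // keep the value the generator already supplied
	}
	params[key] = value
	return true
}

func main() {
	// Parameters as a generator might emit them for a GGUF preset
	// (hypothetical identifier).
	params := map[string]string{"model": "example-org/Qwen3-4B-GGUF:Q4_K_M"}

	// A later default (assumed placeholder value) must not clobber it.
	setIfAbsent(params, "model", "/workspace/weights")
	fmt.Println(params["model"])
}
```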
if inference.Preset != nil {
	hfConfigPath = inference.Preset.PresetOptions.HFConfigPath
	tokenizer = inference.Preset.PresetOptions.Tokenizer
}
modelPreset, err := models.GetModelByName(context.TODO(), presetName, secretName, wsNamespace, hfConfigPath, tokenizer, k8sclient.Client) // InferenceSpec has been validated so the name is valid.
if err != nil {
 if w.Inference != nil && w.Inference.Preset != nil {
 	presetName := strings.ToLower(string(w.Inference.Preset.Name))
 	if plugin.IsValidPreset(presetName) {
-		modelPreset, err := models.GetModelByName(ctx, presetName, w.Inference.Preset.PresetOptions.ModelAccessSecret, w.Namespace, k8sclient.Client)
+		modelPreset, err := models.GetModelByName(ctx, presetName, w.Inference.Preset.PresetOptions.ModelAccessSecret, w.Namespace, w.Inference.Preset.PresetOptions.HFConfigPath, w.Inference.Preset.PresetOptions.Tokenizer, k8sclient.Client)
 		if err != nil {
# Case 2: DAR or HF cache — ask huggingface_hub for the repo_id.
all_repos = []
default_hf_cache = Path.home() / ".cache" / "huggingface" / "hub"
for cache_dir in [weights, default_hf_cache]:
    if not cache_dir.exists():
        continue
    try:
        cache_info = scan_cache_dir(cache_dir=str(cache_dir))
        all_repos.extend(cache_info.repos)
    except Exception:
        pass

if all_repos:
    repos = sorted(all_repos, key=lambda r: r.repo_id)
    # repo_id is the HuggingFace model identifier (e.g. "microsoft/Phi-3-mini-4k-instruct"),
    # not a local path. guidellm/vLLM accept it as the --processor value and resolve the
    # tokenizer from the HF Hub (or local cache if HF_HUB_OFFLINE is set).
    return repos[0].repo_id
// selectGGUFFile finds the GGUF file matching the requested quant type.
// It matches files containing the quant type string (case-insensitive).
func (g *Generator) selectGGUFFile(ggufs []FileInfo) []FileInfo {
	quantUpper := strings.ToUpper(g.QuantType)
	for _, f := range ggufs {
		fileUpper := strings.ToUpper(f.Path)
		if strings.Contains(fileUpper, quantUpper) {
			return []FileInfo{f}
		}
	}
	return nil
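
To see how the substring match above behaves on concrete file names, here is a standalone sketch of the same technique. `FileInfo` is reduced to a single assumed `Path` field, the function takes the quant type as an argument instead of reading it from a `Generator`, and the file names are made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// FileInfo is a minimal stand-in for the generator's type; only Path is assumed.
type FileInfo struct {
	Path string
}

// selectByQuant returns the first file whose path contains quantType,
// compared case-insensitively, or nil when nothing matches.
func selectByQuant(files []FileInfo, quantType string) []FileInfo {
	want := strings.ToUpper(quantType)
	for _, f := range files {
		if strings.Contains(strings.ToUpper(f.Path), want) {
			return []FileInfo{f}
		}
	}
	return nil
}

func main() {
	// Hypothetical file listing for a GGUF repo.
	files := []FileInfo{
		{Path: "model-Q4_K_M.gguf"},
		{Path: "model-Q4_K_S.gguf"},
		{Path: "model-Q8_0.gguf"},
	}
	fmt.Println(selectByQuant(files, "q8_0"))   // lower-case request still matches
	fmt.Println(selectByQuant(files, "Q4_K"))   // substring of two names: first listed wins
	fmt.Println(selectByQuant(files, "Q5_K_M")) // no file contains this quant type
}
```

Because the match is a plain substring test, a short quant type such as `Q4_K` selects whichever matching file is listed first. An empty quant type would match the first file as well, unless the caller guards against it.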
// This is needed for GGUF model repos that do not include a config.json file.
// When set, --hf-config-path is passed to vLLM at runtime. Example: "Qwen/Qwen3-4B".
// +optional
HFConfigPath string `json:"hfConfigPath,omitempty"`
Do you have an example of this field? In L80, the example "Qwen/Qwen3-4B" is the HF model card ID, which is the same as ModelName.
// This is useful for GGUF models whose embedded tokenizer may not load correctly.
// When set, --tokenizer is passed to vLLM at runtime. Example: "Qwen/Qwen3-4B".
// +optional
Tokenizer string `json:"tokenizer,omitempty"`
Do you have an example of this field? Is the example "Qwen/Qwen3-4B" wrong here?
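
As a concrete picture of the two fields under discussion, the sketch below turns them into the `--hf-config-path` and `--tokenizer` flags named in the field comments. It assumes the preset name itself is a GGUF repo with a quant suffix, while both overrides point at the base model repo `Qwen/Qwen3-4B`. The struct `presetOptions` and the helper `vllmOverrideArgs` are hypothetical and only mirror the two fields; they are not the PR's implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// presetOptions mirrors only the two optional fields discussed above.
type presetOptions struct {
	HFConfigPath string
	Tokenizer    string
}

// vllmOverrideArgs is a hypothetical helper that emits a flag only for
// fields the user actually set.
func vllmOverrideArgs(opts presetOptions) []string {
	var args []string
	if opts.HFConfigPath != "" {
		args = append(args, "--hf-config-path", opts.HFConfigPath)
	}
	if opts.Tokenizer != "" {
		args = append(args, "--tokenizer", opts.Tokenizer)
	}
	return args
}

func main() {
	// Both overrides point at the base model repo, as in the field comments.
	opts := presetOptions{HFConfigPath: "Qwen/Qwen3-4B", Tokenizer: "Qwen/Qwen3-4B"}
	fmt.Println(strings.Join(vllmOverrideArgs(opts), " "))
}
```

Under that assumption the example value is not the same as the model name: the model name identifies the GGUF repo and quant type, while the overrides identify where the config and tokenizer come from.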
Reason for Change:
GGUF (GPT-Generated Unified Format) is a binary file format for storing large language models (LLMs), introduced by the llama.cpp project and designed for efficient local inference. vLLM currently has experimental support for it: https://docs.vllm.ai/en/stable/features/quantization/gguf/. Caveat: there is an ongoing discussion about whether to keep GGUF support in vLLM: vllm-project/vllm#39583.
This PR enables GGUF model deployment in Kaito through the non-catalog path (model catalog support for GGUF models will be added in another PR). Changes include:
- Added preset options `hfConfigPath` and `tokenizer` and passed them to vLLM when available.
  - `hfConfigPath` supports a user-provided path to the HF model config and is used when a GGUF model is missing config.json.
  - `tokenizer` supports a user-provided path to the base model's tokenizer, as suggested by vLLM (https://docs.vllm.ai/en/stable/features/quantization/gguf/).

Requirements
Issue Fixed:
Fixes #2002
Notes for Reviewers: