DwarfStar 4 (Abliterated)

This is a personal fork of antirez/ds4, Salvatore Sanfilippo's hand-written C inference engine for DeepSeek V4 Flash, with three changes added so that the cyberneurova abliterated GGUF loads and runs end-to-end on Apple Silicon. The result is a 284B-parameter mixture-of-experts model running on a single laptop, with the steerability dial under the user's control, at about 440 tokens per second for prefill and about 30 tokens per second for decode on an M5 Max.

The personal-computing arithmetic that tipped the mainframe-versus-PC balance in the 1980s now applies to frontier AI: the centralized service is more powerful in aggregate, but the slice each user actually receives has been overtaken by the machine on the desk (or lap).

What this fork adds

main is antirez/ds4@HEAD with three things merged in:

1. feat(loader): support stock-recipe (Q8_0/F32) GGUFs end-to-end on Metal (branch support-q8_0-token-embd, sent upstream as PR #60). Makes ds4 accept GGUFs that the upstream llama.cpp converter produces without per-tensor type overrides: files where most small projections are Q8_0 and the routed-expert router is F32, instead of antirez's hand-tuned recipe where they all stay F16. The motivating case is the cyberneurova GGUFs, but the change is generic and unblocks any stock-recipe DS4 file.
2. ivanfioravanti's PR #15: Add Metal 4 M5 prefill optimizations. M5-class Metal 4 (MPP) tensor-API paths for Q8_0 dense matmul, attention output low-projection, and staged routed-MoE projections, plus a fused six-expert routed-MoE sum kernel. ~1.5x prefill speedup on M5 Max for q2 prompts.
3. fix(metal): correct M5 MPP + Q8_0 ape compressor for stock-recipe GGUFs (branch m5-support-q8_0-token-embd). Two fixes that close a regression where items 1 and 2 together produced garbage output (BOS-token spam) for stock-recipe Q8_0 APE on M5: a CPU-side dequant for the prefill compressor APE byte-strided path (with a per-call MTLBuffer to avoid an encode-time race on the shared scratch), and a Q8_0 branch in the decode-time kernel_dsv4_compressor_store_one Metal kernel.
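
For readers unfamiliar with Q8_0, the following is a minimal sketch of what a CPU-side dequant of that format involves. It is not the code from this fork: it only assumes GGML's Q8_0 block layout (one fp16 scale followed by 32 int8 values, 34 bytes per block), and the names fp16_to_fp32 and dequantize_q8_0 are hypothetical helpers chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define QK8_0 32  /* values per Q8_0 block in GGML */

/* Assumed to mirror GGML's block_q8_0: an fp16 scale plus 32 int8 quants. */
typedef struct {
    uint16_t d;          /* scale, stored as IEEE 754 binary16 bits */
    int8_t   qs[QK8_0];  /* quantized values */
} block_q8_0;

_Static_assert(sizeof(block_q8_0) == 34, "Q8_0 block should be 34 bytes");

/* Portable binary16 -> binary32 conversion (hypothetical helper). */
static float fp16_to_fp32(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  = h & 0x3FFu;
    uint32_t bits;
    if (exp == 0) {
        if (man == 0) {
            bits = sign;  /* signed zero */
        } else {
            /* Subnormal half: shift the mantissa up until it is normalized. */
            int shift = 0;
            while ((man & 0x400u) == 0) { man <<= 1; shift++; }
            man &= 0x3FFu;
            bits = sign | ((uint32_t)(113 - shift) << 23) | (man << 13);
        }
    } else if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (man << 13);  /* infinity or NaN */
    } else {
        bits = sign | ((exp + 112u) << 23) | (man << 13);  /* rebias 15 -> 127 */
    }
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Expand nblocks Q8_0 blocks into nblocks * QK8_0 floats: y = d * q. */
static void dequantize_q8_0(const block_q8_0 *blocks, size_t nblocks, float *out) {
    assert(blocks != NULL && out != NULL);
    for (size_t b = 0; b < nblocks; b++) {
        const float d = fp16_to_fp32(blocks[b].d);
        for (int i = 0; i < QK8_0; i++) {
            out[b * QK8_0 + i] = d * (float)blocks[b].qs[i];
        }
    }
}
```

A caller would pass a pointer to the tensor's blocks and an output buffer of nblocks * 32 floats. How the fork's actual fix strides through the APE tensor and hands the result to Metal is not shown here.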

If you only want one of these, use the corresponding branch directly:

audreyt/ds4
├── main                          — all three merged
├── m5-support-q8_0-token-embd    — loader PR + ivan's PR #15 + the M5/cyber fix
├── support-q8_0-token-embd       — just the stock-recipe loader PR (the one I sent upstream)
└── (PR #15 lives at https://github.com/ivanfioravanti/ds4/tree/codex/metal4-m5-scaffold)

Separately, dir-steering/out/uncertainty.f32 (and its .json metadata) is a steering-direction artifact scoped to this fork — sitting alongside upstream's verbosity.f32 example. Unlike items 1–3 above, it hasn't been sent upstream. The runtime wiring that uses it (default ffn=-3 plus a hedge system prompt) lives in audreyt/pi-ds4; see its Directional steering section for what it does, why uncertainty rather than stance, and how to turn it off.

Why this fork exists

I run audreyt/pi-ds4 on a MacBook M5 Max and wanted the cyberneurova abliterated DeepSeek V4 Flash GGUFs to load without pre-converting the file or running a separate inference engine.

Stock antirez/ds4 rejects them at the loader — the recipes differ in ~360 tensor headers across 12 families. The loader PR (item 1 above) closes those gaps so the file just loads.

What's verified on M5

GGUF	flags	result
antirez recipe (q2 / q4)	defaults	works as upstream
cyberneurova `*-Q2_K.gguf`	defaults	works end-to-end including PR #15's MPP F16 prefill

Benchmarks

Prefill throughput on MacBook Pro M5 Max with the cyberneurova Q2_K GGUF (cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf, ~95 GB), --ctx 32768, 3 repeats averaged. "MPP off" sets DS4_METAL_MPP_DISABLE=1 (effectively the non-M5 path); "MPP on" is the default for audreyt/ds4 main, which includes ivan's PR #15 plus the M5/cyber compressor fix. Same command shape ivan used in PR #15: ./ds4 --prompt-file <prompt> -n 1 --nothink --ctx 32768.

prompt tokens	MPP off avg tok/s	MPP on avg tok/s	speedup
533	261.6	442.4	1.69x
2008	362.4	625.5	1.73x
4107	308.4	553.8	1.80x
8126	279.6	387.1	1.38x
16300	273.2	413.8	1.51x
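
The speedup column is the ratio of the two throughput columns, and dividing a prompt length by a throughput gives a rough prefill time. The sketch below recomputes both from the table's numbers. The struct and function names are hypothetical, and the prefill time is only an estimate that ignores model load and the first decode step.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the benchmark table above. */
typedef struct {
    int    prompt_tokens;
    double mpp_off_tok_s;
    double mpp_on_tok_s;
} bench_row;

/* Values copied from the table; nothing here is newly measured. */
static const bench_row ROWS[] = {
    {   533, 261.6, 442.4 },
    {  2008, 362.4, 625.5 },
    {  4107, 308.4, 553.8 },
    {  8126, 279.6, 387.1 },
    { 16300, 273.2, 413.8 },
};
static const size_t NUM_ROWS = sizeof ROWS / sizeof ROWS[0];

/* Speedup = MPP-on throughput divided by MPP-off throughput. */
static double speedup(const bench_row *r) {
    assert(r->mpp_off_tok_s > 0.0);
    return r->mpp_on_tok_s / r->mpp_off_tok_s;
}

/* Rough prefill time in seconds for a prompt at a given throughput. */
static double prefill_seconds(int prompt_tokens, double tok_s) {
    assert(tok_s > 0.0);
    return (double)prompt_tokens / tok_s;
}
```

For the 16300-token row, prefill_seconds gives roughly 39 s with MPP on versus roughly 60 s with MPP off.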

The strategic point in those numbers: a single user on a single laptop is seeing prefill throughput in the same range commercial frontier-AI APIs deliver per user. Not because the M5 Max is faster than an H200 (it isn't), but because the laptop serves a batch of one, whereas a commercial operator amortizes each accelerator across N concurrent requests.
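
The amortization argument can be written as a rough inequality. This is a simplification that assumes an operator's aggregate prefill throughput is split evenly across N concurrent requests. The symbols R_agg (aggregate accelerator throughput), r_laptop (laptop throughput), and N are introduced here for illustration and do not correspond to any measurement in this README.

```latex
r_{\text{user}} \approx \frac{R_{\text{agg}}}{N},
\qquad
r_{\text{laptop}} \ge r_{\text{user}}
\iff
N \ge \frac{R_{\text{agg}}}{r_{\text{laptop}}}
```

Under that assumption, the laptop matches the per-user slice once the operator's concurrency reaches the ratio of the two throughputs, however much faster the shared accelerator is in aggregate.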

Time-to-first-token on a personal device now compares favorably to commercial APIs for many model classes. Decode throughput sits around 24-37 tok/s and doesn't change consistently with MPP, which matches PR #15's design (MPP is a prefill-only optimization). Prompts were built by concatenating this README's text and trimming to approximate target token counts.

Build and run

Build is unchanged from upstream:

make

To run against an unmodified cyberneurova GGUF on M5:

ln -sfn /path/to/cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf ./ds4flash.gguf
./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192

To run against an antirez-recipe q2/q4 file, the upstream download_model.sh flow works as-is:

./download_model.sh q2
./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192

For a one-line install that handles the engine build, model download, and server lifecycle automatically, see audreyt/pi-ds4.

Acknowledgements

Salvatore Sanfilippo (antirez) for ds4 and the llama.cpp-deepseek-v4-flash converter that both this fork and the cyberneurova GGUFs depend on.
Ivan Fioravanti (ivanfioravanti) for the M5 Metal 4 / MPP optimization work in PR #15.
Georgi Gerganov and the llama.cpp / GGML community for the GGUF format, Metal kernel infrastructure, and quantization formats that all of this is built on. ds4's LICENSE retains the GGML copyright notice for that reason.
The cyberneurova research project for publishing the DeepSeek-V4-Flash abliterated GGUFs in the stock llama.cpp recipe — the motivating case for the loader PR.

The original upstream README (project design philosophy, model card, server/CLI documentation, disk KV cache format, test vectors) lives at antirez/ds4#readme. I haven't duplicated it here so this file stays focused on what's different about this fork.

License

MIT, matching upstream. See LICENSE.
