DUAL RTX 3090 QWEN 3.6-27B TUNING: 22.8 TOK/S WITH LLAMA.CPP
Default llama.cpp settings on a dual RTX 3090 rig with Ryzen 9 9950X3D deliver 19 tok/s — and fail to load a 163K context window at all. After 25 Bayesian optimisation trials across five tuning phases, the final configuration reaches 22.8 tok/s generation with 1,638 tok/s prefill bursts and enables 163,840-token context without OOM.
Who This Is For: Multi-GPU llama.cpp users who want every last tok/s, anyone fighting 24 GB VRAM limits at large context, Ryzen 9950X3D owners wondering if V-Cache matters for inference, or anyone who hit silent CUDA OOMs with speculative decoding. Not for Ollama one-command setups.
What You Will Learn: Optimal thread count for dual-CCD Zen 5, why Q8_0 KV cache is mandatory at 163K context, reliable speculative decoding on dual GPUs, batch-size scaling impact on prefill, and the exact v8.4 configuration that hits 22.8 tok/s.
Hardware Baseline
The tuning methodology transfers to any multi-GPU setup; these are the exact specs tested.
- GPUs: 2x RTX 3090 (24 GB each, 48.2 GB usable, PCIe 4.0 x16, 115% power limit)
- CPU: Ryzen 9 9950X3D (16C/32T, Zen 5). CCD 0: 96 MB L3 (64 MB V-Cache + 32 MB on-die). CCD 1: 32 MB L3 only.
- Engine: llama.cpp custom build (llama-ultimate:v4.5-stable-pgo-zen5), GCC 14 with PGO+LTO targeting Zen 5, approx b9119-b9272 (mid-May 2026).
- Model: Qwen3.6-27B-UD-Q6_K_XL GGUF (Unsloth), ~20.5 GB on disk
- Draft model: Qwen3-0.6B (shared tokeniser family for max acceptance rates)
Phase 1: Bayesian Optimisation
Default llama.cpp settings (8 threads, 1024 batch, 512 ubatch, F16 KV, no flash-attn) delivered 911 tok/s prefill and 19.24 tok/s generation — leaving ~20% on the table.
Using llama-optimus (25 Bayesian trials), the optimised configuration converged on 14 threads with 4,944 batch size and 1,024 ubatch. The 14-thread count leaves two of the V-Cache CCD’s 16 logical threads free for OS scheduling without oversubscription.
--threads 14 --threads-batch 14 --batch-size 4096 --ubatch-size 1024
Batch size standardised to 4,096 (power-of-two maps cleaner to CUDA kernels; microbenchmarks showed no meaningful difference from 4,944). Results: 1,042 tok/s prefill (+14.4%) and 22.81 tok/s generation (+18.6%).
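The percentage gains quoted above follow directly from the before and after throughput figures. A minimal sketch of that arithmetic, using only the numbers stated in this section (the helper name `pct_change` is illustrative):

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Throughput figures quoted in this section (tok/s): defaults vs. Phase 1 result.
prefill_gain = pct_change(911, 1042)
generation_gain = pct_change(19.24, 22.81)

print(f"prefill:    {prefill_gain:+.1f}%")     # prefill:    +14.4%
print(f"generation: {generation_gain:+.1f}%")  # generation: +18.6%
```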
Phase 2: 163K Context Bottleneck
At F16, the KV cache for 163K tokens in a 27B model requires ~40 GB. Adding 20.5 GB of weights and 1.5 GB of activation buffers brings the total to ~62 GB, which exceeds the 48.2 GB available by ~14 GB.
Fix: KV cache quantisation. F16 → Q8_0 drops KV cache from 40 GB to 20.2 GB. Total: ~42.2 GB, leaving ~6 GB headroom.
--ctx-size 163840 --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on
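Flash attention is mandatory: it avoids materialising the full attention score matrix, so attention scratch memory grows linearly with context length rather than quadratically, and llama.cpp needs it enabled to use a quantised V cache, which is what makes the Q8_0 cache work with layer split. Without it, larger batches produce silent attention corruption.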
VRAM breakdown:
- Model weights (Q6_K_XL): ~20.5 GB
- KV cache (163K @ Q8_0): ~20.2 GB
- Activation buffers (4K batch): ~1.5 GB
- Total: ~42.2 GB / 48.2 GB available = ~6 GB headroom
That buffer is thin — activation buffers spike during large-batch prefill. If you hit OOM, reduce --batch-size to 2,048 or --ctx-size to 131,072.
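The budget above is simple addition, but it is worth scripting so you can plug in your own model and card sizes. A minimal sketch using the figures from this section; the `VramBudget` class is an illustrative helper, not part of llama.cpp, and the component sizes are the approximate values quoted here rather than measured allocations:

```python
from dataclasses import dataclass


@dataclass
class VramBudget:
    """Approximate VRAM components in GB (values taken from the article)."""
    weights_gb: float
    kv_cache_gb: float
    activations_gb: float
    available_gb: float

    @property
    def total_gb(self) -> float:
        return self.weights_gb + self.kv_cache_gb + self.activations_gb

    @property
    def headroom_gb(self) -> float:
        # Negative headroom means the configuration cannot fit.
        return self.available_gb - self.total_gb


budgets = {
    "F16 KV": VramBudget(20.5, 40.0, 1.5, 48.2),
    "Q8_0 KV": VramBudget(20.5, 20.2, 1.5, 48.2),
}

for name, b in budgets.items():
    status = "fits" if b.headroom_gb >= 0 else "OOM"
    print(f"{name}: total {b.total_gb:.1f} GB, headroom {b.headroom_gb:+.1f} GB ({status})")
```

Activation buffers are the term most likely to be underestimated, so treat any result with only a few GB of headroom as marginal.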
Phase 3: Speculative Decoding Without Silent OOM
The first attempt failed with a "failed to create draft context" error followed by a CUDA OOM. By default llama.cpp allocates the full 163K context for both the main and draft models: 20.5 GB main + 1.2 GB draft + 40 GB for the two KV caches exceeds the 48.2 GB of VRAM before inference starts.
Fix: Explicitly cap draft context. A draft model only needs local coherence over the draft window, not 163K tokens.
--spec-type draft-model --model /path/to/Qwen3.6-27B-UD-Q6_K_XL.gguf \
--spec-draft-model /path/to/Qwen3-0.6B.gguf --ctx-size 163840 \
--ctx-size-draft 4096 --spec-draft-n-max 32 --spec-draft-p-min 0.4
The 32-token window with --spec-draft-p-min 0.4 is aggressive. On coding tasks, acceptance hits 100% on boilerplate and otherwise ranges from 35% to 99% depending on complexity.
v7.1 Performance
| Metric | Without Spec Decode | With Spec Decode (v7.1) | Change |
|---|---|---|---|
| Prefill (35K context) | ~372 tok/s | ~428 tok/s | +15.1% |
| Generation (35K context) | ~9.8 tok/s | ~12.5 tok/s | +27.5% |
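Acceptance rate is the real variable. At 100%, generation approaches draft-model speed. At 35%, speculative decoding barely beats the main model alone once verification overhead is counted. The 22.8 tok/s from Phase 1 drops to 12.5 tok/s at 35K context for two reasons: every generated token now attends over a much longer KV cache (the non-speculative baseline falls to ~9.8 tok/s), and acceptance degrades as context grows because the draft model loses coherence over long ranges.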
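To see why acceptance rate dominates, it helps to use the standard textbook model of speculative decoding. The sketch below is that idealised model, not a description of llama.cpp's internals. It assumes each drafted token is accepted independently with probability `alpha`, that the full window of `gamma` tokens is always drafted (so it ignores the early stopping that a p-min cutoff provides), and that a draft pass costs a fixed fraction `cost_ratio` of a main-model pass. The `COST_RATIO` value is hypothetical, not measured on this rig.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per verification pass.

    alpha: probability that each drafted token is accepted (assumed independent).
    gamma: number of tokens drafted per pass.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        # Every draft token accepted, plus the one token the main model adds.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Speedup over plain decoding; cost_ratio = draft pass cost / main pass cost."""
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1.0)


COST_RATIO = 0.03  # hypothetical draft/main cost ratio, not measured

for gamma in (4, 32):
    for alpha in (0.35, 0.70, 0.99):
        speedup = expected_speedup(alpha, gamma, COST_RATIO)
        print(f"gamma={gamma:2d} alpha={alpha:.2f} speedup={speedup:.2f}x")
```

Under these assumptions a long fixed window only pays off at high acceptance, because rejected draft tokens are pure overhead. That is the cost a confidence cutoff such as --spec-draft-p-min is meant to limit.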
Phase 4: V-Cache CCD Alignment
The 9950X3D has asymmetric CCDs: CCD 0 has 96 MB L3 (64 MB V-Cache + 32 MB on-die), CCD 1 has 32 MB. The earlier 14/6 thread split (20 total) oversubscribed the 8-core, 16-thread V-Cache CCD. Every context switch evicts the V-Cache lines, negating the advantage.
Fix: Hard-pin all inference threads to CCD 0 (cores 0-7, SMT siblings 16-23) using exactly 16 threads. 12 for main model, 4 for draft. Zero migration, no oversubscription.
--threads 12 --threads-batch 12 --draft-threads 4 --spec-draft-p-min 0.7
--spec-draft-p-min raised to 0.7 reduces expensive rollbacks — the draft model only attempts 32-token windows at high confidence. Thread affinity via taskset or container cpuset:
docker run --cpuset-cpus="0-7,16-23" ...
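If you launch the server from a Python wrapper rather than Docker, the same pinning can be sketched with the standard library. This is a minimal, Linux-only illustration: `parse_cpuset` and `pin_current_process` are illustrative helper names, and the assumption that logical CPUs 0-7 and 16-23 belong to the V-Cache CCD comes from this article's machine, so check your own topology (for example with `lscpu -e`) before reusing the list.

```python
import os


def parse_cpuset(spec: str) -> set[int]:
    """Expand a cpuset string such as "0-7,16-23" into a set of CPU ids."""
    cpus: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_current_process(spec: str) -> set[int]:
    """Pin the calling process (and children it spawns later) to the cpuset.

    Linux only: os.sched_setaffinity is not available on macOS or Windows.
    """
    os.sched_setaffinity(0, parse_cpuset(spec))
    return os.sched_getaffinity(0)


ccd0 = parse_cpuset("0-7,16-23")
print(len(ccd0))  # 16
```

Calling `pin_current_process("0-7,16-23")` before spawning llama-server with `subprocess` would make the child inherit the affinity mask, which has the same effect as prefixing the command with taskset.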
Peak throughput barely changed. The win was consistency: v7.1 occasionally dropped to single-digit tok/s from cache thrashing, while v7.2 (the CCD-aligned configuration above) holds steady throughout a session.
Phase 5: 8K Batching
Dual 3090s have parallel compute that a 4K batch does not fully saturate during prefill. Moving to 8K exploits this directly.
--batch-size 8192 --ubatch-size 4096 --flash-attn on
The 4K→8K batch shift delivers +77% average prefill throughput at 95K context. Burst rate jumps to 1,638 tok/s (+309%). GPUs spend more time computing, less time waiting for CPU feeding. Generation improves +44% (4.03 tok/s) — a smaller gain because autoregressive decoding is latency-bound, not throughput-bound.
| Metric | v7.2 (CCD Aligned) | v8.4 | Change |
|---|---|---|---|
| Prefill (95K context) | ~215 tok/s | ~381 tok/s | +77% |
| Prefill burst | ~400 tok/s | ~1,638 tok/s | +309% |
| Generation (95K context) | ~2.8 tok/s | 4.03 tok/s | +44% |
| Draft yield | ~11% | ~22.7% | +106% |
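In practical terms, the prefill figures in the table translate into how long you wait before the first token on a long prompt. A rough sketch, assuming the quoted rates hold as averages over the whole 95K-token prompt (a simplification, since prefill speed varies as the context fills):

```python
def prefill_seconds(prompt_tokens: int, tok_per_s: float) -> float:
    """Time to ingest a prompt at a constant average prefill rate."""
    return prompt_tokens / tok_per_s


PROMPT_TOKENS = 95_000

# Average prefill rates at 95K context, from the table above (tok/s).
for label, rate in (("v7.2", 215.0), ("v8.4", 381.0)):
    minutes = prefill_seconds(PROMPT_TOKENS, rate) / 60
    print(f"{label}: {minutes:.1f} min")
# v7.2: 7.4 min
# v8.4: 4.2 min
```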
Power Limits
8K batches cause high-frequency power transients: both GPUs draw maximum current during prefill. Raise the power limit to 115% (242 W per GPU) for power headroom. Without it, the GPUs throttle clock speeds and negate the benefit.
MTP vs Draft-Model Speculative Decoding
Qwen 3.6 includes native Multi-Token Prediction heads. llama.cpp added support via PR #22673 (mid-May 2026). MTP predicts multiple future tokens from the main model’s hidden states — no separate draft model needed.
--spec-type draft-mtp --spec-draft-n-max 3
Use 2-3 draft tokens; going above 5 produces diminishing returns. Official benchmarks show ~75% acceptance at 3 tokens, delivering 2x+ speedup.
MTP supports Tensor and Pipeline Parallelism but not yet -np > 1 (parallel decoding) or --mmproj (vision encoder). On dual GPU setups, draft-model speculation is more reliable today: the MTP draft head loads unevenly across GPUs in split mode, and prefill throughput takes a hit from host-to-device embedding transfers. Single GPU users should prefer MTP (simpler, no second model VRAM). Dual GPU users should use the Qwen3-0.6B draft model approach from Phase 3.
MTP-native models — architectures trained from scratch with multi-token prediction heads rather than retrofitting them onto existing models — are the next frontier in inference efficiency. I am benchmarking several MTP-first models (including Qwen 3 dense MTP models) on this dual 3090 rig and will publish a dedicated MTP tuning guide covering head-aware batch sizing, optimal draft depths, vLLM speculative decoding configuration, and GPU split strategies specific to native MTP architectures.
Complete Configuration
The exact config that delivered v8.4 results. Requires llama-server from a build after PR #22673.
llama-server \
--model /models/Qwen3.6-27B-UD-Q6_K_XL.gguf \
--spec-draft-model /models/Qwen3-0.6B.gguf \
--ctx-size 163840 \
--ctx-size-draft 4096 \
--batch-size 8192 \
--ubatch-size 4096 \
--threads 12 \
--threads-batch 12 \
--draft-threads 4 \
--flash-attn on \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--split-mode layer \
--spec-type draft-model \
--spec-draft-n-max 32 \
--spec-draft-p-min 0.7 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--host 0.0.0.0 \
--port 8080
With --cpuset-cpus="0-7,16-23" and 115% GPU power limits:
- 22.8 tok/s generation (low context) · 1,042 tok/s prefill (low context)
- 4.03 tok/s generation (95K context) · 381 tok/s prefill (95K context)
- ~22.7% draft token yield
Limitations
Not everything responds to tuning.
Generation degrades with context length. At 95K tokens into the 163K window, it runs at 2.8-4.0 tok/s. Every new token requires attention over all previous tokens, and no thread or batch tuning bypasses that.
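The slowdown is easier to feel as per-token latency than as tok/s. A small sketch converting the generation rates reported in this article; note that the three figures come from different configuration versions (Phase 1, v7.1 and v8.4), so this illustrates the trend rather than a controlled comparison:

```python
def ms_per_token(tok_per_s: float) -> float:
    """Convert a generation rate into average latency per token."""
    return 1000.0 / tok_per_s


# (context label, reported generation rate in tok/s)
reported = [("low context", 22.8), ("35K context", 12.5), ("95K context", 4.03)]

for label, rate in reported:
    print(f"{label}: {ms_per_token(rate):.1f} ms/token")
# low context: 43.9 ms/token
# 35K context: 80.0 ms/token
# 95K context: 248.1 ms/token
```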
Draft yield degrades with context length. Qwen3-0.6B accepts well on short contexts but drops significantly as context grows. A larger draft model would consume VRAM that is already scarce.
Layer split leaves one GPU idle during parts of computation. True tensor parallelism (--split-mode tensor) would be more efficient but requires F16/BF16 KV cache and NCCL — neither fits the 48 GB VRAM budget at 163K context.
Power draw is substantial. Two 3090s at 115% pull ~700 W under load (~5.6 kWh per 8-hour session, ~16.8 kWh if run around the clock). At UK rates, that is ~£4-5 per 24-hour day, plus ~£1 for the 9950X3D at 170 W.
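The running cost is easy to recompute for your own tariff. A minimal sketch using the wattages quoted above; the `TARIFF_GBP_PER_KWH` value is an assumed round figure for illustration, not a quoted rate, so substitute your own:

```python
def energy_kwh(watts: float, hours: float) -> float:
    """Energy used at a constant draw, in kWh."""
    return watts * hours / 1000.0


GPU_WATTS = 700.0          # both 3090s under load (from the article)
CPU_WATTS = 170.0          # 9950X3D (from the article)
TARIFF_GBP_PER_KWH = 0.27  # assumption: illustrative tariff

session_kwh = energy_kwh(GPU_WATTS, 8)
gpu_day_cost = energy_kwh(GPU_WATTS, 24) * TARIFF_GBP_PER_KWH
cpu_day_cost = energy_kwh(CPU_WATTS, 24) * TARIFF_GBP_PER_KWH

print(f"8-hour session: {session_kwh:.1f} kWh")       # 8-hour session: 5.6 kWh
print(f"GPUs, 24 h:     GBP {gpu_day_cost:.2f}")      # GPUs, 24 h:     GBP 4.54
print(f"CPU, 24 h:      GBP {cpu_day_cost:.2f}")      # CPU, 24 h:      GBP 1.10
```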
Related Guides
The general Qwen 3.6-27B deployment guide covers running the model at different hardware tiers. This article is the opposite: a hardware-locked tuning sequence for one engine and one GPU config. Single RTX 4090 or Apple Silicon? Use the deployment guide. Dual 3090s and want every tok/s? Start here.
For broader context: best local LLM models 2026, self-hosted LLM guide, and the $1,000 local LLM rig guide explaining why dual 3090s are the value sweet spot.
Reproducing These Results
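Apply the full v8.4 configuration as a unit rather than changing flags one at a time. Benchmark raw prefill and generation throughput with llama.cpp's llama-bench tool, and read the per-request timing lines in the llama-server log for end-to-end numbers with speculative decoding.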
Verify on your hardware:
- Low-context prefill > 1,000 tok/s
- Low-context generation > 20 tok/s
- Non-zero draft token acceptance (check server logs for draft tokens accepted)
- VRAM < 46 GB during prefill (nvidia-smi)
If you hit OOM: reduce --batch-size to 4,096 first, then --ctx-size to 131,072. If speculative decoding fails to initialise, confirm that your build includes PR #22673 and that the draft model path is correct.
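The throughput checks can be automated against a running server. The sketch below assumes the server's `/completion` endpoint returns a `timings` object containing `prompt_per_second` and `predicted_per_second`, as current llama-server builds do; verify the field names against your build. The helper names and the threshold table are illustrative, with the thresholds taken from the checklist above.

```python
import json
import urllib.request

# Minimum acceptable rates from the verification checklist (tok/s).
THRESHOLDS = {"prompt_per_second": 1000.0, "predicted_per_second": 20.0}


def check_timings(timings: dict) -> dict:
    """Return pass/fail per metric; a missing metric counts as a failure."""
    return {
        key: float(timings.get(key, 0.0)) > minimum
        for key, minimum in THRESHOLDS.items()
    }


def fetch_timings(base_url: str, prompt: str, n_predict: int = 128) -> dict:
    """POST a prompt to llama-server and return its `timings` object.

    Assumes the response JSON has a top-level "timings" key.
    """
    payload = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/completion",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["timings"]


# Example (requires a running server):
#   results = check_timings(fetch_timings("http://localhost:8080", "Hello"))
#   print(results)
```

Use a prompt of at least a few thousand tokens for the prefill check, since very short prompts understate prefill throughput.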
Full benchmark data, container images, and config files are in the article repository. The tuning methodology transfers to any hardware — the raw numbers will differ.