What is the best batch size for dual RTX 3090s with Qwen 3.6-27B in llama.cpp?

A --batch-size of 8,192 tokens with a --ubatch-size of 4,096 delivers the best throughput. This keeps both GPUs saturated during prefill without excessive VRAM overhead. The Bayesian optimisation peak was near 4,944 (first standardised to 4,096), but moving to 8,192 gave better real-world performance with speculative decoding active.

How much VRAM does Qwen 3.6-27B need on dual 3090s at 163K context?

With Q6_K_XL quantisation (~20.5 GB for weights), Q8_0 KV cache (~20.2 GB for 163K context), and ~1.5 GB of activation buffers, total usage is approximately 42.2 GB out of 48.2 GB available. This leaves about 6 GB headroom. F16 KV cache is not feasible at this context length: the cache alone needs ~40 GB.

Does speculative decoding work with dual GPUs in llama.cpp?

Yes, but you must explicitly cap the draft context size (--ctx-size-draft 4096) to avoid CUDA OOM. Without this cap, llama.cpp attempts to allocate 163K context for both main and draft models, exceeding 48 GB VRAM. Using the Qwen3-0.6B draft model with 32-token draft windows works well.

What thread count is optimal for a Ryzen 9 9950X3D with llama.cpp?

12 main threads plus 4 draft threads (16 total) aligned to the V-Cache CCD (cores 0-7 and 16-23) gives the best results. Going above 16 threads causes oversubscription of the 8-core V-Cache CCD, leading to context switching overhead and cache thrashing.

What's the real-world generation speed for Qwen 3.6-27B on dual 3090s?

With the fully optimised configuration (v8.4), expect about 4 tok/s at 95K context and 22.8 tok/s at low context. Prompt prefill reaches 381 tok/s average with bursts up to 1,638 tok/s. Speculative decoding acceptance rates vary from 22.7% average to 100% on predictable code.

Can I use MTP (Multi-Token Prediction) instead of a draft model?

Yes, Qwen 3.6 has native MTP support and llama.cpp PR #22673 added support in mid-May 2026. MTP achieves 1.4x-2.2x speedup without needing a separate draft model. MTP works with Tensor and Pipeline Parallelism for multi-GPU setups, but does not yet support vision encoders (--mmproj) or parallel decoding (-np > 1).

DUAL RTX 3090 QWEN 3.6-27B TUNING: 22.8 TOK/S WITH LLAMA.CPP

21/5/2026
Updated 6/6/2026

Default llama.cpp settings on a dual RTX 3090 rig with a Ryzen 9 9950X3D deliver 19 tok/s and cannot load a 163K context window at all. After 25 Bayesian optimisation trials and four further tuning phases, the final configuration reaches 22.8 tok/s generation at low context with 1,638 tok/s prefill bursts, and it loads a 163,840-token context without OOM.

Who This Is For: Multi-GPU llama.cpp users who want every last tok/s, anyone fighting 24 GB VRAM limits at large context, Ryzen 9950X3D owners wondering if V-Cache matters for inference, or anyone who hit silent CUDA OOMs with speculative decoding. Not for Ollama one-command setups.

What You Will Learn: Optimal thread count for dual-CCD Zen 5, why Q8_0 KV cache is mandatory at 163K context, reliable speculative decoding on dual GPUs, batch-size scaling impact on prefill, and the exact v8.4 configuration that hits 22.8 tok/s.

Hardware Baseline

The tuning methodology transfers to any multi-GPU setup; these are the exact specs tested.

GPUs: 2x RTX 3090 (24 GB each, 48.2 GB usable, PCIe 4.0 x16, 115% power limit)
CPU: Ryzen 9 9950X3D (16C/32T, Zen 5). CCD 0: 96 MB L3 (64 MB V-Cache + 32 MB on-die). CCD 1: 32 MB L3 only.
Engine: llama.cpp custom build (llama-ultimate:v4.5-stable-pgo-zen5), GCC 14 with PGO+LTO targeting Zen 5, approx b9119-b9272 (mid-May 2026).
Model: Qwen3.6-27B-UD-Q6_K_XL GGUF (Unsloth), ~20.5 GB on disk
Draft model: Qwen3-0.6B (shared tokeniser family for max acceptance rates)

Phase 1: Bayesian Optimisation

Default llama.cpp settings (8 threads, 1024 batch, 512 ubatch, F16 KV, no flash-attn) delivered 911 tok/s prefill and 19.24 tok/s generation — leaving ~20% on the table.

Using llama-optimus (25 Bayesian trials), the optimised configuration converged on 14 threads with 4,944 batch size and 1,024 ubatch. The 14-thread count leaves two of the V-Cache CCD’s 16 logical threads free for OS scheduling without oversubscription.

--threads 14 --threads-batch 14 --batch-size 4096 --ubatch-size 1024

Batch size standardised to 4,096 (power-of-two maps cleaner to CUDA kernels; microbenchmarks showed no meaningful difference from 4,944). Results: 1,042 tok/s prefill (+14.4%) and 22.81 tok/s generation (+18.6%).
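The percentage gains quoted above follow directly from the before and after throughput figures. A minimal sketch of that arithmetic, using only the numbers in this section (the helper name pct_change is illustrative, not part of any tool):

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Throughput figures quoted in this section, in tok/s.
baseline = {"prefill": 911.0, "generation": 19.24}
tuned = {"prefill": 1042.0, "generation": 22.81}

for metric in baseline:
    delta = pct_change(baseline[metric], tuned[metric])
    print(f"{metric}: {baseline[metric]} -> {tuned[metric]} tok/s ({delta:+.1f}%)")
```

The same helper can be pointed at your own before and after measurements to check whether a flag change is outside run-to-run noise.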

Phase 2: 163K Context Bottleneck

At F16, the KV cache for 163K tokens in a 27B model requires ~40 GB. Add 20.5 GB weights + 1.5 GB activation buffers = 62 GB total, exceeding 48.2 GB by ~14 GB.

Fix: KV cache quantisation. F16 → Q8_0 drops KV cache from 40 GB to 20.2 GB. Total: ~42.2 GB, leaving ~6 GB headroom.

--ctx-size 163840 --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on
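Why Q8_0 roughly halves the cache: in ggml's Q8_0 format each block of 32 values is stored as 32 int8 values plus one 2-byte scale, so 34 bytes where F16 needs 64. The sketch below works that ratio through against the 20.2 GB figure above. It ignores any padding or allocator overhead (an assumption), and the implied F16 size it prints is a back-calculation, not a measurement.

```python
# ggml Q8_0 layout: 32 int8 values plus one fp16 scale per block.
Q8_0_BLOCK_VALUES = 32
Q8_0_BLOCK_BYTES = 32 * 1 + 2  # 34 bytes per block
F16_BYTES_PER_VALUE = 2


def q8_0_to_f16_ratio() -> float:
    """Size of a Q8_0 tensor relative to the same tensor stored as F16."""
    q8_bytes_per_value = Q8_0_BLOCK_BYTES / Q8_0_BLOCK_VALUES
    return q8_bytes_per_value / F16_BYTES_PER_VALUE


def implied_f16_gb(q8_0_gb: float) -> float:
    """Back-calculate the F16 cache size from a Q8_0 cache size."""
    return q8_0_gb / q8_0_to_f16_ratio()


KV_Q8_0_GB = 20.2  # figure quoted in this article

print(f"Q8_0 / F16 size ratio: {q8_0_to_f16_ratio():.5f}")
print(f"Implied F16 KV cache:  {implied_f16_gb(KV_Q8_0_GB):.1f} GB")
```

The ratio is slightly above one half because of the per-block scale, which is consistent with the F16 figure being quoted as an approximate ~40 GB.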

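Flash attention is mandatory: it avoids materialising the full attention score matrix, so attention scratch memory grows linearly with sequence length rather than quadratically, and llama.cpp requires it for a quantised V cache, which is what makes the Q8_0 cache usable with layer split. It does not shrink the KV cache itself. Without it, larger batches produced silent attention corruption in this setup.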

VRAM breakdown:

Model weights (Q6_K_XL): ~20.5 GB
KV cache (163K @ Q8_0): ~20.2 GB
Activation buffers (4K batch): ~1.5 GB
Total: ~42.2 GB / 48.2 GB available = ~6 GB headroom

That buffer is thin — activation buffers spike during large-batch prefill. If you hit OOM, reduce --batch-size to 2,048 or --ctx-size to 131,072.
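The budget above is simple addition, but it is worth scripting so you can swap in your own model size, context, or batch settings. A minimal sketch using the article's figures; the VramBudget class is a hypothetical helper, and the numbers are the approximate values quoted here rather than anything read from the GPU.

```python
from dataclasses import dataclass


@dataclass
class VramBudget:
    """Approximate VRAM budget in GB, using the figures from this article."""
    weights_gb: float
    kv_cache_gb: float
    activations_gb: float
    available_gb: float = 48.2

    @property
    def total_gb(self) -> float:
        return self.weights_gb + self.kv_cache_gb + self.activations_gb

    @property
    def headroom_gb(self) -> float:
        return self.available_gb - self.total_gb

    @property
    def fits(self) -> bool:
        return self.headroom_gb > 0


budgets = {
    "F16 KV cache": VramBudget(weights_gb=20.5, kv_cache_gb=40.0, activations_gb=1.5),
    "Q8_0 KV cache": VramBudget(weights_gb=20.5, kv_cache_gb=20.2, activations_gb=1.5),
}

for name, budget in budgets.items():
    status = "fits" if budget.fits else "does NOT fit"
    print(f"{name}: total {budget.total_gb:.1f} GB, "
          f"headroom {budget.headroom_gb:+.1f} GB ({status})")
```

Raising activations_gb is a quick way to see how much of the headroom a larger batch could consume before the configuration stops fitting.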

Phase 3: Speculative Decoding Without Silent OOM

The first attempt failed with "failed to create draft context" followed by a CUDA OOM. By default llama.cpp allocates the full 163K context for both the main and draft models: 20.5 GB main weights + 1.2 GB draft weights + ~40 GB for the two KV caches exceeds 48.2 GB before inference starts.

Fix: Explicitly cap draft context. A draft model only needs local coherence over the draft window, not 163K tokens.

--spec-type draft-model --model /path/to/Qwen3.6-27B-UD-Q6_K_XL.gguf \
--spec-draft-model /path/to/Qwen3-0.6B.gguf --ctx-size 163840 \
--ctx-size-draft 4096 --spec-draft-n-max 32 --spec-draft-p-min 0.4

The 32-token window with --spec-draft-p-min 0.4 is aggressive. On coding tasks, acceptance hits 100% on boilerplate and averages 35-99% depending on complexity.

v7.1 Performance

Metric	Without Spec Decode	With Spec Decode (v7.1)	Change
Prefill (35K context)	~372 tok/s	~428 tok/s	+15.1%
Generation (35K context)	~9.8 tok/s	~12.5 tok/s	+27.5%

Acceptance rate is the real variable. At 100%, generation approaches draft-model speed. At 35%, the gain barely covers the verification overhead. Note that the 22.8 tok/s from Phase 1 was measured at low context: at 35K the baseline without speculation is already down to 9.8 tok/s because every new token attends over a longer history, and acceptance also degrades as context grows because the draft model loses coherence over long ranges.
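To see why acceptance dominates, here is a minimal sketch of the standard back-of-envelope model for speculative decoding. It assumes each drafted token is accepted independently with the same probability and that the full draft window is always generated. Neither holds exactly in llama.cpp (the p-min cutoff can stop a draft early, and the acceptance percentages in this article may be defined differently), and the draft_cost_ratio value is a hypothetical placeholder, so treat the output as illustrating the shape of the curve rather than predicting tok/s.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass of the main model.

    Assumes every drafted token is accepted independently with probability
    `accept_rate`; the main model always contributes one token itself.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if accept_rate == 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def estimated_speedup(accept_rate: float, draft_len: int,
                      draft_cost_ratio: float) -> float:
    """Speedup over plain decoding under the simplified model.

    `draft_cost_ratio` is the cost of one draft-model token relative to one
    main-model pass. It is an assumed input, not a measured value.
    """
    cost_per_step = draft_len * draft_cost_ratio + 1.0
    return expected_tokens_per_step(accept_rate, draft_len) / cost_per_step


ASSUMED_DRAFT_COST_RATIO = 0.03  # hypothetical; measure on your own rig

for rate in (0.35, 0.70, 0.90, 1.00):
    tokens = expected_tokens_per_step(rate, draft_len=32)
    speedup = estimated_speedup(rate, 32, ASSUMED_DRAFT_COST_RATIO)
    print(f"accept={rate:.2f}: {tokens:5.2f} tokens/step, est. speedup x{speedup:.2f}")
```

Under this model a long draft window only pays off when acceptance is high, which is the motivation for stopping low-confidence drafts early with --spec-draft-p-min.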

Phase 4: V-Cache CCD Alignment

The 9950X3D has asymmetric CCDs: CCD 0 has 96 MB L3 (64 MB V-Cache + 32 MB on-die), CCD 1 has 32 MB. The earlier 14/6 thread split (20 total) oversubscribed the 8-core, 16-thread V-Cache CCD. Every context switch evicts the V-Cache lines, negating the advantage.

Fix: Hard-pin all inference threads to CCD 0 (cores 0-7, SMT siblings 16-23) using exactly 16 threads. 12 for main model, 4 for draft. Zero migration, no oversubscription.

--threads 12 --threads-batch 12 --draft-threads 4 --spec-draft-p-min 0.7

--spec-draft-p-min raised to 0.7 reduces expensive rollbacks — the draft model only attempts 32-token windows at high confidence. Thread affinity via taskset or container cpuset:

docker run --cpuset-cpus="0-7,16-23" ...
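If you launch llama-server from a script instead of a container, the same pinning can be checked and applied in Python. This is a sketch under the article's assumption that logical CPUs 0-7 and 16-23 are the V-Cache CCD and its SMT siblings. The helper names are illustrative, and os.sched_setaffinity is Linux-only.

```python
import os


def parse_cpuset(spec: str) -> set[int]:
    """Expand a cpuset string such as "0-7,16-23" into a set of CPU ids."""
    cpus: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def fits_without_oversubscription(main_threads: int, draft_threads: int,
                                  cpus: set[int]) -> bool:
    """True if every inference thread can have its own logical CPU."""
    return main_threads + draft_threads <= len(cpus)


def pin_current_process(cpus: set[int]) -> None:
    """Linux only: restrict this process (and children it starts later)."""
    os.sched_setaffinity(0, cpus)


ccd0 = parse_cpuset("0-7,16-23")  # assumed V-Cache CCD layout from this article
print(f"{len(ccd0)} logical CPUs in the set")
print("12 + 4 threads fit:", fits_without_oversubscription(12, 4, ccd0))
print("14 + 6 threads fit:", fits_without_oversubscription(14, 6, ccd0))
```

Calling pin_current_process(ccd0) before starting the server with subprocess gives the child the same affinity mask, which mirrors what --cpuset-cpus does for a container.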

Peak throughput barely changed. The win was consistency: v7.1 occasionally dropped to single-digit tok/s from cache thrashing; v7.2 holds steady throughout a session.

Phase 5: 8K Batching

Dual 3090s have parallel compute that a 4K batch does not fully saturate during prefill. Moving to 8K exploits this directly.

--batch-size 8192 --ubatch-size 4096 --flash-attn on

The 4K→8K batch shift delivers +77% average prefill throughput at 95K context. Burst rate jumps to 1,638 tok/s (+309%). GPUs spend more time computing, less time waiting for CPU feeding. Generation improves +44% (4.03 tok/s) — a smaller gain because autoregressive decoding is latency-bound, not throughput-bound.

Metric	v7.2 (CCD Aligned)	v8.4	Change
Prefill (95K context)	~215 tok/s	~381 tok/s	+77%
Prefill burst	~400 tok/s	~1,638 tok/s	+309%
Generation (95K context)	~2.8 tok/s	4.03 tok/s	+44%
Draft yield	~11%	~22.7%	+106%
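Throughput percentages are easier to feel as wall-clock time. A rough sketch that converts the average prefill rates in the table into time to ingest a long prompt, assuming "95K" means 95,000 tokens and that the average rate holds for the whole prompt (both simplifications):

```python
def prefill_seconds(prompt_tokens: int, tok_per_s: float) -> float:
    """Time to ingest a prompt at a constant average prefill rate."""
    return prompt_tokens / tok_per_s


PROMPT_TOKENS = 95_000  # assumption: "95K" read as 95,000 tokens

# Average prefill rates from the table above, in tok/s.
for label, rate in (("v7.2", 215.0), ("v8.4", 381.0)):
    seconds = prefill_seconds(PROMPT_TOKENS, rate)
    print(f"{label}: {seconds:.0f} s (~{seconds / 60:.1f} min)")
```

On these figures a full 95K-token prompt goes from a little over seven minutes to a little over four, which is the practical meaning of the +77%.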

Power Limits

8K batches cause high-frequency power transients. Both GPUs draw max current during prefill. Increase the power limit to 115% (242 W per GPU) for power headroom. Without it, the GPUs hit the power cap, throttle clock speeds, and negate the benefit.

MTP vs Draft-Model Speculative Decoding

Qwen 3.6 includes native Multi-Token Prediction heads. llama.cpp added support via PR #22673 (mid-May 2026). MTP predicts multiple future tokens from the main model’s hidden states — no separate draft model needed.

--spec-type draft-mtp --spec-draft-n-max 3

Recommend 2-3 draft tokens. Above 5 produces diminishing returns. Official benchmarks show ~75% acceptance at 3 tokens, delivering 2x+ speedup.

MTP supports Tensor and Pipeline Parallelism but not yet -np > 1 (parallel decoding) or --mmproj (vision encoder). On dual GPU setups, draft-model speculation is more reliable today: the MTP draft head loads unevenly across GPUs in split mode, and prefill throughput takes a hit from host-to-device embedding transfers. Single GPU users should prefer MTP (simpler, no second model VRAM). Dual GPU users should use the Qwen3-0.6B draft model approach from Phase 3.

MTP-native models — architectures trained from scratch with multi-token prediction heads rather than retrofitting them onto existing models — are the next frontier in inference efficiency. I am benchmarking several MTP-first models (including Qwen 3 dense MTP models) on this dual 3090 rig and will publish a dedicated MTP tuning guide covering head-aware batch sizing, optimal draft depths, vLLM speculative decoding configuration, and GPU split strategies specific to native MTP architectures.

Complete Configuration

The exact config that delivered v8.4 results. Requires llama-server from a build after PR #22673.

llama-server \
  --model /models/Qwen3.6-27B-UD-Q6_K_XL.gguf \
  --spec-draft-model /models/Qwen3-0.6B.gguf \
  --ctx-size 163840 \
  --ctx-size-draft 4096 \
  --batch-size 8192 \
  --ubatch-size 4096 \
  --threads 12 \
  --threads-batch 12 \
  --draft-threads 4 \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --split-mode layer \
  --spec-type draft-model \
  --spec-draft-n-max 32 \
  --spec-draft-p-min 0.7 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --host 0.0.0.0 \
  --port 8080

With --cpuset-cpus="0-7,16-23" and 115% GPU power limits:

22.8 tok/s generation (low context) · 1,042 tok/s prefill (low context)
4.03 tok/s generation (95K context) · 381 tok/s prefill (95K context)
~22.7% draft token yield
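If you keep the configuration in a script, a few cheap sanity checks catch typos before a multi-minute model load. The sketch below stores the flags from the command above in a dict and assembles the argument list. The checks are my own assumptions about what a sane configuration looks like for this rig, not validations that llama.cpp performs, and build_argv and check_config are hypothetical helper names.

```python
import shlex

# Flags copied from the v8.4 command above.
V84_FLAGS = {
    "--model": "/models/Qwen3.6-27B-UD-Q6_K_XL.gguf",
    "--spec-draft-model": "/models/Qwen3-0.6B.gguf",
    "--ctx-size": 163840,
    "--ctx-size-draft": 4096,
    "--batch-size": 8192,
    "--ubatch-size": 4096,
    "--threads": 12,
    "--threads-batch": 12,
    "--draft-threads": 4,
    "--flash-attn": "on",
    "--cache-type-k": "q8_0",
    "--cache-type-v": "q8_0",
    "--split-mode": "layer",
    "--spec-type": "draft-model",
    "--spec-draft-n-max": 32,
    "--spec-draft-p-min": 0.7,
    "--temp": 0.6,
    "--top-p": 0.95,
    "--top-k": 20,
    "--host": "0.0.0.0",
    "--port": 8080,
}


def build_argv(binary: str, flags: dict) -> list[str]:
    """Flatten a flag dict into an argv list suitable for subprocess.run."""
    argv = [binary]
    for flag, value in flags.items():
        argv += [flag, str(value)]
    return argv


def check_config(flags: dict, pinned_cpus: int = 16) -> list[str]:
    """Return a list of problems; the rules are assumptions for this rig."""
    problems = []
    if flags["--ubatch-size"] > flags["--batch-size"]:
        problems.append("ubatch-size is larger than batch-size")
    if flags["--ctx-size-draft"] > flags["--ctx-size"]:
        problems.append("draft context is larger than main context")
    if flags["--threads"] + flags["--draft-threads"] > pinned_cpus:
        problems.append("more inference threads than pinned logical CPUs")
    return problems


print(check_config(V84_FLAGS) or "config looks consistent")
print(shlex.join(build_argv("llama-server", V84_FLAGS)))
```

The printed command line can be pasted into a shell, or the list from build_argv can be passed straight to subprocess.run.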

Limitations

Not everything responds to tuning.

Generation slows steadily as context grows. At 95K context it is already down to 2.8-4.0 tok/s, and a full 163K window is slower still. Every new token requires attention over all previous tokens, and no thread or batch tuning bypasses that.

Draft yield degrades with context length. Qwen3-0.6B accepts well on short contexts but drops significantly as context grows. A larger draft model would consume VRAM that is already scarce.

Layer split leaves one GPU idle during parts of computation. True tensor parallelism (--split-mode tensor) would be more efficient but requires F16/BF16 KV cache and NCCL — neither fits the 48 GB VRAM budget at 163K context.

Power draw is substantial. Two 3090s at 115% pull ~700 W under load (~5.6 kWh per 8-hour session). Run around the clock, that is ~16.8 kWh, or roughly £4-5/day at UK rates, plus about £1/day for the 9950X3D at 170 W.
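The energy figures are straightforward to rework for your own tariff and duty cycle. A small sketch; the wattages come from this article, while the price per kWh is a hypothetical placeholder that you should replace with your actual rate.

```python
def energy_kwh(watts: float, hours: float) -> float:
    """Energy used at a constant power draw."""
    return watts * hours / 1000.0


def cost_gbp(kwh: float, rate_gbp_per_kwh: float) -> float:
    """Cost of that energy at a flat tariff."""
    return kwh * rate_gbp_per_kwh


ASSUMED_RATE_GBP_PER_KWH = 0.27  # hypothetical tariff, not from the article

scenarios = {
    "GPUs, 8-hour session": energy_kwh(700, 8),
    "GPUs, 24 hours": energy_kwh(700, 24),
    "CPU, 24 hours": energy_kwh(170, 24),
}

for name, kwh in scenarios.items():
    cost = cost_gbp(kwh, ASSUMED_RATE_GBP_PER_KWH)
    print(f"{name}: {kwh:.2f} kWh, about £{cost:.2f}")
```

Sustained draw during real sessions will be lower than the peak figure whenever the GPUs idle between requests, so treat these as upper bounds.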

The general Qwen 3.6-27B deployment guide covers running the model at different hardware tiers. This article is the opposite: a hardware-locked tuning sequence for one engine and one GPU config. Single RTX 4090 or Apple Silicon? Use the deployment guide. Dual 3090s and want every tok/s? Start here.

For broader context: best local LLM models 2026, self-hosted LLM guide, and the $1,000 local LLM rig guide explaining why dual 3090s are the value sweet spot.

Reproducing These Results

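Apply the full v8.4 configuration as a unit rather than changing flags one at a time. Benchmark raw throughput with llama.cpp's llama-bench tool, and read the timing lines in the llama-server logs for end-to-end numbers with speculative decoding active.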

Verify on your hardware:

Low-context prefill > 1,000 tok/s
Low-context generation > 20 tok/s
Non-zero draft token acceptance (check server logs for draft tokens accepted)
VRAM < 46 GB during prefill (nvidia-smi)
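The VRAM check can be automated by polling nvidia-smi during a long prefill. A minimal sketch: the query flags are standard nvidia-smi options, the 46 GB threshold comes from the checklist above, and treating nvidia-smi's MiB readings as GB by dividing by 1024 is an approximation I am assuming is close enough for a headroom check.

```python
import subprocess

VRAM_LIMIT_GB = 46.0  # threshold from the checklist above


def parse_used_mib(text: str) -> list[int]:
    """Parse one integer per GPU from nvidia-smi's noheader,nounits output."""
    return [int(line.strip()) for line in text.splitlines() if line.strip()]


def total_used_gb(per_gpu_mib: list[int]) -> float:
    """Sum per-GPU usage; MiB / 1024 is treated as GB (approximation)."""
    return sum(per_gpu_mib) / 1024.0


def within_limit(per_gpu_mib: list[int], limit_gb: float = VRAM_LIMIT_GB) -> bool:
    return total_used_gb(per_gpu_mib) < limit_gb


def query_used_mib() -> list[int]:
    """Ask nvidia-smi for memory.used on every visible GPU (values in MiB)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    )
    return parse_used_mib(result.stdout)
```

On the machine under test, call within_limit(query_used_mib()) in a loop while a large prompt is being ingested; a single reading at idle says little about the prefill peak.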

If you hit OOM: reduce --batch-size to 4,096 first, then --ctx-size to 131,072. If speculative decoding fails to initialise, confirm that your build includes PR #22673 and that the draft model path is correct.

Full benchmark data, container images, and config files are in the article repository. The tuning methodology transfers to any hardware — the raw numbers will differ.
