What local LLM do you actually run for coding in 2026?

Qwen 3.6-27B dense on dual RTX 3090s, via llama.cpp for peak single-user performance (22.8 tok/s, 163K context) and vLLM for multi-user serving. Full tuning guide is on this site.

What cloud models do you use?

DeepSeek V4 Flash for daily complex reasoning (90% of Pro quality at lower cost), DeepSeek V4 Pro for peak reasoning, and Gemini 3.5 for heavy multimodal and broad-context work.

Do you use DeepSeek R1, Llama 4, or Gemma 4?

No. DeepSeek V4 displaced R1 for reasoning. Qwen 3.6-27B covers my local coding needs. Gemini 3.5 handles multimodal. I keep it to three models that actually earn their place in my workflow.

BEST LOCAL LLM MODELS 2026: WHICH ONE TO RUN FOR YOUR USE CASE

13/4/2026
3-minute read
428 words

Updated June 2026: Frontier models are cloud, local models handle daily coding and privacy-sensitive work. Pick the right tool for the job.

The Three I Actually Run

I keep it tight. Three models, each with a clear job.

Qwen 3.6-27B Dense — Local Coding

Released 22 April 2026, this is Alibaba’s flagship open-weight dense model. 262K native context (extensible to ~1M via YaRN), 32K output tokens, natively multimodal (text, images, video in one checkpoint). Every parameter is active — no MoE稀疏性, no quality tradeoff for speed.

Key benchmarks:

SWE-bench Verified: 77.2% — real-world coding agent tasks
AIME 2026: 94.1% — frontier reasoning
GPQA Diamond: 87.8% — graduate-level science
MMMU: 82.9% — multimodal reasoning
Terminal-Bench 2.0: 59.3% — agentic CLI tasks

Native MTP heads for speculative decoding (1.4x-2.2x speedup, ~75% acceptance at 3 draft tokens, added in llama.cpp via PR #22673 mid-May). Generates reasoning traces inside <think> blocks by default, toggleable for latency-sensitive work.

My daily coding workhorse on dual RTX 3090s. Runs on llama.cpp (tuned: 22.8 tok/s, 1,638 tok/s prefill bursts, 163K context) and vLLM (>=0.19.0, multi-user serving, OpenAI-compatible API). Full tuning report /.

Handles code generation, review, repo analysis, and anything privacy-sensitive. No rate limits, no data leaving my desk.

DeepSeek V4 Flash (and Pro) — Cloud Reasoning

V4 Flash is my default for complex analysis, architecture decisions, and anything needing strong chain-of-thought reasoning. ~90% of Pro quality at a fraction of the cost. I reach for Pro when the problem genuinely needs frontier capability.

Both deliver visible reasoning traces — useful for auditing and debugging wrong answers.

Gemini 3.5 — Cloud Multimodal

When the problem spans text, images, code, and large documents, this is the pick. Massive context windows, strong multimodal understanding, Google ecosystem integration.

My Daily Stack

Code: Qwen 3.6-27B dense via llama.cpp + vLLM (local, dual 3090)
Reasoning: DeepSeek V4 Flash (cloud API)
Multimodal & research: Gemini 3.5 (cloud API)
Peak reasoning: DeepSeek V4 Pro (cloud API, when needed)

Local for daily coding — privacy, no rate limits, fast iteration. Cloud for heavy reasoning and multimodal when the problem demands frontier capability.

The Bottom Line

Three models. Three jobs. That’s it.

Local coding: Qwen 3.6-27B dense (llama.cpp + vLLM, dual 3090)
Cloud reasoning: DeepSeek V4 Flash daily, V4 Pro for peak
Cloud multimodal: Gemini 3.5

Self-Hosted LLM Guide 2026 / — Complete setup with Ollama, llama.cpp, vLLM
Building Affordable AI Hardware for Local LLMs / — GPU recommendations
Small LLMs Are the Future / — Why smaller models are winning
Claude Outages: Why Local Matters / — The case for local failover

ai local-llm hardware comparison