Open-Source LLMs in 2026: Llama 4, Qwen 3.5, DeepSeek R1 — The Self-Hosted Scorecard
In This Article
01 The 2026 Open-Source Landscape at a Glance
02 Model Architecture Shifts: MoE vs Dense
03 Benchmarks: Where Each Model Wins
04 Hardware Reality: VRAM, Quantization, Cost-per-Token
The open-weight landscape has fundamentally shifted in 2026. Six months ago, Llama 3.1 70B was the default answer for self-hosting. Today, the tier list has inverted: Mixture-of-Experts (MoE) models like Llama 4 Scout (17B active) and Qwen 3.5 (22B active) deliver GPT-4-class reasoning on a single H100, while dense models like Qwen3-32B punch above their weight on coding benchmarks.
But benchmarks don’t deploy models — budgets and VRAM do. I’ve analyzed the 2026 flagship open models across quality, hardware requirements, licensing, and real-world cost-per-million-tokens. Here’s the scorecard for teams deciding what to run on their own iron.
The 2026 Open-Source Landscape at a Glance
Five models dominate the conversation. Three are MoE, one is dense, one is a reasoning specialist.

Architecture comparison: MoE models activate few parameters per token but need all weights in VRAM
MoE models like Llama 4 activate only ~17B params per token, but all of the weights, 109B for Scout and 400B for Maverick, must fit in VRAM. The “active params = low VRAM” intuition is wrong. INT4 quantization gets Scout onto a single 80 GB H100 (about a 65 GB footprint), but Maverick needs 4× H100 minimum.
Model Architecture Shifts: MoE vs Dense
The 2026 story is MoE dominance for flagship models, with dense holdouts for single-GPU deployments.
Why MoE Won the Flagship Tier
Llama 4, Qwen 3.5, and DeepSeek V3.2 all use Mixture-of-Experts. The math: with 17B active params out of 109B (Scout) or 400B (Maverick), each token touches only about 1/6 to 1/24 of the model, which is where the 6-24x compute efficiency at inference comes from. The VRAM cost, however, is set by the total weight count: the router picks experts per token at runtime, so every expert has to stay loaded.
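The arithmetic above can be sketched in a few lines. This is a back-of-the-envelope estimate, not a sizing tool: it counts weights only (no KV cache, activations, or framework overhead), uses decimal gigabytes, and the function names are made up for illustration. The parameter counts are the ones quoted in this article.

```python
import math


def weight_vram_gb(total_params_billion: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in decimal GB (1 GB = 1e9 bytes)."""
    total_bytes = total_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def total_to_active_ratio(total_params_billion: float, active_params_billion: float) -> float:
    """How many parameters are stored for each parameter used per token."""
    return total_params_billion / active_params_billion


def min_gpus_for_weights(vram_gb: float, gpu_capacity_gb: float = 80.0) -> int:
    """Lower bound on GPU count: weights only, ignoring all runtime overhead."""
    return math.ceil(vram_gb / gpu_capacity_gb)


# (total params, active params) in billions, as quoted in the article.
models = {"Llama 4 Scout": (109, 17), "Llama 4 Maverick": (400, 17)}

for name, (total, active) in models.items():
    ratio = total_to_active_ratio(total, active)
    print(f"{name}: {ratio:.1f}x more parameters stored than used per token")
    for label, bits in (("BF16", 16), ("FP8", 8), ("INT4", 4)):
        gb = weight_vram_gb(total, bits)
        print(f"  {label}: {gb:.1f} GB weights, >= {min_gpus_for_weights(gb)} x 80 GB GPU")
```

Scout at INT4 comes out to 54.5 GB of weights, which is why it squeezes onto one H100 once overhead is added. Maverick at INT4 is 200 GB of weights, so three 80 GB cards would be nearly full before any KV cache is allocated, which is consistent with the four-GPU minimum quoted earlier.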
The Dense Holdout: Qwen3-32B
Qwen3-32B proves dense isn’t dead. At 33 GB FP8 (or ~18 GB INT4), it fits on a single H100 80GB with room for KV cache. Its HumanEval score of 88-90% rivals models 10x its size. For coding-focused teams without multi-GPU clusters, it’s the pragmatic choice.
Benchmarks: Where Each Model Wins
Aggregate benchmarks hide specialization. Here’s the breakdown by task.

H100 cluster for DeepSeek R1/V3.2 — 8 GPUs minimum
Bottom line: No single model wins everything. Qwen 3.5 dominates reasoning and instruction following. DeepSeek R1/V3.2 owns math and SWE-bench. GPT-oss 120B leads MMLU-Pro. For coding, Qwen3-32B is the price/performance king.
Hardware Reality: VRAM, Quantization, Cost-per-Token
This is where procurement meets reality.
Qwen3-32B and Llama 4 Scout are essentially tied at ~$0.85/M tokens on a single H100. Scout wins on context (10M tokens!) and reasoning; Qwen3-32B wins on coding and licensing. DeepSeek models are 12-18x more expensive per token — only justified for math/reasoning workloads where they lead.
Running on H100 spot instances (~$1.03/hr) instead of on-demand (~$2.50/hr) cuts batch cost-per-token by roughly 60%, since throughput per GPU is the same either way. For async workloads (batch inference, eval runs), spot is a no-brainer. For latency-sensitive serving, reserve capacity.
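To make the cost-per-token math concrete, here is a minimal sketch. The hourly prices are the ones quoted above; the 800 tokens/s aggregate throughput is purely an assumption for illustration (the real figure depends on model, batch size, and serving stack), and the function names are hypothetical.

```python
def cost_per_million_tokens(gpu_hourly_usd: float,
                            tokens_per_second: float,
                            num_gpus: int = 1) -> float:
    """USD per one million generated tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd * num_gpus / tokens_per_hour * 1_000_000


def spot_savings(on_demand_hourly_usd: float, spot_hourly_usd: float) -> float:
    """Fractional saving from spot pricing, assuming identical throughput."""
    return 1 - spot_hourly_usd / on_demand_hourly_usd


ASSUMED_TOKENS_PER_SECOND = 800  # assumption, not a measured number

on_demand = cost_per_million_tokens(2.50, ASSUMED_TOKENS_PER_SECOND)
spot = cost_per_million_tokens(1.03, ASSUMED_TOKENS_PER_SECOND)

print(f"on-demand: ${on_demand:.2f} per million tokens")
print(f"spot:      ${spot:.2f} per million tokens")
print(f"saving:    {spot_savings(2.50, 1.03):.1%}")
```

Under that throughput assumption, on-demand lands near the ~$0.85/M figure above. The saving itself does not depend on throughput at all: it is just one minus the price ratio, which is why it applies equally to every model in the comparison.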
Licensing: What “Open” Actually Means
Not all open weights are created equal.
For enterprise deployments: Apache 2.0 (Qwen) and MIT (DeepSeek) are the safest — no usage restrictions, no attribution burdens, no competing-service clauses. Llama 4’s Community License adds legal review overhead for any customer-facing product.
The Enterprise Decision Matrix
Match your constraints to the right model.
FAQ
Can I really run Llama 4 Scout on one H100?
Yes, at INT4 quantization (~55-65 GB). You need ~65-70 GB total with KV cache overhead for 8K context. For 10M context, you’ll hit VRAM limits — the 10M claim assumes expert offloading or multi-node. For practical single-GPU: 8K-32K context is realistic.
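A rough way to see why context length is the limiting factor is the standard KV cache estimate: two tensors (keys and values) per layer, each sized by KV heads × head dimension × sequence length. The sketch below uses that formula with a hypothetical architecture (48 layers, 8 KV heads, head dimension 128, FP16 cache). Those numbers are placeholders, not Scout's published configuration, so treat the outputs as orders of magnitude only.

```python
def kv_cache_gb(seq_len: int,
                num_layers: int,
                num_kv_heads: int,
                head_dim: int,
                bytes_per_element: int = 2,
                batch_size: int = 1) -> float:
    """KV cache size in decimal GB for a standard attention stack."""
    # Factor of 2 covers the separate key and value tensors in every layer.
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return elements * bytes_per_element / 1e9


# Hypothetical architecture, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

for context in (8_192, 32_768, 10_000_000):
    gb = kv_cache_gb(context, LAYERS, KV_HEADS, HEAD_DIM)
    print(f"{context:>10,} tokens: {gb:,.1f} GB of KV cache per sequence")
```

With these placeholder values, an 8K sequence needs under 2 GB of cache, while a 10M-token sequence needs far more than a single 80 GB card can hold. Architectures that use chunked or sliding-window attention in some layers will need less than this formula suggests, but the linear growth with context length is the point.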
Is Qwen3-32B actually better at coding than Llama 4 Maverick?
On HumanEval, Qwen3-32B reports 88-90%, versus ~60-61% for Llama 4 Scout and 82-86% for Maverick. But benchmark methodology varies. If coding is your primary workload, verify on your own eval set. The dense architecture may also give Qwen3-32B more consistent token-by-token reasoning for code, since every parameter participates in every token.
What about GPT-oss 20B/120B?
OpenAI’s open releases are strong — GPT-oss 120B hits ~90% MMLU-Pro and 85% GPQA, rivaling Qwen 3.5. But the custom license adds compliance overhead. For teams already on OpenAI APIs, it’s a natural bridge. For pure self-hosting, Qwen/DeepSeek have cleaner licenses.
How much does quantization actually hurt quality?
INT4 typically costs 0.5-1.5% on MMLU/GPQA vs FP16. INT8 is nearly lossless. The bigger hit is on long-context tasks where lower precision accumulates error. For 10M context on Scout, some teams stay at FP8/BF16 on multi-GPU. For single-GPU 8K context, INT4 is fine.
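For intuition about why INT8 is close to lossless while INT4 is not, the toy experiment below quantizes random weights and measures the round-trip error. It is a simplified sketch: it uses one symmetric scale for the whole tensor and synthetic Gaussian weights, whereas production schemes typically use per-group scales and calibration data. It measures weight rounding error, not benchmark accuracy.

```python
import random


def quantize_roundtrip(values, bits):
    """Symmetric per-tensor quantization to signed ints, then dequantize."""
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4, 127 for INT8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs else 1.0
    restored = []
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))  # snap to the integer grid
        restored.append(q * scale)
    return restored


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


rng = random.Random(0)  # fixed seed so runs are repeatable
weights = [rng.gauss(0.0, 0.02) for _ in range(10_000)]  # synthetic weights

err_int4 = mean_abs_error(weights, quantize_roundtrip(weights, 4))
err_int8 = mean_abs_error(weights, quantize_roundtrip(weights, 8))

print(f"INT4 mean abs error: {err_int4:.6f}")
print(f"INT8 mean abs error: {err_int8:.6f}")
print(f"INT4 error is {err_int4 / err_int8:.1f}x the INT8 error")
```

INT4 has only 15 usable levels in this scheme against 255 for INT8, so each weight lands much further from its original value. That per-weight error is small in isolation, which fits the modest benchmark drop above, but it is also the error that can compound over very long sequences.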