Maya Chen

AI & The Future · June 20, 2026

Open-Source LLMs in 2026: Llama 4, Qwen 3.5, DeepSeek R1 — The Self-Hosted Scorecard

In This Article

01 The 2026 Open-Source Landscape at a Glance

02 Model Architecture Shifts: MoE vs Dense

03 Benchmarks: Where Each Model Wins

04 Hardware Reality: VRAM, Quantization, Cost-per-Token

05 Licensing: What “Open” Actually Means

06 The Enterprise Decision Matrix

The open-weight landscape has shifted in 2026. Six months ago, Llama 3.1 70B was the default answer for self-hosting. Today, the tier list has inverted: Mixture-of-Experts (MoE) models like Llama 4 Scout (17B active) and Qwen 3.5 deliver GPT-4-class reasoning, with Scout fitting on a single H100 at INT4, while dense models like Qwen3-32B punch above their weight on coding benchmarks.

But benchmarks don’t deploy models — budgets and VRAM do. I’ve analyzed the 2026 flagship open models across quality, hardware requirements, licensing, and real-world cost-per-million-tokens. Here’s the scorecard for teams deciding what to run on their own iron.

The 2026 Open-Source Landscape at a Glance

Five models dominate the conversation. Four are MoE and one is dense; the DeepSeek entry doubles as the reasoning specialist.

Model	Total / Active Params	Architecture	Context Window	License	Min VRAM (INT4)
Llama 4 Scout	109B / 17B	MoE	10M	Llama 4 Community	~55-65 GB
Llama 4 Maverick	400B / 17B	MoE	1M	Llama 4 Community	~200-243 GB
Qwen 3.5	397B / varies	MoE	131K+	Apache 2.0	~200 GB
Qwen3-32B	32.8B / 32.8B	Dense	131K (YaRN)	Apache 2.0	~18 GB (~33 GB at FP8; single H100)
DeepSeek R1 / V3.2	671B / 37B (R1); 685B total (V3.2)	MoE	128K / 130K	MIT	~351 GB (8× H100)

Architecture comparison: MoE models activate few parameters per token but need all weights in VRAM

Key Insight: Active vs Total Parameters

MoE models like Llama 4 only activate ~17B params per token, but all 109-400B weights must fit in VRAM. The “active params = low VRAM” intuition is wrong. INT4 quantization gets Scout onto 1× H100 (65 GB), but Maverick needs 4× H100 minimum.
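
As a rough check on the point above, the weight footprint can be estimated from total (not active) parameters and bits per weight. This is a minimal sketch: the function name is made up for illustration, and it counts weights only, so KV cache, activations, and quantization metadata (which account for the upper ends of the ranges in the table) are left out.

```python
def weight_footprint_gb(total_params_billion: float, bits_per_weight: int) -> float:
    """Approximate GB needed just to hold the weights (no KV cache, no activations)."""
    # 1e9 params * (bits / 8) bytes per param = that many GB (decimal)
    return total_params_billion * bits_per_weight / 8


# Total parameter counts (in billions) taken from the table above.
models = {
    "Llama 4 Scout": 109,
    "Llama 4 Maverick": 400,
    "Qwen3-32B": 32.8,
    "DeepSeek R1": 671,
}

for name, params_b in models.items():
    fp16 = weight_footprint_gb(params_b, 16)
    int4 = weight_footprint_gb(params_b, 4)
    print(f"{name}: FP16 ~{fp16:.0f} GB, INT4 ~{int4:.1f} GB")
```

Note that the 17B active-parameter figure never enters the calculation: only the total count determines what has to sit in VRAM.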

Model Architecture Shifts: MoE vs Dense

The 2026 story is MoE dominance for flagship models, with dense holdouts for single-GPU deployments.

Why MoE Won the Flagship Tier

Llama 4, Qwen 3.5, and DeepSeek V3.2 all use Mixture-of-Experts. The math: with 17B active params out of 109B (Scout) or 400B (Maverick), each token runs through roughly 1/6 to 1/24 of the weights, so per-token compute drops by the same factor. VRAM does not drop: the router picks experts per token at runtime, so every expert has to stay loaded.
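
A toy router makes the "few active, all loaded" point concrete. This is a sketch under stated assumptions, not any model's actual routing code: the expert count, the top-k value, and the random scores standing in for a learned gating network are all hypothetical.

```python
import random

NUM_EXPERTS = 16  # hypothetical expert count, for illustration only
TOP_K = 1         # hypothetical number of experts activated per token


def route(router_scores: list[float], k: int = TOP_K) -> list[int]:
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(router_scores)), key=lambda i: router_scores[i], reverse=True)
    return ranked[:k]


rng = random.Random(0)

# Every expert must be resident in memory, because any token may select any expert.
resident_experts = set(range(NUM_EXPERTS))

used = set()
for _ in range(1000):
    # Random scores stand in for the output of a learned gating network.
    scores = [rng.random() for _ in range(NUM_EXPERTS)]
    used.update(route(scores))

print(f"experts active per token: {TOP_K} of {NUM_EXPERTS}")
print(f"distinct experts used across 1000 tokens: {len(used)}")
```

Each token touches only `TOP_K` experts, but over a stream of tokens the set of experts that get used grows toward all of them, which is why none can be evicted from VRAM without an offloading scheme.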

The Dense Holdout: Qwen3-32B

Qwen3-32B proves dense isn’t dead. At 33 GB FP8 (or ~18 GB INT4), it fits on a single H100 80GB with room for KV cache. Its HumanEval score of 88-90% rivals models 10x its size. For coding-focused teams without multi-GPU clusters, it’s the pragmatic choice.
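
To see what "room for KV cache" means in numbers, here is a minimal sizing sketch. The layer count, KV-head count, and head dimension are assumptions about a Qwen3-32B-like shape and should be checked against the model's published config; the cache is sized for a single sequence at 16-bit precision.

```python
def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_value: int = 2) -> float:
    """KV cache size in GB: keys and values (the factor 2) for every layer and token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 1e9


# Assumed model shape (not taken from the article); verify before relying on it.
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128
WEIGHTS_GB_FP8 = 33  # FP8 weight footprint quoted in the article
GPU_GB = 80          # H100 80GB

for context in (8_192, 32_768, 131_072):
    cache = kv_cache_gb(context, LAYERS, KV_HEADS, HEAD_DIM)
    total = WEIGHTS_GB_FP8 + cache
    print(f"{context:>7} tokens: KV ~{cache:.1f} GB, total ~{total:.1f} GB, fits={total < GPU_GB}")
```

The cache scales linearly with both context length and the number of concurrent sequences, so batch serving uses up the headroom much faster than a single long prompt does.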

Benchmarks: Where Each Model Wins

Aggregate benchmarks hide specialization. Here’s the breakdown by task.

Benchmark	Best Model	Score	Runner-Up	Score
MMLU-Pro (Advanced Knowledge)	GPT-oss 120B	~90%	Qwen 3.5	~88%
GPQA Diamond (PhD Science)	Qwen 3.5	88.4%	Kimi K2.5	87.6%
HumanEval (Coding)	Qwen3-235B-A22B	90.3% (unverified)	Qwen3-32B	88.0%
IFEval (Instruction Following)	Kimi K2.5	94.0%	Qwen 3.5	92.6%
Chatbot Arena Elo (Human Pref)	GLM-5	1454	Qwen 3.5	1450
SWE-bench Verified (Software Eng)	Devstral-2-123B	72.2%	DeepSeek V3.2	67.8%
AIME 2025 (Math Olympiad)	DeepSeek V3.2	89.3%	DeepSeek R1	87.5%

H100 cluster for DeepSeek R1/V3.2 — 8 GPUs minimum

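Bottom line: No single model wins everything. Qwen 3.5 leads GPQA Diamond and is runner-up on MMLU-Pro, IFEval, and Chatbot Arena. DeepSeek V3.2 and R1 post the two highest AIME 2025 scores, and V3.2 shares the top two SWE-bench Verified spots with Devstral-2-123B. GPT-oss 120B leads MMLU-Pro. For coding, Qwen3-32B is the price/performance king.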

Hardware Reality: VRAM, Quantization, Cost-per-Token

This is where procurement meets reality.

Model	FP16 VRAM	INT4 VRAM	Min GPUs (H100 80GB)	Cost/1M Tokens (H100 at ~$2.50/GPU-hr)	Throughput (tok/s)
Qwen3-32B	~66 GB	~18 GB	1× H100	~$0.82	~850
Llama 4 Scout	~218 GB	~55-65 GB	1× H100	~$0.87	~800
Llama 4 Maverick	~800 GB	~200-243 GB	4× H100	~$2.31	~1,200
DeepSeek V3.2 / R1	~1.37 TB	~351 GB	8× H100	~$9-14	~400-600

Qwen3-32B and Llama 4 Scout are essentially tied at ~$0.85/M tokens on a single H100. Scout wins on context (10M tokens!) and reasoning; Qwen3-32B wins on coding and licensing. DeepSeek models are roughly 10-17x more expensive per token, which is only justified for math/reasoning workloads where they lead.
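
The cost column can be reproduced from GPU count, hourly rate, and throughput. A minimal sketch follows; the function name is illustrative, and the rate is an assumption: the table's figures match a price of about $2.50 per GPU-hour (the on-demand rate quoted in this article), not the ~$1.03 spot rate.

```python
def cost_per_million_tokens(gpus: int, usd_per_gpu_hour: float, tokens_per_second: float) -> float:
    """USD to generate one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpus * usd_per_gpu_hour / tokens_per_hour * 1_000_000


RATE = 2.50  # assumed USD per GPU-hour; swap in your own contract or spot price

# (model, GPU count, throughput in tok/s) taken from the table above
rows = [
    ("Qwen3-32B", 1, 850),
    ("Llama 4 Scout", 1, 800),
    ("Llama 4 Maverick", 4, 1200),
    ("DeepSeek V3.2 / R1 (low throughput)", 8, 400),
    ("DeepSeek V3.2 / R1 (high throughput)", 8, 600),
]

for name, gpus, tok_s in rows:
    print(f"{name}: ${cost_per_million_tokens(gpus, RATE, tok_s):.2f} per 1M tokens")
```

Because cost scales linearly with the hourly rate, rerunning the same rows with `RATE = 1.03` gives the spot-priced equivalents directly.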

Spot vs On-Demand Changes Everything

H100 spot pricing (~$1.03/hr) vs on-demand (~$2.50/hr) cuts batch cost-per-token by about 59%. For async workloads (batch inference, eval runs), spot is a no-brainer. For latency-sensitive serving, reserve capacity.

Licensing: What “Open” Actually Means

Not all open weights are created equal.

Model	License	Commercial Use	Modification	Redistribution	Caveats
Qwen 3.5 / Qwen3	Apache 2.0	✅	✅	✅	Most permissive
DeepSeek R1 / V3.2	MIT	✅	✅	✅	Very permissive
Llama 4 Scout / Maverick	Llama 4 Community	✅ (with limits)	✅	⚠️ Restricted	No competing services; 700M+ MAU requires license
GPT-oss 20B/120B	Apache 2.0 + usage policy	✅	✅	✅	Separate gpt-oss usage policy applies

For enterprise deployments: Apache 2.0 (Qwen) and MIT (DeepSeek) are the safest, with no usage restrictions, no competing-service clauses, and no obligations beyond keeping the license and copyright notices with redistributed copies. Llama 4’s Community License adds legal review overhead for any customer-facing product.

The Enterprise Decision Matrix

Match your constraints to the right model.

If Your Constraint Is…	Deploy This Model	Why
Single H100 budget, need coding	Qwen3-32B	Best HumanEval per dollar, Apache 2.0, fits 80GB with KV headroom
Single H100, need long context	Llama 4 Scout	10M context window, fits INT4 on 80GB
4× H100 cluster, balanced quality	Llama 4 Maverick	Higher quality than Scout with 1M context, better throughput
8× H100, math/reasoning priority	DeepSeek R1 / V3.2	Top AIME 2025 scores, strong SWE-bench; MIT license
Legal team says “no custom licenses”	Qwen 3.5 / DeepSeek	Apache 2.0 / MIT — zero restrictions
Need instruction-following agents	Qwen 3.5 / Kimi K2.5	Top IFEval scores (92-94%)
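
The hardware-bound rows of the matrix can be encoded as a small lookup for use in a sizing script. This is only a sketch: the priority labels and the `recommend` helper are hypothetical names, and the licensing and agent rows are left out because they are not tied to a GPU count.

```python
from typing import Optional

# (minimum H100 80GB GPUs, priority label, model) for the hardware-bound rows above.
MATRIX = [
    (1, "coding", "Qwen3-32B"),
    (1, "long_context", "Llama 4 Scout"),
    (4, "balanced", "Llama 4 Maverick"),
    (8, "math_reasoning", "DeepSeek R1 / V3.2"),
]


def recommend(available_gpus: int, priority: str) -> Optional[str]:
    """Return the matrix pick for a priority if the cluster is large enough, else None."""
    for min_gpus, row_priority, model in MATRIX:
        if row_priority == priority and available_gpus >= min_gpus:
            return model
    return None


print(recommend(1, "coding"))          # Qwen3-32B
print(recommend(2, "math_reasoning"))  # None: the matrix asks for 8 GPUs
```

Returning `None` rather than a fallback model keeps the "you do not have the hardware for this priority" case explicit.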

FAQ

Can I really run Llama 4 Scout on one H100?

Yes, at INT4 quantization (~55-65 GB). You need ~65-70 GB total with KV cache overhead for 8K context. For 10M context, you’ll hit VRAM limits — the 10M claim assumes expert offloading or multi-node. For practical single-GPU: 8K-32K context is realistic.

Is Qwen3-32B actually better at coding than Llama 4 Maverick?

On HumanEval, Qwen3-32B reports 88-90% vs ~82-86% for Maverick (and ~60-61% for Scout). But benchmark methodology varies. If coding is your primary workload, verify on your eval set. One possible factor is that a dense model runs every token through the same weights, though these scores don’t isolate that effect.

What about GPT-oss 20B/120B?

OpenAI’s open releases are strong: GPT-oss 120B hits ~90% MMLU-Pro and 85% GPQA, rivaling Qwen 3.5. The weights are released under Apache 2.0 with an accompanying usage policy, so have legal read that policy alongside the license. For teams already on OpenAI APIs, it’s a natural bridge. For pure self-hosting, Qwen and DeepSeek carry a plain license with no extra policy to review.

How much does quantization actually hurt quality?

INT4 typically costs 0.5-1.5% on MMLU/GPQA vs FP16. INT8 is nearly lossless. The bigger hit is on long-context tasks where lower precision accumulates error. For 10M context on Scout, some teams stay at FP8/BF16 on multi-GPU. For single-GPU 8K context, INT4 is fine.
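
The gap between INT8 and INT4 can be illustrated with a round-trip through a simple quantizer. This sketch uses per-tensor symmetric absmax quantization on made-up weights; production INT4 methods are more sophisticated (for example, group-wise scales and calibration), so treat it as intuition for why fewer bits means more rounding error, not as a predictor of benchmark loss.

```python
def quantize_roundtrip(values: list[float], bits: int) -> list[float]:
    """Symmetric absmax quantization to signed `bits`-bit integers, then dequantize."""
    qmax = 2 ** (bits - 1) - 1                           # 7 for INT4, 127 for INT8
    scale = (max(abs(v) for v in values) or 1.0) / qmax  # guard against all-zero input
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return [q * scale for q in quantized]


def max_abs_error(values: list[float], bits: int) -> float:
    """Largest absolute difference between original and round-tripped values."""
    restored = quantize_roundtrip(values, bits)
    return max(abs(a - b) for a, b in zip(values, restored))


# Made-up weights for illustration only.
weights = [0.82, -0.11, 0.03, -0.57, 0.40, -0.95, 0.26, 0.08]

for bits in (8, 4):
    print(f"INT{bits}: max abs error {max_abs_error(weights, bits):.4f}")
```

With one scale per tensor, the worst-case rounding error is half a quantization step, and an INT4 step is about 18x larger than an INT8 step (127 levels vs 7 on each side of zero).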

