Open-Source LLMs in 2026: Llama 4, Qwen 3.5, DeepSeek R1 — The Self-Hosted Scorecard

Maya Chen
AI & The Future · June 20, 2026

5 Flagship Models Compared
MMLU-Pro / GPQA / HumanEval
VRAM & Cost Analysis
Apache 2.0 vs MIT vs Custom

The open-weight landscape has fundamentally shifted in 2026. Six months ago, Llama 3.1 70B was the default answer for self-hosting. Today, the tier list has inverted: Mixture-of-Experts (MoE) models like Llama 4 Scout (17B active) and Qwen 3.5 (22B active) deliver GPT-4-class reasoning, with Scout fitting on a single H100 at INT4, while dense models like Qwen3-32B punch above their weight on coding benchmarks.

But benchmarks don’t deploy models — budgets and VRAM do. I’ve analyzed the 2026 flagship open models across quality, hardware requirements, licensing, and real-world cost-per-million-tokens. Here’s the scorecard for teams deciding what to run on their own iron.

The 2026 Open-Source Landscape at a Glance

Five models dominate the conversation. Four are MoE (one of them a reasoning specialist) and one is dense.

| Model | Total / Active Params | Architecture | Context Window | License | Min VRAM (INT4) |
|---|---|---|---|---|---|
| Llama 4 Scout | 109B / 17B | MoE | 10M | Llama 4 Community | ~55-65 GB |
| Llama 4 Maverick | 400B / 17B | MoE | 1M | Llama 4 Community | ~200-243 GB |
| Qwen 3.5 | 397B / varies | MoE | 131K+ | Apache 2.0 | ~200 GB |
| Qwen3-32B | 32.8B / 32.8B | Dense | 131K (YaRN) | Apache 2.0 | ~18 GB (~33 GB at FP8; single H100) |
| DeepSeek R1 / V3.2 | 671B / 37B (R1); 685B total (V3.2) | MoE | 128K / 130K | MIT | ~351 GB (8× H100) |

Figure: Architecture comparison of Llama 4, Qwen 3.5, and DeepSeek. MoE models activate few parameters per token but need all weights in VRAM.

Key Insight: Active vs Total Parameters

MoE models like Llama 4 only activate ~17B params per token, but all 109-400B weights must fit in VRAM. The “active params = low VRAM” intuition is wrong. INT4 quantization gets Scout onto 1× H100 (65 GB), but Maverick needs 4× H100 minimum.
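The weight-memory arithmetic behind that insight can be sketched in a few lines. This is a rough estimate that counts weights only (total parameters times bits per weight); the helper names `weight_gb` and `min_gpus` are illustrative, and KV cache, activations, and quantization metadata are deliberately left out.

```python
import math

def weight_gb(total_params_billion, bits_per_weight):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params_billion * bits_per_weight / 8

def min_gpus(required_gb, gpu_gb=80):
    """Smallest GPU count whose combined memory covers required_gb."""
    return math.ceil(required_gb / gpu_gb)

# Total parameter counts (billions) as quoted in this article.
models = {"Llama 4 Scout": 109, "Llama 4 Maverick": 400, "Qwen3-32B": 32.8}

for name, params_b in models.items():
    fp16 = weight_gb(params_b, 16)
    int4 = weight_gb(params_b, 4)
    print(f"{name}: FP16 ~{fp16:.0f} GB, INT4 ~{int4:.1f} GB, "
          f"INT4 GPUs (weights only): {min_gpus(int4)}")
```

Note that the active parameter count never enters the calculation. The weights-only result is a lower bound: `min_gpus(243)` gives 4 for the upper end of Maverick's INT4 range, and real deployments also need headroom for the KV cache.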

Model Architecture Shifts: MoE vs Dense

The 2026 story is MoE dominance for flagship models, with dense holdouts for single-GPU deployments.

Why MoE Won the Flagship Tier

Llama 4, Qwen 3.5, and DeepSeek V3.2 all use Mixture-of-Experts. The math: with 17B active params out of 109B (Scout) or 400B (Maverick), only about 1/6 to 1/24 of the weights do work on any given token, so per-token compute is roughly 6-24x lower than a dense model of the same total size. But VRAM still has to hold every weight: routing picks experts at runtime, so all experts stay loaded.
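As a quick check on those ratios, here is a minimal sketch. It assumes the common rule of thumb that a forward pass costs about 2 FLOPs per active parameter per token, which ignores attention and routing overhead; `moe_profile` is an illustrative name.

```python
def moe_profile(total_b, active_b):
    """Return (total/active ratio, approximate forward FLOPs per token)."""
    ratio = total_b / active_b
    flops_per_token = 2 * active_b * 1e9  # rule of thumb: ~2 FLOPs per active param
    return ratio, flops_per_token

for name, total_b, active_b in [("Scout", 109, 17), ("Maverick", 400, 17)]:
    ratio, flops = moe_profile(total_b, active_b)
    print(f"{name}: {ratio:.1f}x more weights stored than used per token, "
          f"~{flops:.1e} FLOPs/token")
```

Both models land on the same per-token compute estimate because they share the same active size; only the memory bill differs.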

The Dense Holdout: Qwen3-32B

Qwen3-32B proves dense isn’t dead. At 33 GB FP8 (or ~18 GB INT4), it fits on a single H100 80GB with room for KV cache. Its HumanEval score of 88-90% rivals models 10x its size. For coding-focused teams without multi-GPU clusters, it’s the pragmatic choice.

Benchmarks: Where Each Model Wins

Aggregate benchmarks hide specialization. Here’s the breakdown by task.

| Benchmark | Best Model | Score | Runner-Up | Score |
|---|---|---|---|---|
| MMLU-Pro (Advanced Knowledge) | GPT-oss 120B | ~90% | Qwen 3.5 | ~88% |
| GPQA Diamond (PhD Science) | Qwen 3.5 | 88.4% | Kimi K2.5 | 87.6% |
| HumanEval (Coding) | Qwen3-235B-A22B | 90.3% (unverified) | Qwen3-32B | 88.0% |
| IFEval (Instruction Following) | Kimi K2.5 | 94.0% | Qwen 3.5 | 92.6% |
| Chatbot Arena Elo (Human Pref) | GLM-5 | 1454 | Qwen 3.5 | 1450 |
| SWE-bench Verified (Software Eng) | Devstral-2-123B | 72.2% | DeepSeek V3.2 | 67.8% |
| AIME 2025 (Math Olympiad) | DeepSeek V3.2 | 89.3% | DeepSeek R1 | 87.5% |

Figure: H100 cluster for DeepSeek R1/V3.2 (8 GPUs minimum).

Bottom line: No single model wins everything. Qwen 3.5 leads GPQA Diamond and places second on MMLU-Pro, IFEval, and Chatbot Arena. The DeepSeek models take the top two AIME 2025 spots and post a strong SWE-bench Verified score. GPT-oss 120B leads MMLU-Pro. For coding, Qwen3-32B is the price/performance king.

Hardware Reality: VRAM, Quantization, Cost-per-Token

This is where procurement meets reality.

| Model | FP16 VRAM | INT4 VRAM | Min GPUs (H100 80GB) | Cost/1M Tokens (H100 on-demand, ~$2.50/GPU-hr) | Throughput (tok/s) |
|---|---|---|---|---|---|
| Qwen3-32B | ~66 GB | ~18 GB | 1× H100 | ~$0.82 | ~850 |
| Llama 4 Scout | ~218 GB | ~55-65 GB | 1× H100 | ~$0.87 | ~800 |
| Llama 4 Maverick | ~800 GB | ~200-243 GB | 4× H100 | ~$2.31 | ~1,200 |
| DeepSeek V3.2 / R1 | ~1.37 TB | ~351 GB | 8× H100 | ~$9-14 | ~400-600 |

Qwen3-32B and Llama 4 Scout are essentially tied at ~$0.85/M tokens on a single H100. Scout wins on context (10M tokens!) and reasoning; Qwen3-32B wins on coding and licensing. DeepSeek models are roughly 11-16x more expensive per token ($9-14 vs ~$0.85), which is only justified for the math and reasoning workloads where they lead.
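The cost column can be reproduced from GPU count, hourly price, and throughput. The sketch below assumes the throughput figures are aggregate tokens per second for the whole deployment and that the GPUs are fully utilized; `cost_per_million` is an illustrative helper. With those assumptions, the table's figures match a rate of about $2.50 per GPU-hour.

```python
def cost_per_million(gpu_hourly_usd, n_gpus, tokens_per_second):
    """USD per 1M generated tokens at full utilization."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd * n_gpus / tokens_per_hour * 1_000_000

# (model, GPUs, aggregate tok/s) taken from the table above.
deployments = [
    ("Qwen3-32B", 1, 850),
    ("Llama 4 Scout", 1, 800),
    ("Llama 4 Maverick", 4, 1200),
]

for name, n_gpus, tps in deployments:
    print(f"{name}: ${cost_per_million(2.50, n_gpus, tps):.2f} per 1M tokens")
```

Swapping in a different hourly rate or a measured throughput is the whole point: at $1.03 per hour, the same Qwen3-32B deployment works out to roughly $0.34 per million tokens.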

Spot vs On-Demand Changes Everything

H100 spot pricing (~$1.03/hr) vs on-demand (~$2.50/hr) cuts batch cost-per-token by roughly 60% (1 - 1.03/2.50 ≈ 59%). For async workloads (batch inference, eval runs), spot is a no-brainer. For latency-sensitive serving, reserve capacity.
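Spot instances can be preempted, so some work gets redone. A minimal sketch of how that erodes the headline saving follows; the `rework_fraction` parameter (extra GPU time spent re-running interrupted jobs) is a hypothetical knob, not a measured figure.

```python
def spot_savings(spot_hourly, on_demand_hourly, rework_fraction=0.0):
    """Fractional saving of spot over on-demand, after paying for redone work."""
    effective_spot = spot_hourly * (1 + rework_fraction)
    return 1 - effective_spot / on_demand_hourly

print(f"No preemptions: {spot_savings(1.03, 2.50):.1%}")
print(f"15% rework (hypothetical): {spot_savings(1.03, 2.50, 0.15):.1%}")
```

Even with a pessimistic rework assumption, spot stays well ahead for batch jobs that checkpoint regularly.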

Licensing: What “Open” Actually Means

Not all open weights are created equal.

| Model | License | Commercial Use | Modification | Redistribution | Caveats |
|---|---|---|---|---|---|
| Qwen 3.5 / Qwen3 | Apache 2.0 | ✅ | ✅ | ✅ | Most permissive |
| DeepSeek R1 / V3.2 | MIT | ✅ | ✅ | ✅ | Very permissive |
| Llama 4 Scout / Maverick | Llama 4 Community | ✅ (with limits) | | ⚠️ Restricted | No competing services; 700M+ MAU requires license |
| GPT-oss 20B/120B | OpenAI Custom | ⚠️ | | | No military; attribution required |

For enterprise deployments: Apache 2.0 (Qwen) and MIT (DeepSeek) are the safest. Neither imposes usage restrictions or competing-service clauses; the only obligation is to keep the license and copyright notices when you redistribute. Llama 4’s Community License adds legal review overhead for any customer-facing product.

The Enterprise Decision Matrix

Match your constraints to the right model.

| If Your Constraint Is… | Deploy This Model | Why |
|---|---|---|
| Single H100 budget, need coding | Qwen3-32B | Best HumanEval per dollar, Apache 2.0, fits 80GB with KV headroom |
| Single H100, need long context | Llama 4 Scout | 10M context window, fits INT4 on 80GB |
| 4× H100 cluster, balanced quality | Llama 4 Maverick | Near-Scout quality with 1M context, better throughput |
| 8× H100, math/reasoning priority | DeepSeek R1 / V3.2 | Top AIME 2025 scores and strong SWE-bench results; MIT license |
| Legal team says “no custom licenses” | Qwen 3.5 / DeepSeek | Apache 2.0 / MIT: standard permissive licenses with no usage restrictions |
| Need instruction-following agents | Qwen 3.5 / Kimi K2.5 | Top IFEval scores (92-94%) |

FAQ

Can I really run Llama 4 Scout on one H100?

Yes, at INT4 quantization (~55-65 GB). You need ~65-70 GB total with KV cache overhead for 8K context. For 10M context, you’ll hit VRAM limits — the 10M claim assumes expert offloading or multi-node. For practical single-GPU: 8K-32K context is realistic.
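To see why context length dominates, here is a back-of-the-envelope KV cache estimate. The layer count, KV head count, and head dimension below are hypothetical values chosen for illustration; they are not Llama 4 Scout's published configuration. The formula also assumes plain full attention with an FP16 cache, with no chunked or sliding-window savings.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size for one sequence: keys and values for every layer and token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2 = K and V
    return seq_len * per_token

# Hypothetical grouped-query-attention shape, for illustration only.
CONFIG = dict(n_layers=48, n_kv_heads=8, head_dim=128)

for tokens in (8_192, 32_768, 10_000_000):
    gb = kv_cache_bytes(tokens, **CONFIG) / 1e9
    print(f"{tokens:>10,} tokens -> {gb:,.1f} GB of KV cache")
```

Under these assumptions the cache grows linearly with context: a couple of GB at 8K, but far more than a single 80 GB card can hold at 10M tokens.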

Is Qwen3-32B actually better at coding than Llama 4 Maverick?

On HumanEval, Qwen3-32B reports 88-90% vs Llama 4’s ~60-61% (Scout) and 82-86% (Maverick). But benchmark methodology varies. If coding is your primary workload, verify on your own eval set. The dense architecture may also give Qwen3-32B more consistent token-by-token reasoning for code, though that is hard to isolate from training differences.

What about GPT-oss 20B/120B?

OpenAI’s open releases are strong — GPT-oss 120B hits ~90% MMLU-Pro and 85% GPQA, rivaling Qwen 3.5. But the custom license adds compliance overhead. For teams already on OpenAI APIs, it’s a natural bridge. For pure self-hosting, Qwen/DeepSeek have cleaner licenses.

How much does quantization actually hurt quality?

INT4 typically costs 0.5-1.5% on MMLU/GPQA vs FP16. INT8 is nearly lossless. The bigger hit is on long-context tasks where lower precision accumulates error. For 10M context on Scout, some teams stay at FP8/BF16 on multi-GPU. For single-GPU 8K context, INT4 is fine.
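For intuition on why INT8 is close to lossless while INT4 is not, the sketch below round-trips a handful of made-up weights through simple symmetric per-tensor quantization. This is a deliberately simplified stand-in: production INT4 schemes typically use per-group scales and calibration, so treat the numbers as illustrative only.

```python
def quantize_roundtrip(values, bits):
    """Symmetric per-tensor quantization: scale to signed ints, round, scale back."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for INT4, 127 for INT8
    scale = max(abs(v) for v in values) / qmax
    if scale == 0:                                # all-zero input: nothing to quantize
        return list(values)
    return [round(v / scale) * scale for v in values]

def mean_abs_error(values, bits):
    """Average absolute difference between original and round-tripped values."""
    restored = quantize_roundtrip(values, bits)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)

# Made-up weights, for illustration only.
weights = [0.82, -0.41, 0.07, -0.93, 0.15, 0.56, -0.28, 0.33]

for bits in (8, 4):
    print(f"INT{bits}: mean abs error {mean_abs_error(weights, bits):.4f}")
```

The per-weight error is much larger at 4 bits; how much of that reaches benchmark scores depends on the model and the quantization method, which is why measuring on your own eval set matters.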


Maya Chen
https://networkcraft.net/author/maya-chen/
AI & Technology Analyst at Networkcraft. I write for the reader who wants to understand — not just be impressed. Formerly at MIT Technology Review. Covers artificial intelligence, machine learning, and the long-term implications of frontier tech.