
Gemini 3.1 Pro Hits 77% on ARC-AGI-2: What That Score Actually Means

AI Benchmarks & Performance · Feb 19, 2026
ARC-AGI-2 is the benchmark designed to be impossible for AI. Gemini 3.1 Pro just scored 77%. Here is what that number actually means — and what it doesn’t.
Maya Chen
AI Researcher & Benchmarks Editor

77.1% on ARC-AGI-2 · 1M token context · ~95% human average · <10% previous AI


Key insight: ARC-AGI-2 was specifically engineered by François Chollet to be resistant to AI memorization and pattern-matching. Humans average 95%. Every major AI lab spent years barely cracking 10% on ARC-AGI-1. Gemini 3.1 Pro just posted 77.1% on the harder successor. That requires explanation.

What Is ARC-AGI-2 and Why Chollet Designed It

François Chollet, the deep learning researcher best known for creating Keras, developed the Abstraction and Reasoning Corpus (ARC) with a single goal: create a benchmark that cannot be solved by memorization. Standard benchmarks like MMLU or HumanEval are conquered the moment a model is trained on enough internet data. ARC tasks require genuine visual and logical reasoning applied to novel patterns never seen before.

ARC-AGI-2 raised the bar further. Tasks are more abstract, grids more complex, and the evaluation methodology tighter. The intentional design makes brute-force scaling fail — you cannot throw more compute at a novel visual reasoning puzzle you have never seen.
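To make the task shape concrete, here is a toy sketch in the spirit of the public ARC task format: a JSON-like object with "train" demonstration pairs and held-out "test" pairs, where grids are lists of integers 0–9 and scoring is exact match on the full output grid. The task and the candidate rule below are invented for illustration, not a real ARC-AGI-2 item.

```python
# Toy task in the style of the public ARC format: "train" pairs demonstrate
# a hidden transformation; "test" pairs check whether you inferred it.
task = {
    "train": [
        {"input": [[1, 0], [0, 1]], "output": [[0, 1], [1, 0]]},
        {"input": [[2, 2], [0, 0]], "output": [[0, 0], [2, 2]]},
    ],
    "test": [
        {"input": [[3, 0], [0, 3]], "output": [[0, 3], [3, 0]]},
    ],
}

def flip_vertically(grid):
    """Candidate rule inferred from the demonstrations: reverse row order."""
    return grid[::-1]

def solve(task, rule):
    """ARC scoring is exact match on the entire grid, with no partial credit."""
    return all(rule(pair["input"]) == pair["output"] for pair in task["test"])

print(solve(task, flip_vertically))  # → True
```

The point of the format is that each task's rule is novel: memorizing past tasks does not help, and the solver must induce the transformation from two or three demonstrations.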


When Chollet announced ARC-AGI-2, he estimated that achieving human-level performance (~95%) would require architectural breakthroughs, not just scale. The February 2026 results suggest that timeline may be compressing faster than he anticipated.


77.1%: How to Interpret an Unprecedented Score

A few clarifications are important. Gemini 3.1 Pro scored 77.1% as a non-ensemble, single-pass model — the most honest measurement category. Ensemble methods (running a task many times and taking the majority answer) can inflate scores significantly. The 77.1% figure is for a single model pass.

~95% · Human average on ARC-AGI-2
77.1% · Gemini 3.1 Pro (single-pass)
<70% · GPT-5.2 (unofficial)
<10% · Previous AI on ARC-AGI-1

What makes this genuinely remarkable is the jump. Previous best-in-class models were struggling to break 50% on ARC-AGI-2. A 77.1% score by a single model — not an ensemble — represents the largest single-generation jump in the benchmark’s history. It strongly suggests that multimodal training at scale, combined with extended context reasoning, is producing qualitatively different reasoning capabilities.

Gemini 3.1 Pro vs Claude 4.6 vs GPT-5.2

With the three major frontier labs all having released or previewed their latest flagship models in Q1 2026, direct benchmark comparison is possible for the first time. Keep in mind that GPT-5.2 and Claude 4.6 figures represent a mix of official and third-party evaluations.

Model | ARC-AGI-2 | MMLU | HumanEval | Context
Gemini 3.1 Pro | 77.1% | 93.2% | 91.4% | 1M tokens
GPT-5.2 | <70% (unoff.) | 92.8% | 90.1% | 256K tokens
Claude 4.6 Opus | ~65% (est.) | 91.5% | 89.7% | 200K tokens
Gemini 3 Deep Think | N/A (spec.) | 94.1% | 93.8% | 1M tokens
Human average | ~95% | N/A | N/A | N/A

* GPT-5.2 ARC-AGI-2 score is unofficial third-party evaluation. Claude 4.6 Opus figure is an estimate based on available ARC-AGI-1 performance and architectural scaling. Gemini 3 Deep Think has not released an official ARC-AGI-2 score.

The Multimodal Advantage

Gemini 3.1 Pro is a natively multimodal model: text, image, speech, and video are not bolt-ons but core training modalities. ARC-AGI-2 tasks are fundamentally visual pattern-recognition and spatial reasoning challenges. The hypothesis gaining traction among researchers is that multimodal grounding — learning to reason about the world through images, not just text — may be producing more robust abstract reasoning capabilities than text-only pretraining can.

Separately, the 1M token context window enables a form of in-context reasoning that shorter-window models cannot perform: loading entire problem contexts, reference examples, and chain-of-thought reasoning chains simultaneously. Whether this contributes materially to ARC-AGI-2 performance is an active research question.
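The practical difference a 1M-token window makes can be sketched as a packing problem: how many reference examples fit in the prompt before the budget runs out. Everything below is a hypothetical illustration; `count_tokens` is a crude characters-divided-by-four stand-in, not any real tokenizer, and the sizes are invented.

```python
def count_tokens(text: str) -> int:
    # Crude stand-in for a tokenizer: roughly one token per 4 characters.
    return max(1, len(text) // 4)

def pack_prompt(problem: str, examples: list[str], budget: int) -> str:
    """Greedily pack reference examples into the prompt until the budget is hit."""
    parts = [problem]
    used = count_tokens(problem)
    for ex in examples:
        cost = count_tokens(ex)
        if used + cost > budget:
            break  # a short window truncates here; a 1M window rarely does
        parts.append(ex)
        used += cost
    return "\n\n".join(parts)

# 100 worked examples of ~1,000 tokens each (invented sizes).
examples = [f"Worked example {i}: " + "x" * 4000 for i in range(100)]
short_prompt = pack_prompt("Solve the task.", examples, budget=8_000)
long_prompt = pack_prompt("Solve the task.", examples, budget=1_000_000)
print(len(short_prompt.split("\n\n")) - 1, "examples fit in an 8K budget")
print(len(long_prompt.split("\n\n")) - 1, "examples fit in a 1M budget")
```

An 8K budget holds a handful of worked examples; a 1M budget holds all of them with room for the model's own reasoning traces, which is the "load everything at once" capability the paragraph above describes.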

Also worth noting from February 19: Taalas HC1 announced an ASIC chip that hardwires Llama 3.1 8B at 17,000 tokens per second at 20× lower cost than GPU inference, and Nvidia DGX Station opened pre-orders for its ~$100K desktop AI workstation. The hardware layer is scaling to match the software breakthroughs.

Bottom Line
77.1% on ARC-AGI-2 is the most credible evidence of non-trivial AI reasoning advancement published in 2026. It does not mean AGI is here. It means the specific capability Chollet designed this benchmark to measure — fluid intelligence applied to novel abstract patterns — is now genuinely present in frontier AI systems.

Maya Chen
AI Researcher covering benchmarks, frontier models, and the path to AGI.
AI & Technology Analyst at Networkcraft. I write for the reader who wants to understand — not just be impressed. Formerly at MIT Technology Review. Covers artificial intelligence, machine learning, and the long-term implications of frontier tech.