What is the ARC-AGI benchmark?

ARC-AGI (Abstraction and Reasoning Corpus) is a benchmark designed to measure an AI's ability to solve novel problems it hasn't seen in its training data. Unlike standard LLM benchmarks, it focuses on 'fluid intelligence' rather than knowledge retrieval.

Why does the leaderboard include cost?

Many top-performing AI systems use 'test-time compute' (thinking longer) to solve problems. The leaderboard includes cost to show the trade-off between accuracy and the API/compute spend required to get there.

Can LLMs solve ARC tasks?

Base LLMs like GPT-4 or Claude 3.5 Sonnet actually struggle with ARC tasks on their own. The highest scores are usually achieved by systems that use the LLM as a component in a larger reasoning or program-synthesis loop.

ARC-AGI LEADERBOARD 2026: WHAT AI REASONING COSTS ACTUALLY MEAN

20/11/2025
Updated 8/6/2026

Updated for 2026: this analysis now covers the 2025 Berman solution results and current test-time compute cost models.

We’ve all seen the benchmarks where every new model claims to be “smarter” than the last. But if you’re actually building products, you know that raw capability is only half the story. The real question is: how much does that intelligence cost? I’ve been following the ARC Prize closely because it’s the only benchmark that forces us to look at the “Efficiency Frontier”—the brutal trade-off between how well a model reasons and how many dollars you’re burning to get the answer.

The ARC-AGI leaderboard isn’t just another table of percentages. It’s a map of the path to General AI, and more importantly, it’s a reality check for anyone trying to ship reasoning-heavy agents today.

Who Is This Guide For?

This is for AI Product Managers who need to budget for reasoning tasks, Research Engineers looking for the next breakthrough in test-time compute, and developers who are tired of the “vibe-check” school of model evaluation. If you care about building systems that are both smart and sustainable, this guide is for you.

By the end of this, you’ll know:

How to interpret the ARC-AGI leaderboard’s “Efficiency Frontier” chart.
Why cost-per-task is the most important metric for the next generation of AI agents.
How to use this data to shortlist models for your own reasoning-heavy applications.
What the 2025 Berman solution taught us about the power of “cheap” reasoning.

Understanding the Efficiency Frontier

On the leaderboard chart, the X-axis shows the cost per task while the Y-axis shows performance. The “sweet spot” is the top-left corner: high performance at a low cost. Most base LLMs today sit at the bottom-left: they are cheap, but they fail at novel reasoning. On the other end, you have massive ensembles that use “test-time compute” to brute-force a solution. These sit at the top-right: accurate but prohibitively expensive.
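To make the frontier concrete, here is a minimal Python sketch that picks out the systems no other system beats on both cost and score. The `Entry` type, the system names, and all the numbers are hypothetical, made up for illustration; they are not leaderboard data.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    name: str
    cost_per_task: float  # USD, the chart's X-axis
    score: float          # fraction of tasks solved, the chart's Y-axis


def efficiency_frontier(entries: List[Entry]) -> List[Entry]:
    """Keep only entries that beat every cheaper (or equally priced) entry."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on a cost tie, see the higher score first.
    for e in sorted(entries, key=lambda e: (e.cost_per_task, -e.score)):
        if e.score > best_score:
            frontier.append(e)
            best_score = e.score
    return frontier


# Hypothetical systems, for illustration only.
entries = [
    Entry("base-llm", 0.05, 0.10),
    Entry("small-tuned", 0.50, 0.35),
    Entry("pricey-weak", 2.00, 0.30),   # costs more than small-tuned, scores less
    Entry("big-ensemble", 50.00, 0.80),
]
frontier_names = [e.name for e in efficiency_frontier(entries)]
print(frontier_names)
```

Anything that drops out of the list, like `pricey-weak` here, is a system you would never pick on cost and score alone, because something cheaper already does better.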

The real innovation happens when a line moves “up and to the left.” I recently wrote about how Small LLMs are the future because they are starting to push this frontier, offering specialized reasoning capabilities without the GPT-4 price tag.

How to Read the Trends

I pay the most attention to the trend lines. When you see multiple points connected for the same model, you’re seeing the “scaling law” of reasoning. If throwing 10x more compute only buys you 2% more accuracy, that model is hitting a ceiling. But if the line is steep, it means the architecture scales well with more “thinking time.”

This is crucial for anyone following the 2025 AI reasoning breakthroughs. We are moving from a world of “instant answers” to a world where we can pay for higher accuracy by letting the model think longer.
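One way to put a number on a trend line is to measure how much accuracy each tenfold increase in cost buys. The sketch below assumes you have copied a model's points off the chart as (cost, score) pairs with distinct, positive costs; the function name and both sample curves are hypothetical.

```python
import math
from typing import List, Tuple


def gain_per_10x(points: List[Tuple[float, float]]) -> List[float]:
    """Score gained per tenfold cost increase, for each adjacent pair of points.

    points: (cost_per_task_usd, score) pairs for ONE model, in any order.
    Assumes costs are positive and distinct.
    """
    pts = sorted(points)
    gains = []
    for (c0, s0), (c1, s1) in zip(pts, pts[1:]):
        decades = math.log10(c1 / c0)  # how many "10x steps" apart the points are
        gains.append((s1 - s0) / decades)
    return gains


# Hypothetical curves, for illustration only.
steep = [(0.1, 0.20), (1.0, 0.40), (10.0, 0.60)]  # about +20 points per 10x
flat = [(0.1, 0.50), (1.0, 0.52), (10.0, 0.54)]   # about +2 points per 10x

print(gain_per_10x(steep))
print(gain_per_10x(flat))
```

A model like `flat` is the "10x more compute for 2% more accuracy" case: it is hitting a ceiling, and extra thinking time is mostly wasted money.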

A Few Things to Keep in Mind

Don’t let the high scores fool you without checking the fine print. The leaderboard has a selection bias; it only shows systems below a certain cost threshold. If a system costs $1,000 to solve one logic puzzle, it won’t even appear here. Also, pricing assumptions shift—if OpenAI drops their API prices tomorrow, the entire leaderboard shifts. Always pin your pricing source if you’re using this for a budget proposal.
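Pinning a pricing source can be as simple as storing the prices next to where and when you read them. This is a minimal sketch, assuming token-based API pricing; the `PriceSnapshot` fields, the prices, and the token counts are all hypothetical placeholders, not real vendor numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceSnapshot:
    source: str              # where the prices were read (placeholder below)
    retrieved: str           # ISO date the prices were read
    usd_per_mtok_in: float   # USD per million input tokens
    usd_per_mtok_out: float  # USD per million output tokens


def cost_per_task(prices: PriceSnapshot, tokens_in: int, tokens_out: int,
                  attempts: int = 1) -> float:
    """Estimated USD per task, given average token usage per attempt."""
    per_attempt = (tokens_in * prices.usd_per_mtok_in
                   + tokens_out * prices.usd_per_mtok_out) / 1_000_000
    return per_attempt * attempts


# Hypothetical snapshot, for illustration only.
snapshot = PriceSnapshot(
    source="vendor pricing page (placeholder)",
    retrieved="2026-06-08",
    usd_per_mtok_in=2.0,
    usd_per_mtok_out=10.0,
)
estimate = cost_per_task(snapshot, tokens_in=20_000, tokens_out=50_000, attempts=2)
print(f"${estimate:.2f} per task, prices as of {snapshot.retrieved}")
```

If prices change, you create a new snapshot instead of editing the old one, so every figure in your budget proposal can be traced back to a dated source.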

How to Actually Use This Data

Don’t treat the leaderboard as a final verdict. Instead, use it as your shortlist. I recommend a four-step workflow: first, find 3-4 systems in your cost band. Second, run those models against a subset of ARC tasks using your own inference setup—latency and deterministic failure modes matter more in production than a leaderboard score. Third, find the “knee” of the curve where more compute stops helping. Finally, archive your results; benchmarks like ARC-AGI-2 will soon change how efficiency is reported.
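Steps one and three of that workflow are easy to script. The sketch below assumes your results are plain tuples; the function names, the 5-points-per-10x threshold, and every number are hypothetical choices for illustration, and you should tune the threshold to your own budget.

```python
import math
from typing import List, Tuple

System = Tuple[str, float, float]  # (name, cost_per_task_usd, score)


def shortlist(systems: List[System], min_cost: float, max_cost: float,
              limit: int = 4) -> List[System]:
    """Step 1: the best-scoring systems inside your cost band."""
    in_band = [s for s in systems if min_cost <= s[1] <= max_cost]
    return sorted(in_band, key=lambda s: s[2], reverse=True)[:limit]


def find_knee(points: List[Tuple[float, float]],
              min_gain_per_10x: float = 0.05) -> Tuple[float, float]:
    """Step 3: the last (cost, score) point before a tenfold cost increase
    buys less than min_gain_per_10x. Assumes positive, distinct costs."""
    pts = sorted(points)
    for (c0, s0), (c1, s1) in zip(pts, pts[1:]):
        if (s1 - s0) / math.log10(c1 / c0) < min_gain_per_10x:
            return (c0, s0)
    return pts[-1]  # no knee found: the curve is still climbing


# Hypothetical data, for illustration only.
systems = [("a", 0.05, 0.10), ("b", 2.00, 0.40), ("c", 8.00, 0.60),
           ("d", 15.00, 0.55), ("e", 300.00, 0.85)]
picked = [name for name, _, _ in shortlist(systems, min_cost=1.0, max_cost=20.0)]

curve = [(0.1, 0.20), (1.0, 0.45), (10.0, 0.55), (100.0, 0.57)]
knee = find_knee(curve)
print(picked, knee)
```

The `curve` here stands in for the results of your own runs in step two, not for leaderboard numbers.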

For a deeper look at where models still fail, even with high reasoning scores, check out my analysis on AI Code Assistants failing at complex debugging.

The 2025 Berman Breakthrough

I have to mention Jeremy Berman’s 2025 solution. He achieved nearly 80% accuracy on ARC-AGI-1 at just $8.42 per task. That is about 25x cheaper per task than the “expensive” models that were leading before him. His secret? Evolutionary test-time compute. Instead of just asking an LLM for an answer, his system generates natural-language instructions, tests them against examples, and iteratively refines the best ones.

This is strong evidence that system design can matter more than raw parameter count. Berman’s code is open-source, and it’s the best “textbook” available for anyone building high-efficiency reasoning loops today.
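To show the shape of a generate, test, and refine loop, here is a toy Python sketch. It is not Berman’s code: `evolve`, `propose`, and `follow` are hypothetical names, and the two toy functions stand in for what would be LLM calls in a real system. The toy task (filtering a list by a threshold) is also invented for illustration.

```python
from typing import Callable, List, Tuple

Example = Tuple[list, list]  # (input, expected output)


def evolve(examples: List[Example],
           propose: Callable[[List[str], int], List[str]],
           follow: Callable[[str, list], list],
           generations: int = 5, population: int = 3, keep: int = 1):
    """Generate instructions, score them on the examples, refine the best ones."""
    def fitness(instruction: str) -> float:
        solved = sum(follow(instruction, x) == y for x, y in examples)
        return solved / len(examples)

    pool = propose([], population)  # first generation: no parents yet
    best = max(pool, key=fitness)
    for _ in range(generations):
        if fitness(best) == 1.0:    # every example solved, stop spending
            break
        parents = sorted(pool, key=fitness, reverse=True)[:keep]
        pool = parents + propose(parents, population - keep)
        best = max(pool, key=fitness)
    return best, fitness(best)


# Toy stand-ins for the LLM (assumptions, not a real model).
def toy_propose(parents: List[str], n: int) -> List[str]:
    if not parents:
        return [f"keep values above {t}" for t in range(n)]
    out = []
    for i in range(n):
        t = int(parents[i % len(parents)].rsplit(" ", 1)[1])
        out.append(f"keep values above {t + 1 + i // len(parents)}")
    return out


def toy_follow(instruction: str, xs: list) -> list:
    t = int(instruction.rsplit(" ", 1)[1])
    return [x for x in xs if x > t]


examples = [([1, 5, 9], [5, 9]), ([3, 4, 8], [4, 8]), ([2, 6], [6])]
best, score = evolve(examples, toy_propose, toy_follow)
print(best, score)
```

The part worth noticing is the early exit: the loop only pays for another generation when the current best instruction still fails at least one example, which is where the cost savings in this style of system would come from.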

Sources: ARC Prize leaderboard and ARC Prize testing policy.
