ARC Prize Leaderboard: AI Meets Cost Reality

The ARC Prize leaderboard is a live ranking of AI systems trying to solve abstract reasoning problems—the kind that require genuine problem-solving flexibility, not just pattern matching. But here’s the thing: it’s not just a table of accuracy numbers. The leaderboard plots both capability and cost, which means you can actually see which approaches make sense for shipping real products. This guide walks you through what to look for, what to ignore, and how to use the data to make sensible decisions about which models to test on your own problems.

Understanding what you’re looking at

ARC-AGI tests are designed to probe flexible problem-solving, not just memorised answers. The leaderboard is especially handy when your decision is about more than raw capability — it’s about whether a model can solve the right problems at a sensible cost. That makes it a good tool for product managers, researchers, and engineers who care about shipping things that actually run affordably in production.

The leaderboard shows two things at once: how well a system performs, and how much it costs to get that result. High accuracy on its own is eye-catching, but if it costs the earth, it might not be useful outside a lab. When you see trend lines (the same model tested at different compute or thinking budgets), they’re worth your attention — they tell you whether throwing more compute buys much more performance. And keep an eye on entries marked as ‘preview’ or based on partial testing: they’re useful signals, but they can still move; see the ARC testing policy for details: https://arcprize.org/policy

How to read the chart (a quick tour)

  • X-axis: cost per task (lower is better) — a practical proxy for what running the system will cost in the real world.
  • Y-axis: performance (higher is better) — how many tasks the system gets right on the ARC tests.
  • Points: individual submissions — these include base LLMs, reasoning systems, and Kaggle entries.
  • Trend lines: connected points for the same system at different resource budgets — they show marginal returns as you give a system more time or compute.

The pragmatic sweet spot is up and to the left: high performance at a modest cost. Do double-check what assumptions went into the cost estimate before you celebrate.
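Plotting your own shortlist on the same axes makes the trade-off easy to see. The sketch below uses matplotlib; the system names, costs and scores are made-up placeholders, so substitute your own measurements (or numbers you have archived from the leaderboard).

```python
# Plot a shortlist of candidate systems on a cost/performance plane,
# mirroring how the ARC Prize leaderboard presents its data.
# The systems, scores, and costs below are made-up placeholders.
import matplotlib.pyplot as plt

candidates = {
    # name: (cost_per_task_usd, accuracy_percent)
    "model-a (low budget)":  (0.50, 22.0),
    "model-a (high budget)": (4.00, 31.0),
    "model-b":               (1.20, 27.5),
    "kaggle-entry-x":        (0.05, 12.0),
}

fig, ax = plt.subplots(figsize=(7, 5))
for name, (cost, score) in candidates.items():
    ax.scatter(cost, score)
    ax.annotate(name, (cost, score), textcoords="offset points", xytext=(5, 5))

# Cost spans orders of magnitude, so a log x-axis keeps cheap and
# expensive systems readable on one chart.
ax.set_xscale("log")
ax.set_xlabel("Cost per task (USD, log scale; lower is better)")
ax.set_ylabel("Accuracy on ARC tasks (%; higher is better)")
ax.set_title("Candidate systems: performance vs. cost")
ax.grid(True, which="both", alpha=0.3)
plt.tight_layout()
plt.show()
```

The sweet spot, as above, is up and to the left; connected points for the same model at different budgets give you a rough trend line of your own.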

A few things to keep in mind

  • Selection bias: the leaderboard only shows systems below a cost threshold, so very expensive but capable systems may be hidden.
  • Partial testing and ‘preview’ tags: these are provisional placements, not final results — handle them with caution.
  • Pricing assumptions shift: vendor prices and SKUs change; a cost estimate is only accurate for the pricing it used. Pin the pricing source if you rely on a number.
  • Output completeness: when a submission fails to produce full outputs, remaining tasks are marked incorrect — that can skew the aggregated score.

How to actually use this data

Don’t treat the leaderboard as gospel. Instead, use it as a starting point for your own experiments. Here’s a sensible workflow:

  1. Short-list candidates. The leaderboard helps you narrow the field to a handful of systems in your cost band.
  2. Re-run locally. Test the relevant ARC tasks yourself (or a representative subset) using your own inference setup: the same temperature, token limits and batching you plan to ship with (see the harness sketch after this list). Archive the exact leaderboard snapshot (screenshot + HTML) and record the vendor pricing and SDK versions used.
  3. Measure reality. Wall-clock latency, deterministic failure modes, and end-to-end cost matter more than cost-per-task alone — the leaderboard is a pointer, not a promise.
  4. Find the knee. If a model sits on a trend line, test it at several reasoning/time budgets to find the point of diminishing returns (a budget-sweep sketch also follows below). Keep a small runbook describing inference settings and any prompt engineering you used.
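To make step 2 concrete, here is a minimal harness sketch. It assumes a local copy of the public ARC task files (JSON with ‘train’ and ‘test’ lists of input/output grids); solve_task is a placeholder for whatever inference stack you use, the pricing rates in RUN_CONFIG are example numbers to replace with the vendor rates you have pinned, and the exact-match scoring is deliberately simple rather than the official ARC scoring procedure.

```python
# Minimal local re-run harness for a subset of ARC tasks.
import json
import time
from pathlib import Path

# Pin everything that affects the number you will quote later.
RUN_CONFIG = {
    "model": "your-model-id",           # placeholder, not a real model name
    "temperature": 0.0,
    "max_output_tokens": 4096,
    "usd_per_1k_input_tokens": 0.003,   # example rate: pin the real one
    "usd_per_1k_output_tokens": 0.015,  # example rate: pin the real one
}


def solve_task(task: dict, config: dict) -> tuple[list, dict]:
    """Placeholder: call your own model and return one predicted grid per
    test input, plus token usage as {"input_tokens": int, "output_tokens": int}."""
    raise NotImplementedError("wire this up to your own inference setup")


def run_subset(task_dir: str, limit: int = 20) -> None:
    paths = sorted(Path(task_dir).glob("*.json"))[:limit]
    if not paths:
        raise FileNotFoundError(f"no ARC task files found in {task_dir}")

    solved, total_cost, total_seconds = 0, 0.0, 0.0
    for path in paths:
        task = json.loads(path.read_text())
        start = time.perf_counter()
        try:
            predictions, usage = solve_task(task, RUN_CONFIG)
        except Exception:
            # A failed or incomplete run counts as wrong, mirroring how the
            # leaderboard treats missing outputs.
            predictions, usage = [], {"input_tokens": 0, "output_tokens": 0}
        total_seconds += time.perf_counter() - start

        # Simple exact-match scoring: every test grid must match exactly.
        expected = [pair["output"] for pair in task["test"]]
        if predictions == expected:
            solved += 1

        total_cost += (
            usage["input_tokens"] / 1000 * RUN_CONFIG["usd_per_1k_input_tokens"]
            + usage["output_tokens"] / 1000 * RUN_CONFIG["usd_per_1k_output_tokens"]
        )

    n = len(paths)
    print(f"solved {solved}/{n} tasks ({100 * solved / n:.1f}%), "
          f"${total_cost / n:.2f} per task, "
          f"{total_seconds / n:.1f}s wall-clock per task")


# run_subset("data/evaluation")  # point at your local copy of the ARC task files
```

Keeping everything that affects the number inside RUN_CONFIG makes the run reproducible and gives you something concrete to archive alongside the leaderboard snapshot and pricing source.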
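For step 4, a small sweep over budgets is usually enough to spot the knee. In this sketch, evaluate_at_budget is a hypothetical callback that runs your harness at a given reasoning or token budget and returns (accuracy, cost per task); the budget values in the usage comment are illustrative only.

```python
def find_knee(budgets, evaluate_at_budget):
    """Run the same evaluation at increasing budgets and report the
    marginal accuracy gained per extra dollar spent per task."""
    results = []
    for budget in budgets:
        accuracy, cost = evaluate_at_budget(budget)  # caller-supplied callback
        results.append((budget, accuracy, cost))

    # Marginal return relative to the previous (cheaper) budget.
    for (b0, a0, c0), (b1, a1, c1) in zip(results, results[1:]):
        gain_per_dollar = (a1 - a0) / max(c1 - c0, 1e-9)
        print(f"{b0} -> {b1}: +{a1 - a0:.1f} accuracy pts for "
              f"${c1 - c0:.2f} more per task "
              f"({gain_per_dollar:.1f} pts per extra dollar)")

    return results


# Illustrative budgets only, e.g. token or 'thinking' budgets passed
# through to your inference settings:
# find_knee([1_000, 4_000, 16_000, 64_000], evaluate_at_budget=my_eval_fn)
```

When the gain per extra dollar drops sharply between two budgets, the cheaper of the pair is usually the one worth shipping.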

ARC-AGI-2 and later revisions will tweak how efficiency and cost are reported, so be cautious when comparing across versions. Major base model releases or pricing changes can shift cost estimates — plan to retest after big vendor announcements.

Further reading

If you’re curious about cost-efficiency in AI more broadly, take a look at The Future is Small LLMs — it dives into cost and deployment trade-offs. For a deeper look at where LLMs actually struggle, AI Code Assistants Failing at Complex Debugging explores some hard evaluation gaps. Both are on this site.

Sources: ARC Prize leaderboard (https://arcprize.org/leaderboard) and ARC Prize testing policy (https://arcprize.org/policy).


A note on Jeremy Berman’s 2025 solution

If you spot the entry at the top of the leaderboard, it’s worth understanding how it works. Berman reached 79.6% on ARC-AGI-1 at $8.42 per task, roughly 25× more efficient than much more expensive reasoning models. His approach uses evolutionary test-time compute: an AI generates natural-language instructions to solve each task, tests them against the training examples, and iteratively refines the best candidates through multiple revision cycles. The method scales to 40 instruction attempts per task whilst staying cost-effective. Read his full write-up for the technical details and his broader thoughts on reasoning, AGI, and the limitations of current LLMs. The code is open source on GitHub.
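His write-up is the authoritative source; as a rough illustration of the shape of such a loop, here is a schematic sketch. It is not Berman’s code: the three callables (generate_instructions, revise_instruction, apply_instruction) are hypothetical stand-ins for LLM calls and an instruction executor, and the population sizes are invented for illustration.

```python
# Schematic sketch of an evolutionary test-time-compute loop: propose
# candidate natural-language instructions, score them against the task's
# training pairs, and revise the best candidates over several rounds.
# The three callables are hypothetical and must be supplied by the caller.

def evolve_solution(task, generate_instructions, revise_instruction,
                    apply_instruction, rounds=3, population=8, keep_top=2):
    # Initial population: candidate instructions proposed for this task.
    candidates = generate_instructions(task["train"], n=population)
    best = candidates[:keep_top]

    for _ in range(rounds):
        # Fitness: how many training pairs an instruction reproduces when
        # applied to the corresponding training inputs.
        scored = []
        for instruction in candidates:
            hits = sum(
                apply_instruction(instruction, pair["input"]) == pair["output"]
                for pair in task["train"]
            )
            scored.append((hits, instruction))
        scored.sort(key=lambda item: item[0], reverse=True)

        best = [instruction for _, instruction in scored[:keep_top]]
        if scored[0][0] == len(task["train"]):
            break  # a candidate already solves every training pair

        # Next generation: keep the best and add revisions of each of them.
        candidates = best + [
            revise_instruction(instruction, task["train"])
            for instruction in best
            for _ in range(population // keep_top - 1)
        ]

    # Apply the best surviving instruction to the held-out test inputs.
    winner = best[0]
    return [apply_instruction(winner, pair["input"]) for pair in task["test"]]
```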