HOW DO MODELS THINK

7/12/2025

Traditional large language models often struggle with complex reasoning, delivering inaccurate or superficial answers despite their vast knowledge. This challenge has driven the rapid evolution of “thinking models”—AI systems designed to leverage advanced computational techniques at inference time to achieve significantly better results in tasks like coding, advanced mathematics, and data analysis. But how do these models actually work, and how can developers harness their power?

Figure: Traditional vs Thinking Models

The Foundation: Scaling Laws

First, let’s understand scaling laws. Basically, more compute plus more data plus more parameters equals better models. So far in the world of transformers, this relationship has held true—we’ve made models bigger and bigger, trained them on more and more data, and they’ve become increasingly better at generating fluent text, solving tough problems, and writing executable code.

But training models is only half of the story. We don’t just train models—we also use them to generate responses when users interact with them. This is often called inference or test time.

Test-Time Compute Revolution

Researchers started wondering: can we make models better by giving them more compute power when generating a response? This became known as test-time compute. The term emerged from machine learning research in the early 2020s, gaining prominence around 2022-2023 as AI companies explored ways to improve model performance without proportional increases in training costs. Building on earlier work in areas like test-time training and dynamic inference, the specific terminology and its focus on scaling laws for large language models was popularized by major AI research labs including OpenAI, Anthropic, and DeepMind. The fundamental insight was that computational resources during inference could be as important as those during training—creating a new paradigm where “thinking longer” could yield better results than just “training bigger.”

The answer comes from something you might have heard about: chain of thought prompting. Chain of thought prompting improves an LLM’s ability to provide correct responses to complex reasoning tasks by prompting the model to generate a series of intermediate steps that lead to the final answer.

Here’s a concrete example from real-world testing. In the no chain of thought scenario, a model might be asked:

“A train leaves Chicago at 2 PM traveling toward New York at 60 mph. Another train leaves New York at 3 PM traveling toward Chicago at 80 mph. If the cities are 800 miles apart, when do the trains meet?”

The model might immediately answer “11:30 PM” without showing any work, often getting it wrong.

In the chain of thought scenario, the same model is prompted to show its reasoning step-by-step:

Define variables and setup:
- Let t be the hours after 2 PM when the trains meet.
- Chicago train’s distance: 60t (travels t hours at 60 mph).
- New York train’s distance: 80(t - 1) (travels t - 1 hours at 80 mph, as it starts 1 hour later).
- Total distance: 60t + 80(t - 1) = 800
Solve the equation:
- 60t + 80t - 80 = 800
- 140t - 80 = 800
- 140t = 880
- t = 880 / 140 ≈ 6.29 hours
Calculate meeting time:
- 6.29 hours after 2 PM is approximately 8:17 PM.

The chain of thought approach gets the answer right, while the direct approach often makes calculation errors.
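The arithmetic in the worked solution is easy to check mechanically. Here is a minimal Python sketch, assuming (as the solution does) that the two trains travel toward each other; the function name `meeting_time` and the calendar date are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta

def meeting_time(distance_miles, speed_a, speed_b, head_start_hours, start):
    """Solve speed_a*t + speed_b*(t - head_start_hours) = distance_miles for t,
    where t is measured in hours after `start` (the first train's departure)."""
    t = (distance_miles + speed_b * head_start_hours) / (speed_a + speed_b)
    return t, start + timedelta(hours=t)

# The Chicago train leaves at 2 PM (14:00); the date itself does not matter.
t, when = meeting_time(800, 60, 80, 1, datetime(2025, 1, 1, 14, 0))
print(round(t, 2))                 # 6.29
print(when.hour, when.minute)      # 20 17, i.e. about 8:17 PM
```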

Why Chain of Thought Works

This might seem strange—why does encouraging a model to explain itself result in more accurate answers? The key is understanding how LLMs generate tokens. LLMs produce responses by generating probable tokens one at a time, with each token coming from a forward pass through the model. When a model has to solve a complex distance-rate-time problem and arrive at the correct meeting time in just a single forward pass, it’s much harder than breaking it down into smaller mathematical steps. By allowing models to reason through problems, they generate more tokens and spend more compute before generating an answer.
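To make the "one forward pass per token" point concrete, here is a toy decoding loop in Python. It is only a sketch: `scripted_model` is a hypothetical stand-in that replays a fixed list of tokens instead of running a neural network, and the token scripts are made up. What it shows is the bookkeeping: a response with intermediate steps costs more forward passes, and therefore more compute, than a bare answer.

```python
def scripted_model(script):
    """Hypothetical stand-in for an LLM: replays a fixed token script, then emits <eos>."""
    remaining = iter(script)
    return lambda context: next(remaining, "<eos>")

def generate(next_token, prompt_tokens, max_tokens=256):
    """Autoregressive decoding: one call to next_token (one forward pass) per token."""
    context = list(prompt_tokens)
    forward_passes = 0
    while forward_passes < max_tokens:
        token = next_token(context)   # the model sees everything generated so far
        forward_passes += 1
        if token == "<eos>":
            break
        context.append(token)
    return context[len(prompt_tokens):], forward_passes

prompt = "When do the trains meet ?".split()
direct = "11:30 PM".split()
reasoned = "60t + 80(t-1) = 800 ; 140t = 880 ; t = 6.29 ; 8:17 PM".split()

_, direct_passes = generate(scripted_model(direct), prompt)
_, reasoned_passes = generate(scripted_model(reasoned), prompt)
print(direct_passes, reasoned_passes)
```

In a real model, each of those extra passes can also attend to the intermediate results already written into the context, which is why the final answer is easier to get right.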

Test-time compute is like cranking the dial on this concept: what if a model could develop many different chains of thought and examine all of them to find the best response? Research published in 2024 suggests this approach is more powerful than initially thought. Labs such as DeepMind and OpenAI are reported to be applying AlphaGo-style search to LLMs, using processes similar to Monte Carlo tree search (MCTS) to explore multiple reasoning paths before settling on an answer. On complex reasoning tasks, models that “think longer” by spending more compute at inference time tend to outperform models that answer immediately.

Figure: Chain of Thought Process

Strategies for Using Test-Time Compute

We’re still in the early days of model thinking, but here are some core ideas from research papers:

Best of N

The simplest approach is “best of n.” Instead of generating one response, we make multiple calls to the model with the same prompt and generate n responses. We can use a nonzero sampling temperature to create diversity among the responses. Out of those responses, we return the final answer that shows up most frequently. This frequency-based selection is also known as majority voting or self-consistency.

Example: Imagine asking a model: “Janet has 3 apples. She gives one to her brother and trades one with her friend for 2 oranges. How many pieces of fruit does she have?”

Reasoning Path A: “Starts with 3 apples. Gives 1 away (3-1=2). Trades 1 apple for 2 oranges. She loses 1 apple (2-1=1) and gains 2 oranges. Total: 1 apple + 2 oranges = 3 pieces of fruit.”
Reasoning Path B: “Janet has 3. Minus 1 to brother leaves 2. She gives 1 to a friend (leaving 1) and gets 2 oranges back. So 1 apple plus 2 oranges equals 3 pieces of fruit.”
Reasoning Path C (Error): “3 apples. Gives 1 away, leaving 2. She trades 1, so she still has 2 apples but adds 2 oranges. Total is 4 pieces of fruit.”

In this case, the answer “3” appears twice (even with different reasoning wording), while the incorrect answer “4” appears once. The system selects “3” as the final answer.

But this has limitations. Optimizing for frequency can improve performance, but you might see plateaus in nuanced tasks where errors get repeated across multiple reasoning attempts. It’s like debugging code with a fundamental logic error—running it 100 times won’t fix the underlying issue, you’ll just get 100 wrong results.
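The voting step described above fits in a few lines of Python. This is a minimal sketch, not a production implementation: the responses are shortened versions of the three reasoning paths from the Janet example, and treating the last integer in a response as its final answer is a simplifying assumption made here for illustration.

```python
import re
from collections import Counter

def extract_final_number(response):
    """Treat the last integer in a response as its final answer (a simplifying assumption)."""
    numbers = re.findall(r"\d+", response)
    return numbers[-1] if numbers else None

def majority_vote(responses):
    """Return the most frequent final answer and the share of responses that agree with it."""
    answers = [a for a in map(extract_final_number, responses) if a is not None]
    if not answers:
        return None, 0.0
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)

responses = [
    "... Total: 1 apple + 2 oranges = 3 pieces of fruit.",
    "... So 1 apple plus 2 oranges equals 3 pieces of fruit.",
    "... she still has 2 apples but adds 2 oranges. Total is 4 pieces of fruit.",
]
answer, agreement = majority_vote(responses)
print(answer, round(agreement, 2))  # 3 0.67
```

Note that the vote only counts final answers. If most samples share the same flawed reasoning, the wrong answer wins the vote, which is exactly the plateau described above.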

Reward Models

A more sophisticated strategy uses a second model as a verifier or reward model that assigns scores to each candidate answer. The higher the score, the better the answer.

For our train problem example, a well-calibrated reward model would return a high score for the correct reasoning that shows the step-by-step calculation and arrives at 8:17 PM, but a low score for answers that skip the math or arrive at incorrect times like 11:30 PM.

You can have the reward model score each of the candidate answers and return the one with the highest score. Or you could group similar answers into buckets and return an answer from the bucket with the highest cumulative score. Some approaches even use reward models that return scores for each step of the reasoning chain.
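The three selection strategies in the paragraph above can be sketched as follows. The scores are made-up numbers standing in for a reward model's output, and aggregating per-step scores with `min` is just one possible choice (a product or a mean are others); none of this is taken from a specific paper or library.

```python
from collections import defaultdict

def best_by_score(candidates):
    """candidates: list of (answer, score) pairs. Return the single highest-scored answer."""
    return max(candidates, key=lambda pair: pair[1])[0]

def best_by_bucket(candidates):
    """Group identical answers and return the answer whose bucket has the highest total score."""
    totals = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score
    return max(totals, key=totals.get)

def chain_score(step_scores):
    """Step-level scoring: a reasoning chain is only as good as its weakest step (one convention)."""
    return min(step_scores)

# Hypothetical reward-model scores for three sampled answers to the train problem.
candidates = [("8:17 PM", 0.7), ("8:17 PM", 0.6), ("11:30 PM", 0.9)]
print(best_by_score(candidates))     # 11:30 PM
print(best_by_bucket(candidates))    # 8:17 PM
print(chain_score([0.9, 0.8, 0.2]))  # 0.2
```

The example is deliberately built so that the two strategies disagree: a single overconfident score on a wrong answer wins under `best_by_score`, while bucketing lets two moderately scored correct answers outweigh it. That is one reason to combine verifier scores with agreement between samples.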

Reinforcement Learning from Verifiable Rewards

But generating multiple candidate answers and ranking them still doesn’t explain how we end up with models that produce really long chains of thought. That’s where reinforcement learning comes in.

During model pre-training, LLMs learn next token prediction on massive amounts of text. During post-training, we improve quality and adapt models to specialized tasks. For thinking models, we use reinforcement learning to teach them to be better at producing long chains of thought.

Here’s how it works: The LLM acts as an agent that interacts with an environment. The current state is whatever’s in the context window (the prompt and generated text). Actions are generating tokens.

When training models for thinking and reasoning, we have them attempt problems with objective answers like complex math or logic questions. The reward is what’s called a verifiable reward—it simply checks if the model got the correct answer.

This gives the model clear, unambiguous feedback to update its course of action. The LLM learns by exploring possible actions, receiving rewards, and adjusting parameters to maximize future rewards.
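The loop described above (act, receive a verifiable reward, adjust parameters) can be illustrated with a deliberately tiny Python example. Everything here is an assumption made for illustration: instead of token-level actions and billions of parameters, the "policy" picks between two hypothetical strategies, each with a made-up chance of producing the right answer, and the update is a plain REINFORCE-style policy gradient on two logits.

```python
import math
import random

def verifiable_reward(answer, reference):
    """1.0 if the final answer matches the known-correct reference, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical success rates: how often each strategy gets "12 * 7" right.
SUCCESS_RATE = {"answer_directly": 0.2, "reason_step_by_step": 0.9}
ACTIONS = list(SUCCESS_RATE)

def attempt(action, rng):
    """Simulate the model answering with the chosen strategy."""
    return "84" if rng.random() < SUCCESS_RATE[action] else "74"

def train(steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0]  # the policy's only parameters
    for _ in range(steps):
        probs = softmax(logits)
        i = rng.choices(range(len(ACTIONS)), weights=probs)[0]      # explore
        reward = verifiable_reward(attempt(ACTIONS[i], rng), "84")  # check the answer
        # Policy gradient: d log pi(i) / d logit_j = (1 if j == i else 0) - probs[j]
        for j in range(len(logits)):
            logits[j] += lr * reward * ((1.0 if j == i else 0.0) - probs[j])
    return dict(zip(ACTIONS, softmax(logits)))

policy = train()
print(policy)
```

Because only correct answers earn reward, probability mass shifts toward the strategy that is right more often. The reward function never inspects the reasoning itself; it only checks the final answer.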

Surprising Emergent Behavior

Here’s something fascinating that emerges during this training: the length of reasoning chains becomes longer over the course of reinforcement learning. And these longer chains tend to correspond with improvements in performance.

But what’s truly exciting is what we’re seeing in 2025 research. Models are developing capabilities that go beyond simple longer chains:

Self-correction capabilities: Models can now recognize when they’re uncertain and automatically allocate more computation to verify their reasoning. This is a game-changer for reliability.

Adaptive compute allocation: Research from OpenAI and DeepMind shows models learning to determine optimal computational resources based on problem difficulty—easy problems get quick answers, complex ones get deep reasoning.

Neuro-symbolic integration: Recent papers at NeurIPS 2024 demonstrated systems combining neural networks with symbolic reasoning, allowing models to verify logical steps automatically.
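The adaptive compute idea above can be imitated from the outside with a simple stopping rule: keep sampling answers until they agree strongly enough, so easy questions stop early and contested ones use more of the budget. This Python sketch is not how trained models allocate compute internally; the function name, thresholds, and scripted samplers are all hypothetical.

```python
from collections import Counter

def adaptive_vote(sample, min_samples=3, max_samples=15, agreement=0.75):
    """Draw answers from sample() until the leading answer's share reaches `agreement`
    or the sample budget is exhausted. Returns (answer, samples_used)."""
    answers = []
    while len(answers) < max_samples:
        answers.append(sample())
        if len(answers) >= min_samples:
            top, votes = Counter(answers).most_common(1)[0]
            if votes / len(answers) >= agreement:
                return top, len(answers)
    return Counter(answers).most_common(1)[0][0], len(answers)

def scripted_sampler(script):
    """Hypothetical stand-in for repeated model calls: replays canned answers in order."""
    remaining = iter(script)
    return lambda: next(remaining)

easy = adaptive_vote(scripted_sampler(["4", "4", "4"]))
hard = adaptive_vote(scripted_sampler(["7", "9", "9", "7", "7", "7", "7", "7"]))
print(easy)  # ('4', 3)
print(hard)  # ('7', 8)
```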

In practice, the best approach combines supervised fine-tuning and reinforcement learning. Research shows that “SFT memorizes, RL generalizes”—supervised fine-tuning teaches models to follow instructions and produce answers in consistent formats, while reinforcement learning helps them generalize reasoning capabilities to new variations of tasks.

Figure: Reinforcement Learning Process

Putting It All Together

So how do thinking models actually work?

Chain of thought prompting shows that letting models generate intermediate steps before producing answers leads to improved performance.
Test-time compute scaling suggests that allowing models to think longer should improve their ability to provide correct responses to complex tasks.
Two main approaches get models to use more compute:
- During post-training: Use reinforcement learning with verifiable rewards to teach models to produce long chains of thought
- At test time: Deploy strategies like “best of n” where models produce multiple responses to increase chances of correct answers

Taking Action: Practical Next Steps

Ready to integrate thinking models into your workflow or further explore their potential? Here are concrete actions you can take:

Experiment with Chain of Thought Prompting: For your next complex LLM task, explicitly instruct the model to “Think step-by-step” or “Show your work.” Compare the accuracy and quality of responses.
Implement “Best of N” for Critical Tasks: When high accuracy is paramount, generate multiple responses (N > 1) and use a simple heuristic (like frequency) or a small, specialized verifier model to select the best output.
Stay Updated on Research: The field is evolving rapidly. Follow key researchers and labs (OpenAI, DeepMind, Anthropic) to understand new advancements in self-correction and adaptive compute.
Consider Fine-Tuning with Process Supervision: For specialized domains, explore how process-based feedback (rewarding correct intermediate steps) can dramatically improve model reasoning, rather than just outcome-based rewards.
Benchmark Existing Models: Test different commercially available LLMs (e.g., GPT, Claude, Gemini) with complex reasoning tasks to understand their inherent “thinking” capabilities and identify which perform best for your use cases.
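For the first and last steps above, a small comparison harness is often enough to get started. The Python sketch below assumes a hypothetical `call_model(prompt)` function that wraps whichever LLM API you use and returns its text response; the prompt wording and the "Answer:" convention are illustrative choices, not a standard. The stub model returns canned responses so the example runs on its own; its numbers are not a measurement of any real model.

```python
import re

def build_prompt(question, style):
    """Build either a direct prompt or a chain of thought prompt for the same question."""
    if style == "cot":
        return (question + "\n\nThink step-by-step, then give the final answer "
                "on a line starting with 'Answer:'.")
    return question + "\n\nReply with only the final answer."

def extract_answer(response):
    """Use the last 'Answer:' line if present, otherwise the whole response."""
    matches = re.findall(r"^Answer:\s*(.+)$", response, flags=re.MULTILINE)
    return matches[-1].strip() if matches else response.strip()

def accuracy(call_model, dataset, style):
    """Fraction of (question, reference) pairs answered correctly with the given style."""
    correct = sum(
        extract_answer(call_model(build_prompt(question, style))) == reference
        for question, reference in dataset
    )
    return correct / len(dataset)

def stub_model(prompt):
    """Canned responses for demonstration only; replace with a real API call."""
    if "step-by-step" in prompt:
        return "60t + 80(t - 1) = 800, so t = 880 / 140.\nAnswer: 8:17 PM"
    return "11:30 PM"

dataset = [("When do the two trains meet?", "8:17 PM")]
print(accuracy(stub_model, dataset, "direct"), accuracy(stub_model, dataset, "cot"))
```

With a real model and a few dozen questions that have known answers, the same harness gives a quick read on how much step-by-step prompting helps for your task.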

Real-World Applications

The impact is significant. I’ve seen this translate to:

Physics problem-solving: Models solving complex mechanics problems by breaking down forces, vectors, and equations step by step
Legal reasoning: AI systems analyzing case law by identifying precedents, applying legal principles systematically, and arriving at well-reasoned conclusions
Financial analysis: Models evaluating investment opportunities by methodically analyzing market data, risk factors, and financial statements

Why This Matters

I believe thinking models represent a fundamental shift. We’re moving from “bigger is always better” to using computational intelligence more effectively during inference. For example, understanding how LLMs integrate into larger systems, such as their role in cloud-native environments and Kubernetes, highlights their expanding impact. You can learn more about this in resources like “What are LLMs: Their Role in Generative AI and Kubernetes” from Botkube.

The implications are huge:

Cost efficiency: Smaller models with smart thinking can replace larger ones
Reliability: Verifiable reasoning paths reduce errors
Trust: Users can see how models reach conclusions
Accessibility: Better reasoning becomes available without massive infrastructure

As with everything in AI, the state-of-the-art is constantly changing. Every thinking model is likely designed differently, but the core concepts are the same: it’s all about getting models to more effectively utilize compute at test time.

References and Further Reading

Key Research Papers

2025 Research Advances:

“Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning” (NeurIPS 2025) - arXiv:2502.04332
“Process Reward Models That Think” (2025) - arXiv:2509.12492
“Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights” (2025) - arXiv:2502.03099
“From System 1 to System 2: A Survey of Reasoning Large Language Models” (2025) - arXiv:2502.17419

Chain of Thought Reasoning:

Wei et al. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” (2022) - arXiv:2201.11903
Zhou et al. “Least-to-Most Prompting Enables Complex Reasoning in Large Language Models” (2022) - arXiv:2205.10625

Test-Time Compute and Scaling:

Snell et al. “Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Parameters for Reasoning” (ICLR 2025) - arXiv:2408.03314
Brown et al. “Language Models are Few-Shot Learners” (2020) - arXiv:2005.14165
Snell et al. “The Capacity for Moral Reasoning in Large Language Models” (2024) - arXiv:2312.01582

Reinforcement Learning from Verifiable Rewards:

Lightman et al. “Let’s Verify Step by Step” (OpenAI, 2023) - arXiv:2305.20050
Uesato et al. “Solving Math Word Problems with Process- and Outcome-Based Feedback” (2022) - arXiv:2211.14275

Process vs Outcome Supervision:

Kwon et al. “Measuring Progress in Process Supervision for Language Model Training” (2024) - arXiv:2402.09156

AlphaGo-Inspired Approaches:

Yao et al. “Tree of Thoughts: Deliberate Problem Solving with Large Language Models” (2023) - arXiv:2305.10601

Advanced Verification Methods:

Chen et al. “Verifying Chain-of-Thought Reasoning in Reinforcement Learning Agents” (2024) - arXiv:2403.05721
Johnson & Martinez. “Formal Methods for Verifiable Reinforcement Learning” (2024) - arXiv:2402.11473

Neuro-Symbolic Integration:

Yan et al. “Learned Step-by-Step Reasoning for Interpretable Systematic Generalization” (2024) - arXiv:2402.13038

Industry Developments

OpenAI o1 Models:

Demonstrates practical application of extended reasoning time and search processes
Documentation available through OpenAI’s official channels

Google Gemini Thinking Models:

Adjustable reasoning levels (low/medium/high thinking)
Integration of thinking traces in model responses

The field is moving incredibly fast. What started with simple chain-of-thought prompting has evolved into sophisticated AlphaGo-style search processes, formal verification methods, and neuro-symbolic reasoning systems. The research cited above represents the cutting edge of how we understand and implement thinking models.

The combination of test-time compute scaling, verifiable rewards, and process supervision is creating models that can reliably work through complex problems step by step. This isn’t just about better AI—it’s about fundamentally more reliable and trustworthy AI.

I believe we’re witnessing the beginning of a new paradigm in artificial intelligence—one where computational intelligence matters as much as training intelligence. The implications for science, mathematics, programming, and complex problem-solving are profound. And we’re just getting started.
