WHY AI AGENTS FAIL AT LONG TASKS: THE MAKER FRAMEWORK EXPLAINED

Updated for 2026: The MAKER framework has moved from research paper to production reference architecture. This refresh grounds the concepts in current tooling and implementation patterns.

We have all seen the demos. An AI agent writes a flawless snake game or plans a weekend holiday to Paris in seconds. But ask that same agent to migrate a production database, write a full-length novel, or execute a task requiring hundreds of sequential steps, and it fails. It drifts. It starts hallucinating variables that do not exist.

A November 2025 paper from Cognizant AI Lab, Solving a Million-Step LLM Task with Zero Errors, argues that we have been looking at the problem the wrong way. Reliability over long horizons is less about smarter models and more about better architecture.

The researchers introduced MAKER (Maximal Agentic decomposition, first-to-ahead-by-K Error correction, and Red-flagging), an instance of what the paper calls massively decomposed agentic processes. It allowed an LLM to solve a Tower of Hanoi puzzle requiring over one million steps without a single mistake. Here is how it works, and how you can apply these principles to your own agentic workflows.

Who Is This Guide For?

  • You are building autonomous agents or data pipelines and keep hitting reliability walls after a few dozen steps.
  • You have already tried bigger models and better prompts, and the failures keep happening.
  • You want an architecture that treats LLMs as unreliable components rather than oracles.

By the end of this guide, you will…

  • Understand why probability decay, not model quality, is the real enemy of long-running agents.
  • Know how to implement radical statelessness so your agents stop compounding their past mistakes.
  • Have a concrete voting pattern you can drop into an existing codebase today.

The Brutal Maths of Probability Decay

Agents do not necessarily fail because they are dumb. They fail because of probability decay: small per-step error rates compound across a long chain of steps. This is the single most important mathematical concept for AI engineers to grasp.

Imagine a state-of-the-art model that is 99.9 percent accurate at following a single instruction. That sounds production-ready.

  • 1 step: 99.9 percent success rate.
  • 10 steps: roughly 99 percent success rate.
  • 1,000 steps: 0.999 to the power of 1,000 is about 37 percent.

Real-world engineering tasks, such as refactoring a legacy codebase or reconciling months of financial data, often require thousands of steps. At 5,000 steps the same model finishes cleanly less than 1 percent of the time, so failure is all but guaranteed if you rely on a standard chain of thought or monolithic agent loop.
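To check the arithmetic yourself, here is a minimal sketch. It assumes every step succeeds independently with the same probability, which is a simplification, and the function name chain_success is purely illustrative.

```python
def chain_success(p_step: float, n_steps: int) -> float:
    """Probability that n_steps independent steps all succeed."""
    return p_step ** n_steps


# Same per-step accuracy, increasingly long tasks.
for n in (1, 10, 1_000, 5_000):
    print(f"{n:>5} steps: {chain_success(0.999, n):.1%}")
```

Try raising the per-step accuracy to 0.9999 and the step count to a million. A better model moves the cliff further out, but it does not remove it.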

MAKER: Inverting the Architecture

MAKER solves this by treating reliability as a system design challenge, focusing on three core pillars that invert how we usually build agents.

1. Radical Statelessness (Kill the Context)

In a standard agent loop, every new action and observation gets appended to the chat history. The context grows, the model gets distracted by its own past verbosity, and drift sets in.

MAKER enforces statelessness.

  1. Isolate the Step: The agent is not given the history of the previous 500 steps. It is given only the current state of the world, for example the current file contents or database schema, and the immediate rule to apply.
  2. Execute and Die: The agent calculates the move, updates the state object, and terminates.
  3. Repeat: A fresh agent instance spins up for the next step, viewing the new state with fresh eyes.

By removing the chat history, you remove the possibility of the model getting confused by its past mistakes or rambling thoughts. The state object becomes the only memory that matters.
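The loop above can be sketched in a few lines. This is a toy under stated assumptions: WorldState, toy_agent, and run_stateless are hypothetical names, and toy_agent is a deterministic stand-in for what would be one fresh LLM call per step in a real system.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class WorldState:
    """Everything the next step needs. There is no chat history field."""
    counter: int
    target: int


def toy_agent(state: WorldState) -> WorldState:
    # Stand-in for a single fresh LLM call: it sees only the current state,
    # makes one move, and returns the updated state.
    return replace(state, counter=state.counter + 1)


def run_stateless(
    state: WorldState,
    agent: Callable[[WorldState], WorldState],
    is_done: Callable[[WorldState], bool],
    max_steps: int = 10_000,
) -> WorldState:
    for _ in range(max_steps):
        if is_done(state):
            return state
        # Only the state crosses the step boundary, never earlier outputs.
        state = agent(state)
    raise RuntimeError("step budget exhausted before the task finished")


final = run_stateless(
    WorldState(counter=0, target=500),
    toy_agent,
    is_done=lambda s: s.counter >= s.target,
)
print(final)
```

The frozen dataclass is a design choice, not a requirement. It simply makes it harder for a step to mutate shared memory behind the orchestrator's back.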

2. Red Flagging (Syntax as a Proxy for Logic)

The researchers observed a useful correlation in LLM output: when a response is malformed, or rambles on far longer than expected, the reasoning behind it is more likely to be wrong as well.

If you ask for a JSON object and the model returns a paragraph of text talking about JSON, it is confused. MAKER uses a strict parser, or red flagging, to handle this.

  • Do Not Repair: If the output is not perfectly formatted, or if it exceeds a token limit indicating rambling, the system immediately discards it.
  • Just Retry: It does not try to heal the JSON. It treats the syntax error as a red flag for a deeper logic error and forces a retry.
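A red-flag check along these lines might look like the sketch below. The function name parse_or_flag, the required "move" key, and the 500-character cap are all assumptions for illustration. In particular, the character cap is a crude stand-in for a real token limit.

```python
import json
from typing import Optional


def parse_or_flag(
    raw: str,
    required_keys: tuple = ("move",),
    max_chars: int = 500,
) -> Optional[dict]:
    """Return the parsed object, or None if the output raises a red flag."""
    if len(raw) > max_chars:
        return None  # Too long: treat as rambling.
    try:
        parsed = json.loads(raw)  # Strict: any surrounding prose fails here.
    except json.JSONDecodeError:
        return None  # Malformed: discard, do not attempt a repair.
    if not isinstance(parsed, dict):
        return None
    if any(key not in parsed for key in required_keys):
        return None
    return parsed


print(parse_or_flag('{"move": "A->C"}'))
print(parse_or_flag('Sure! Here is the JSON: {"move": "A->C"}'))
```

A None result means "sample again", not "fix it up". That is the whole point of the pattern.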

3. First-to-Ahead-by-K Voting

This is the technique that makes zero-error million-step tasks possible. For critical steps, you do not ask the model once. You ask it multiple times in parallel and use a voting algorithm derived from the Gambler’s Ruin problem.

You do not need a massive consensus. You just need one answer to be K votes ahead of the others.

  • Example: If K equals 3 and Option A has 5 votes while Option B has 2, the difference is 3. Option A wins immediately.

This lets you use smaller, cheaper models, such as Llama 3 8B or GPT-4o-mini, and still reach a level of reliability on long, decomposed tasks that a single call to a massive reasoning model may not match. A dumb model checked 10 times is often more reliable, and cheaper, than a genius model checked once.
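To get a feel for why a small lead is enough, here is a sketch based on the textbook Gambler's Ruin result. It assumes only two candidate answers and independent samples, each correct with probability p. Real model outputs are messier than that, so treat the numbers as intuition rather than a guarantee. The name win_probability is illustrative.

```python
def win_probability(p: float, k: int) -> float:
    """Chance the correct answer gets k votes ahead before the wrong one does.

    Simplified model: two candidate answers, independent samples, each sample
    correct with probability p (the classic Gambler's Ruin setup).
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    ratio = (1.0 - p) / p
    return 1.0 / (1.0 + ratio ** k)


# A mediocre model (70 percent per sample) with increasingly strict leads.
for k in (1, 3, 5, 10):
    print(f"k={k:>2}: {win_probability(0.7, k):.4f}")
```

The error term shrinks geometrically with K, which is why a modest lead can lift a mediocre sampler to a very high per-step reliability.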

Implementation: How to Build This Today

You do not need to be building a Tower of Hanoi solver to use this. Here is the architectural workflow for applying MAKER to a data processing task.

Step 1: Define Atomic State

Stop passing strings of text between agents. Instead, define a rigid state object. This object must contain absolutely everything the agent needs to make the next decision: file contents, error logs, variable values. Do not include a chat history or a list of previous conversation turns. The agent should wake up, see the state, make a move, and shut down.

Step 2: The Voting Worker

Instead of a single API call to your LLM provider, wrap your logic in a voting loop:

  1. Parallel Fetch: Spin up 5 or more parallel requests to the model with the same prompt and state.
  2. Strict Validation: Pass every response through a syntax checker. If a response is malformed, for example invalid JSON, discard it immediately. Do not count it as a vote.
  3. Count and Compare: Tally the valid responses.
  4. Check Threshold: If the leading option is ahead of the runner-up by your defined K threshold, for example 3 votes, commit that action. If not, request more samples until a winner emerges.

Here is a simplified implementation of the voting worker:

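from concurrent.futures import ThreadPoolExecutor
from pydantic import BaseModel, ValidationError

class State(BaseModel):
    file_contents: str
    instruction: str

class Action(BaseModel):
    move: str
    reasoning: str

def llm_call(instruction: str, file_contents: str) -> str:
    # Placeholder: replace with a call to your LLM provider that returns raw text.
    raise NotImplementedError("wire up your LLM client here")

def fetch_action(state: State) -> Action | None:
    raw = llm_call(state.instruction, state.file_contents)
    try:
        return Action.model_validate_json(raw)
    except ValidationError:
        return None  # Red flag: discard immediately, never repair

def vote_on_action(state: State, k: int = 3, max_samples: int = 15, batch_size: int = 5) -> Action:
    votes: dict[str, int] = {}
    first_seen: dict[str, Action] = {}  # One representative Action per distinct move
    samples = 0
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        while samples < max_samples:
            # Never request more than the remaining sample budget.
            n = min(batch_size, max_samples - samples)
            futures = [pool.submit(fetch_action, state) for _ in range(n)]
            for future in futures:
                samples += 1  # Count every request so red-flagged outputs cannot loop forever
                action = future.result()
                if action is None:
                    continue  # Discarded responses never count as votes
                votes[action.move] = votes.get(action.move, 0) + 1
                first_seen.setdefault(action.move, action)
                ranked = sorted(votes.values(), reverse=True)
                runner_up = ranked[1] if len(ranked) > 1 else 0  # A unanimous leader can win too
                if ranked[0] - runner_up >= k:
                    leader = max(votes, key=votes.get)
                    return first_seen[leader]
    raise RuntimeError("No consensus reached")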

Why This Matters

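This paper provides an economic scaling law that favors engineering over raw compute. With voting, the lead K needed to keep an entire task reliable grows only logarithmically with the number of steps, so the total cost of a task with s steps grows roughly in proportion to s log s instead of becoming unaffordable.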
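A rough way to see the scaling for yourself is sketched below. It is a back-of-envelope calculation under a simplified model (two candidate answers per step, independent samples, per-sample accuracy p above 50 percent), not a reproduction of the paper's analysis. The function name min_lead and the 95 percent whole-task target are assumptions.

```python
import math


def min_lead(p: float, n_steps: int, target: float = 0.95) -> int:
    """Smallest lead K so that all n_steps votes are won with probability >= target."""
    if not 0.5 < p < 1.0:
        raise ValueError("p must be between 0.5 and 1 (exclusive)")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    # Each vote is won with probability 1 / (1 + r**K), where r = (1 - p) / p.
    # All n_steps votes must succeed, so we need
    # r**K <= target ** (-1 / n_steps) - 1. expm1 keeps this accurate
    # when n_steps is very large.
    slack = math.expm1(-math.log(target) / n_steps)
    needed = math.log(slack) / math.log((1.0 - p) / p)
    return max(1, math.ceil(needed))


# Task length grows a thousandfold at each row; watch how slowly K moves.
for steps in (1, 1_000, 1_000_000):
    print(f"{steps:>9,} steps -> K = {min_lead(0.9, steps)}")
```

Under these assumptions, going from a thousand steps to a million only adds a few votes of required lead per step, which is the intuition behind the favorable cost curve.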

If you are building autonomous coding agents or data pipelines, stop waiting for the model that just gets it. Start architecting systems that assume the model is unreliable. Treat LLMs as stochastic, fallible components, like a flaky network request, and build the redundancy around them.

If you want to explore other approaches to reliable AI systems, the comparison of major AI CLI coding assistants and the look at how Microsoft’s AutoGen framework handles multi-agent coordination both complement the stateless, voting-backed approach MAKER advocates.

Next Steps

  1. Refactor for State: Look at your current agent. Are you relying on the message history to maintain state?
  2. Implement Voting: Identify the single most critical decision point in your workflow and wrap it in a first-to-ahead-by-K voter.
  3. Read the Paper: Solving a Million-Step LLM Task with Zero Errors