AI Agents Don't Crash, They Spend: LLM Cost Control
AI agents don’t crash when they fail—they spend. Standard APM tools won’t save you. Implement Token Tracing with financial circuit breakers or risk five-figure overnight bills.
The 3 AM Wake-Up Call
You’ve deployed an autonomous coding agent to fix linting errors and update dependencies. Initial logs are green. You sleep.
At 3 AM, it hits a permission error publishing a package. Instead of failing gracefully, it hallucinates a fix: “retry with different flags.” It retries. Fails. Reads the longer error log. Adds it to context. Tries again.
while(true) loop.
July 2025: A developer’s Claude Code instance hit a recursion loop, consuming 1.67 billion tokens in 5 hours—$16,000 to $50,000 in a single incident.
Old software crashes. AI agents spend.
This isn’t prompt engineering—it’s financial architecture. Without a Financial Circuit Breaker, you’re playing expensive roulette.
Why Your Dashboards Are Blind
Standard APM tools (Datadog, New Relic) track latency and errors brilliantly. They're terrible at tracking cost.
The Economic Difference
Chatbots (Linear Cost):
- User query → LLM response
- Cost scales with input_tokens + output_tokens → predictable
Agents (Compounding Cost):
- User goal → agent loop: think → act → observe → repeat
- Each step N carries context from steps 1 to N-1
- Unpredictable, explosive growth
The Context Bloat Problem
Step 1: 1,000 tokens → $0.0025
Step 10: 10,000 tokens → $0.025
Step 50: 50,000 tokens → $0.125
Step 100: BANKRUPTCY
Every retry drags the entire conversation history back through the model. Per-call cost grows with every step, so cumulative spend grows at least quadratically, and a "stuck" agent that keeps appending error logs compounds even faster.
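A back-of-envelope sketch makes the compounding concrete, assuming GPT-4o input pricing of $2.50 per million tokens and a context that grows by 1,000 tokens per step:

# Back-of-envelope: per-call vs. cumulative cost as context grows
PRICE_PER_TOKEN = 2.50 / 1_000_000  # GPT-4o input pricing, USD

total = 0.0
for step in range(1, 101):
    context_tokens = step * 1_000  # the full history is re-sent every step
    total += context_tokens * PRICE_PER_TOKEN
# The call at step 50 costs only $0.125, but cumulative spend is already
# ~$3.19 by then, and ~$12.63 by step 100, for a single "simple" task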
Your APM sees: 50 successful HTTP 200 responses to OpenAI.
Reality: you just spent $200 achieving nothing.
You need a new metric: Cost-per-Task
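As a sketch (the helper below is illustrative, not from any particular tool), the metric is simply attributed spend divided by completed tasks:

# Cost-per-Task: attribute every dollar to the task that spent it
def cost_per_task(total_cost_usd: float, tasks_completed: int) -> float:
    # A runaway loop completes nothing, so it shows up here as
    # cost-per-task -> infinity instead of a wall of HTTP 200s
    if tasks_completed == 0:
        return float("inf")
    return total_cost_usd / tasks_completed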
The Solution: Token Tracing with Financial Circuit Breakers
Implement cost tracking with OpenTelemetry. Attach dollar values to every span. Kill any session that exceeds its budget.
Architecture Overview
User Request
↓
[Session ID Generated]
↓
Agent Loop:
↓
[Budget Check] ← Redis lookup: current_spend
↓ (if under limit)
[LLM Call] → Calculate tokens → Log cost → Update Redis
↓
[Action Execution]
↓
[Budget Check] ← if over limit → KILL + Log + Alert
↓
Loop or Exit
Implementation: Token Tracing Middleware
# Runnable sketch: cost-tracking wrapper for every LLM call
# (OpenAI Python SDK v1 + OpenTelemetry + Redis)
import redis
from openai import OpenAI
from opentelemetry import trace

tracer = trace.get_tracer("agent.cost")
client = OpenAI()
r = redis.Redis(decode_responses=True)

# 2025 list pricing for GPT-4o: $2.50 / 1M input, $10.00 / 1M output
PRICE_IN_PER_M, PRICE_OUT_PER_M = 2.50, 10.00

def llm_call(prompt, session_id, model="gpt-4o"):
    # 1. Start an OpenTelemetry span for this call
    with tracer.start_as_current_span("llm_call") as span:
        # 2. Execute the LLM call
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        # 3. Token counts come straight from the API response
        #    (tiktoken works too if you need pre-flight estimates)
        input_tokens = response.usage.prompt_tokens
        output_tokens = response.usage.completion_tokens
        # 4. Convert tokens to dollars
        cost = (input_tokens / 1e6) * PRICE_IN_PER_M \
             + (output_tokens / 1e6) * PRICE_OUT_PER_M
        # 5. Tag the span with financial metadata
        span.set_attribute("gen_ai.usage.input_tokens", input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", output_tokens)
        span.set_attribute("gen_ai.usage.cost_usd", cost)
        span.set_attribute("session_id", session_id)
        # 6. Update the running session spend in Redis
        r.incrbyfloat(f"session:{session_id}:cost", cost)
    return response
Implementation: Financial Circuit Breaker
# Runnable sketch: budget enforcement middleware
import logging
import redis

r = redis.Redis(decode_responses=True)
logger = logging.getLogger("agent.budget")

class BudgetExhausted(Exception):
    """Raised when a session hits its hard spending cap."""

class FinancialCircuitBreaker:
    def __init__(self, max_cost_per_session=5.00):
        self.max_cost = max_cost_per_session

    def check_budget(self, session_id):
        # Fast Redis lookup of the running spend for this session
        current_spend = float(r.get(f"session:{session_id}:cost") or 0.0)
        if current_spend >= self.max_cost:
            # HARD STOP: log and kill
            logger.critical(
                "BUDGET_EXCEEDED session=%s spent=%.2f limit=%.2f",
                session_id, current_spend, self.max_cost,
            )
            # Surface budget exhaustion to the agent
            raise BudgetExhausted(
                f"Session exceeded ${self.max_cost:.2f} limit. "
                f"Spent: ${current_spend:.2f}"
            )
        return current_spend

# Usage in the agent loop
# (task_complete, prompt, execute come from your agent framework)
breaker = FinancialCircuitBreaker(max_cost_per_session=5.00)
while not task_complete:
    breaker.check_budget(session_id)  # ← kill switch fires here
    response = llm_call(prompt, session_id)
    action = execute(response)
Drop-In Tools (Less Code, Same Result)
Option 1: OpenLIT (OTel Native)
import openlit
openlit.init(otlp_endpoint="http://localhost:4318")
# Auto-instruments OpenAI, LangChain, Anthropic
# Spans automatically include gen_ai.usage.cost
Option 2: Arize Phoenix (Visual Debugging)
import phoenix as px
px.launch_app() # Local dashboard at localhost:6006
# Real-time visualization of cost accumulation per session
Option 3: LangSmith (LangChain Ecosystem)
from langsmith import Client
client = Client()
# Automatic tracing with cost attribution for LangChain workflows
Advanced Pattern: Budget-Aware Agents
Don’t just kill loops—give agents financial awareness. Inject budget into the system prompt.
# Runnable sketch: budget-aware system prompt
import redis

r = redis.Redis(decode_responses=True)
MAX_BUDGET = 5.00  # keep in sync with the circuit breaker's cap

def build_prompt(task, session_id):
    current_spend = float(r.get(f"session:{session_id}:cost") or 0.0)
    budget_remaining = MAX_BUDGET - current_spend
    return f"""
You are a task execution agent with a FINANCIAL CONSTRAINT.

Budget Remaining: ${budget_remaining:.2f}

Rules:
- Each API call costs money
- If budget drops below $0.50, you MUST wrap up
- Prioritize cheap operations (reading > writing > LLM calls)
- If you cannot complete the task within budget, explain why

Task: {task}
"""
Result: Agent self-regulates. When budget is low, it summarizes progress instead of burning the last dollar on another failed retry.
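Tying the sketches together, a minimal end-to-end loop might look like this (task, is_done(), and run_action() are hypothetical stand-ins for your task input, completion check, and tool executor):

# Hypothetical end-to-end loop: soft budget signal + hard kill switch
import uuid

session_id = str(uuid.uuid4())
breaker = FinancialCircuitBreaker(max_cost_per_session=5.00)

try:
    while True:
        breaker.check_budget(session_id)         # hard stop (raises)
        prompt = build_prompt(task, session_id)  # soft, budget-aware signal
        response = llm_call(prompt, session_id)
        if is_done(response):                    # hypothetical completion check
            break
        run_action(response)                     # hypothetical tool executor
except BudgetExhausted as exc:
    # Kill the agent, not the process: log, alert, return partial work
    logger.critical("agent halted: %s", exc)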
Real-World Pattern: High-Cost Agent Workflows
Unoptimized agent workflows accumulate significant cost through multi-step processes whose token usage spirals with every retry.
Common Failure Modes:
- Complex tasks create compounding cost curves as context windows grow
- Each retry carries the entire conversation history through every step
- Without circuit breakers, a single agent task can silently burn through budgets
Architecture Fixes:
- Financial Circuit Breaker: Hard caps per task attempt (e.g., $5 limit)
- Model Routing: Use cheaper models (Gemini 1.5 Flash, GPT-4o-mini) for intermediate steps and expensive models only for final synthesis (see the routing sketch after this list)
- Context Management: Prune conversation history regularly, keep only essential decision points
These patterns prevent runaway costs by stopping expensive loops before they spiral.
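A minimal routing sketch, assuming your planner can flag which step is the final synthesis (the is_final_synthesis flag is hypothetical):

# Route intermediate steps to a cheap model, synthesis to a strong one
CHEAP_MODEL = "gpt-4o-mini"  # roughly $0.15 / 1M input tokens
STRONG_MODEL = "gpt-4o"      # roughly $2.50 / 1M input tokens

def pick_model(is_final_synthesis: bool) -> str:
    # Intermediate think/act/observe steps rarely need frontier reasoning;
    # save the expensive model for the output the user actually sees
    return STRONG_MODEL if is_final_synthesis else CHEAP_MODEL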
Key Insight: Agents don’t need complete thought histories—they need decision histories. Compress aggressively.
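One way to act on that, sketched with a plain list-of-messages history and a hypothetical summarize() helper (itself one cheap-model call):

# Keep recent turns verbatim, compress everything older into a digest
MAX_RECENT_TURNS = 4

def prune_history(messages: list[dict]) -> list[dict]:
    if len(messages) <= MAX_RECENT_TURNS:
        return messages
    older, recent = messages[:-MAX_RECENT_TURNS], messages[-MAX_RECENT_TURNS:]
    # summarize() is hypothetical: returns a short digest of decisions
    # taken so far, not the full reasoning trail
    digest = {"role": "system",
              "content": "Decisions so far: " + summarize(older)}
    return [digest] + recent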
Conclusion
In 2025, reliability isn’t just uptime—it’s financial survivability. You cannot ship autonomous agents without billing limits.
The shift: From “did it work?” to “what did it cost?”
Traditional monitoring tells you when services are slow or broken. Financial monitoring tells you when services are expensive. Both are required.
Start with a $5 circuit breaker. Your CFO will thank you.
Further Reading:
Tools Mentioned:
- OpenLIT - OTel auto-instrumentation
- Arize Phoenix - Local cost visualization
- LangSmith - LangChain tracing platform