AI Agents Don't Crash, They Spend: LLM Cost Control
AI agents don’t crash when they fail—they spend. Standard APM tools won’t save you. Implement Token Tracing with financial circuit breakers or risk five-figure overnight bills.
The 3 AM Wake-Up Call
You’ve deployed an autonomous coding agent to fix linting errors and update dependencies. Initial logs are green. You sleep.
At 3 AM, it hits a permission error publishing a package. Instead of failing gracefully, it hallucinates a fix: “retry with different flags.” It retries. Fails. Reads the longer error log. Adds it to context. Tries again.
while(true) loop.
July 2025: A developer’s Claude Code instance hit a recursion loop, consuming 1.67 billion tokens in 5 hours—$16,000 to $50,000 in a single incident.
Old software crashes. AI agents spend.
This isn’t prompt engineering—it’s financial architecture. Without a Financial Circuit Breaker, you’re playing expensive roulette.
Why Your Dashboards Are Blind
Standard APM tools (Datadog, New Relic) track latency and errors brilliantly. They're terrible at tracking cost.
The Economic Difference
Chatbots (Linear Cost):
- User query → LLM response
- Cost scales with input_tokens + output_tokens → predictable
Agents (Compounding Cost):
- User goal → agent loop: think → act → observe → repeat
- Each step N carries context from steps 1 to N-1
- Unpredictable, explosive growth
The Context Bloat Problem
Step 1: 1,000 tokens → $0.0025
Step 10: 10,000 tokens → $0.025
Step 50: 50,000 tokens → $0.125
Step 100: BANKRUPTCY
Every retry drags the entire conversation history back through the model. Per-call cost grows with every step, so cumulative spend grows at least quadratically, and a "stuck" agent that keeps appending error logs compounds even faster.
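A back-of-envelope sketch makes the compounding concrete, assuming GPT-4o input pricing of $2.50 per million tokens and a context that grows by 1,000 tokens per step:

# Back-of-envelope: per-call vs. cumulative cost as context grows
PRICE_PER_TOKEN = 2.50 / 1_000_000  # GPT-4o input pricing, USD

total = 0.0
for step in range(1, 101):
    context_tokens = step * 1_000  # the full history is re-sent every step
    total += context_tokens * PRICE_PER_TOKEN
# The call at step 50 costs only $0.125, but cumulative spend is already
# ~$3.19 by then, and ~$12.63 by step 100, for a single "simple" task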
Your APM sees: 50 successful HTTP 200 responses to OpenAI.
Reality: you just spent $200 achieving nothing.
You need a new metric: Cost-per-Task
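As a sketch (the helper below is illustrative, not from any particular tool), the metric is simply attributed spend divided by completed tasks:

# Cost-per-Task: attribute every dollar to the task that spent it
def cost_per_task(total_cost_usd: float, tasks_completed: int) -> float:
    # A runaway loop completes nothing, so it shows up here as
    # cost-per-task -> infinity instead of a wall of HTTP 200s
    if tasks_completed == 0:
        return float("inf")
    return total_cost_usd / tasks_completed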
The Solution: Token Tracing with Financial Circuit Breakers
Implement cost tracking with OpenTelemetry. Attach dollar values to every span. Kill any session that exceeds its budget.
Architecture Overview
User Request
↓
[Session ID Generated]
↓
Agent Loop:
↓
[Budget Check] ← Redis lookup: current_spend
↓ (if under limit)
[LLM Call] → Calculate tokens → Log cost → Update Redis
↓
[Action Execution]
↓
[Budget Check] ← if over limit → KILL + Log + Alert
↓
Loop or Exit
Implementation: Token Tracing Middleware
# Runnable sketch: cost-tracking wrapper for every LLM call
# (OpenAI Python SDK v1 + OpenTelemetry + Redis)
import redis
from openai import OpenAI
from opentelemetry import trace

tracer = trace.get_tracer("agent.cost")
client = OpenAI()
r = redis.Redis(decode_responses=True)

# 2025 list pricing for GPT-4o: $2.50 / 1M input, $10.00 / 1M output
PRICE_IN_PER_M, PRICE_OUT_PER_M = 2.50, 10.00

def llm_call(prompt, session_id, model="gpt-4o"):
    # 1. Start an OpenTelemetry span for this call
    with tracer.start_as_current_span("llm_call") as span:
        # 2. Execute the LLM call
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        # 3. Token counts come straight from the API response
        #    (tiktoken works too if you need pre-flight estimates)
        input_tokens = response.usage.prompt_tokens
        output_tokens = response.usage.completion_tokens
        # 4. Convert tokens to dollars
        cost = (input_tokens / 1e6) * PRICE_IN_PER_M \
             + (output_tokens / 1e6) * PRICE_OUT_PER_M
        # 5. Tag the span with financial metadata
        span.set_attribute("gen_ai.usage.input_tokens", input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", output_tokens)
        span.set_attribute("gen_ai.usage.cost_usd", cost)
        span.set_attribute("session_id", session_id)
        # 6. Update the running session spend in Redis
        r.incrbyfloat(f"session:{session_id}:cost", cost)
    return response
Implementation: Financial Circuit Breaker
# Runnable sketch: budget enforcement middleware
import logging
import redis

r = redis.Redis(decode_responses=True)
logger = logging.getLogger("agent.budget")

class BudgetExhausted(Exception):
    """Raised when a session hits its hard spending cap."""

class FinancialCircuitBreaker:
    def __init__(self, max_cost_per_session=5.00):
        self.max_cost = max_cost_per_session

    def check_budget(self, session_id):
        # Fast Redis lookup of the running spend for this session
        current_spend = float(r.get(f"session:{session_id}:cost") or 0.0)
        if current_spend >= self.max_cost:
            # HARD STOP: log and kill
            logger.critical(
                "BUDGET_EXCEEDED session=%s spent=%.2f limit=%.2f",
                session_id, current_spend, self.max_cost,
            )
            # Surface budget exhaustion to the agent
            raise BudgetExhausted(
                f"Session exceeded ${self.max_cost:.2f} limit. "
                f"Spent: ${current_spend:.2f}"
            )
        return current_spend

# Usage in the agent loop
# (task_complete, prompt, execute come from your agent framework)
breaker = FinancialCircuitBreaker(max_cost_per_session=5.00)
while not task_complete:
    breaker.check_budget(session_id)  # ← kill switch fires here
    response = llm_call(prompt, session_id)
    action = execute(response)
Drop-In Tools (Less Code, Same Result)
Option 1: OpenLIT (OTel Native)
import openlit
openlit.init(otlp_endpoint="http://localhost:4318")
# Auto-instruments OpenAI, LangChain, Anthropic
# Spans automatically include gen_ai.usage.cost
Option 2: Arize Phoenix (Visual Debugging)
import phoenix as px
px.launch_app() # Local dashboard at localhost:6006
# Real-time visualization of cost accumulation per session
Option 3: LangSmith (LangChain Ecosystem)
from langsmith import Client
client = Client()
# Automatic tracing with cost attribution for LangChain workflows
Advanced Pattern: Budget-Aware Agents
Don’t just kill loops—give agents financial awareness. Inject budget into the system prompt.
# Runnable sketch: budget-aware system prompt
import redis

r = redis.Redis(decode_responses=True)
MAX_BUDGET = 5.00  # keep in sync with the circuit breaker's cap

def build_prompt(task, session_id):
    current_spend = float(r.get(f"session:{session_id}:cost") or 0.0)
    budget_remaining = MAX_BUDGET - current_spend
    return f"""
You are a task execution agent with a FINANCIAL CONSTRAINT.

Budget Remaining: ${budget_remaining:.2f}

Rules:
- Each API call costs money
- If budget drops below $0.50, you MUST wrap up
- Prioritize cheap operations (reading > writing > LLM calls)
- If you cannot complete the task within budget, explain why

Task: {task}
"""
Result: Agent self-regulates. When budget is low, it summarizes progress instead of burning the last dollar on another failed retry.
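Tying the sketches together, a minimal end-to-end loop might look like this (task, is_done(), and run_action() are hypothetical stand-ins for your task input, completion check, and tool executor):

# Hypothetical end-to-end loop: soft budget signal + hard kill switch
import uuid

session_id = str(uuid.uuid4())
breaker = FinancialCircuitBreaker(max_cost_per_session=5.00)

try:
    while True:
        breaker.check_budget(session_id)         # hard stop (raises)
        prompt = build_prompt(task, session_id)  # soft, budget-aware signal
        response = llm_call(prompt, session_id)
        if is_done(response):                    # hypothetical completion check
            break
        run_action(response)                     # hypothetical tool executor
except BudgetExhausted as exc:
    # Kill the agent, not the process: log, alert, return partial work
    logger.critical("agent halted: %s", exc)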
Real-World Pattern: High-Cost Agent Workflows
Unoptimized agent workflows accumulate significant cost through multi-step processes whose token usage spirals with every retry.
Common Failure Modes:
- Complex tasks create compounding cost curves as context windows grow
- Each retry carries the entire conversation history through every step
- Without circuit breakers, a single agent task can silently burn through budgets
Architecture Fixes:
- Financial Circuit Breaker: Hard caps per task attempt (e.g., $5 limit)
- Model Routing: Use cheaper models (Gemini 1.5 Flash, GPT-4o-mini) for intermediate steps and expensive models only for final synthesis (see the routing sketch after this list)
- Context Management: Prune conversation history regularly, keep only essential decision points
These patterns prevent runaway costs by stopping expensive loops before they spiral.
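A minimal routing sketch, assuming your planner can flag which step is the final synthesis (the is_final_synthesis flag is hypothetical):

# Route intermediate steps to a cheap model, synthesis to a strong one
CHEAP_MODEL = "gpt-4o-mini"  # roughly $0.15 / 1M input tokens
STRONG_MODEL = "gpt-4o"      # roughly $2.50 / 1M input tokens

def pick_model(is_final_synthesis: bool) -> str:
    # Intermediate think/act/observe steps rarely need frontier reasoning;
    # save the expensive model for the output the user actually sees
    return STRONG_MODEL if is_final_synthesis else CHEAP_MODEL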
Key Insight: Agents don’t need complete thought histories—they need decision histories. Compress aggressively.
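One way to act on that, sketched with a plain list-of-messages history and a hypothetical summarize() helper (itself one cheap-model call):

# Keep recent turns verbatim, compress everything older into a digest
MAX_RECENT_TURNS = 4

def prune_history(messages: list[dict]) -> list[dict]:
    if len(messages) <= MAX_RECENT_TURNS:
        return messages
    older, recent = messages[:-MAX_RECENT_TURNS], messages[-MAX_RECENT_TURNS:]
    # summarize() is hypothetical: returns a short digest of decisions
    # taken so far, not the full reasoning trail
    digest = {"role": "system",
              "content": "Decisions so far: " + summarize(older)}
    return [digest] + recent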
Conclusion
In 2025, reliability isn’t just uptime—it’s financial survivability. You cannot ship autonomous agents without billing limits.
The shift: From “did it work?” to “what did it cost?”
Traditional monitoring tells you when services are slow or broken. Financial monitoring tells you when services are expensive. Both are required.
Start with a $5 circuit breaker. Your CFO will thank you.
Further Reading:
Tools Mentioned:
- OpenLIT - OTel auto-instrumentation
- Arize Phoenix - Local cost visualization
- LangSmith - LangChain tracing platform