THE RELIABILITY TAX: WHY LLMS BREAK DETERMINISTIC SYSTEMS

26/12/2025

Every engineer eventually learns that adding complexity to a reliable system usually makes it less reliable. Yet somehow, we collectively forgot this lesson when it came to AI.

The smart home industry just ran a large-scale experiment on this principle. Millions of users upgraded to LLM-powered voice assistants—Alexa Plus, Gemini for Home—expecting smarter automation. They got the opposite: lights that sometimes don’t turn on, routines that randomly fail, and coffee machines that need multiple attempts to start.

This isn’t a smart home problem. It’s a pattern that will repeat across every domain where we replace deterministic systems with probabilistic ones. If you’re building AI-augmented products, understanding why this happened—and what the actual trade-offs are—saves you from repeating it.

The Determinism You Didn’t Know You Had

The old voice assistants weren’t intelligent. They were pattern matchers with a decision tree. You said “turn on kitchen lights” and the system:

Matched “turn on” to an action
Matched “kitchen lights” to a device group
Executed a predefined API call

This is effectively a lookup table driving a state machine. Same input, same output, every time. No interpretation, no creativity, and essentially one visible failure mode: “I didn’t recognise that.”
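The three steps above can be sketched in a few lines. This is a minimal illustration, not how any shipping assistant is actually implemented: the device groups, the action table, and the regular expression are all hypothetical.

```python
import re

# Hypothetical device groups and API actions, for illustration only.
DEVICE_GROUPS = {"kitchen lights": "kitchen_lights", "bedroom lights": "bedroom_lights"}
ACTIONS = {"turn on": "device_on", "turn off": "device_off"}

# One fixed pattern: an action phrase, an optional "the", then the device phrase.
COMMAND_PATTERN = re.compile(r"^(turn on|turn off) (?:the )?(.+)$")


def handle_command(command: str) -> str:
    """Map an utterance to a fixed API call, or reject it outright."""
    match = COMMAND_PATTERN.match(command.strip().lower())
    if match is None:
        return "I didn't recognise that."
    action_phrase, device_phrase = match.groups()
    device_id = DEVICE_GROUPS.get(device_phrase)
    if device_id is None:
        return "I didn't recognise that."
    # A real system would issue the API call here; the sketch returns it as text.
    return f'{ACTIONS[action_phrase]}("{device_id}")'


print(handle_command("Turn on kitchen lights"))
print(handle_command("prepare the house for guests"))
```

The first call maps to device_on("kitchen_lights") on every run. The second is rejected on every run. Both behaviours are boring, and that is the point.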

Developers take this determinism for granted in most systems. Your HTTP server doesn’t sometimes decide to return a 500 for valid requests. Your database doesn’t occasionally interpret SELECT * FROM users as a delete operation. Reliability is assumed.

LLMs break this assumption fundamentally.

Stochastic Systems Can’t Guarantee Outcomes

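When you replace a state machine with an LLM, you’re swapping a lookup table for a probability distribution. The model doesn’t execute your routing logic; it produces a probability distribution over the next token, given everything in its context, and the output is drawn from that distribution one token at a time.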

This matters because:

Temperature isn’t just about creativity. Even at temperature 0, LLM outputs aren’t guaranteed to be deterministic in practice. Floating-point arithmetic isn’t associative, so different hardware, batch sizes, or operation ordering can produce slightly different scores for the same prompt. When two candidate tokens are nearly tied, those tiny differences can flip which one wins. The same prompt can legitimately produce different outputs.

Context windows create state. Unlike a stateless API call, LLM responses depend on conversation history, system prompts, and sometimes even the order of tokens in your request. Two semantically identical requests can get different responses based on subtle phrasing differences.

Function calling adds composition errors. When an LLM needs to call an API, it doesn’t just route to the right endpoint; it has to generate the entire call, meaning the function name and every argument. Every parameter is a prediction, not a lookup.

# State machine approach: deterministic routing
def handle_command(command):
    if match(command, "turn on {device}"):
        return api.device_on(extract_device(command))
    # ... more patterns

# LLM approach: probabilistic generation
def handle_command_llm(command):
    function_call = llm.generate(
        f"Generate API call for: {command}",
        available_functions=[device_on, device_off, ...]
    )
    # function_call might vary between runs
    return execute(function_call)

The state machine will always call device_on("kitchen_lights") for the same input. The LLM might occasionally generate device_on("kitchen light") (singular, with a space), include an extra parameter, or format the device name differently.
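The floating-point point from earlier in this section is easy to check for yourself. The snippet below is a toy illustration of the mechanism, not a claim about any particular model or inference stack: the scores and the greedy_pick helper are made up, and real systems work with thousands of candidate tokens rather than two.

```python
# Floating-point addition is not associative, so the order in which a
# sum is reduced can change the result in the last bits.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)  # False


def greedy_pick(candidate_score: float, fixed_score: float = 0.6) -> int:
    """Greedy (temperature 0) choice between two hypothetical tokens.

    Token 0 has a fixed score; token 1's score is a computed sum whose
    value depends on reduction order. Ties go to the lower index.
    """
    scores = [fixed_score, candidate_score]
    return scores.index(max(scores))


print(greedy_pick(left_to_right))  # 1: the sum lands just above 0.6
print(greedy_pick(right_to_left))  # 0: the sum equals 0.6, the tie goes to token 0
```

Same numbers, same greedy rule, different winner. Once one token differs, everything generated after it can differ too.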

The Capability-Reliability Trade-off

So why did Amazon and Google ship this? Because the upside is genuinely compelling.

The old assistants couldn’t chain actions. They couldn’t handle ambiguity. They couldn’t learn from context. You had to memorise exact phrases and program every automation manually.

LLM-powered assistants can theoretically understand “prepare the house for guests” and figure out what that means for your specific setup—adjusting lights, temperature, music, and running appliance routines without you specifying each step.

This is what makes them valuable. But it comes at a cost that the industry didn’t properly communicate: you’re trading reliability for capability.

The question every product team should ask before adding LLMs to a working system:

Is the expanded capability worth accepting that the same request might fail, say, 5% of the time when it previously failed 0.1% of the time?

For creative tasks—writing, brainstorming, analysis—the answer is obviously yes. For home automation that runs while you’re asleep? Much less clear.
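The gap widens for chained routines. As a rough back-of-the-envelope model, assume a routine needs several calls and each one succeeds independently with the same probability; the routine then succeeds with that probability raised to the number of steps. Independence is a simplifying assumption, and the 5% and 0.1% figures are the illustrative rates from the question above, not measurements.

```python
def routine_success_rate(per_step_success: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent failures."""
    return per_step_success ** steps


for steps in (1, 5, 10):
    with_llm = routine_success_rate(0.95, steps)
    legacy = routine_success_rate(0.999, steps)
    print(f"{steps:>2} steps: 5% failure -> {with_llm:.1%} success, "
          f"0.1% failure -> {legacy:.1%} success")
```

Under these assumptions a five-step routine completes about 77% of the time at a 5% per-step failure rate, against about 99.5% at 0.1%.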

Patterns for Managing the Trade-off

If you’re building AI-augmented systems, here’s what I’ve learned works:

1. Keep Deterministic Fallbacks

Don’t replace your state machine—wrap it. Use the LLM for understanding and the deterministic system for execution.

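def handle_command(command):
    # LLM interprets the intent; it never touches the device API directly
    intent = llm.classify(command, intents=["device_control", "query", ...])

    if intent == "device_control":
        # Deterministic system resolves the device and action, then executes
        device = fuzzy_match(extract_device(command), known_devices)
        action = lookup_action(extract_action(command))
        if device is None or action is None:
            # Assumes the helpers return None when nothing matches;
            # report_unrecognised is a hypothetical helper that fails explicitly
            return report_unrecognised(command)
        return execute_deterministic(device, action)

    # LLM handles open-ended queries
    return llm.respond(command)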

Use LLMs for what they’re good at (understanding natural language) and deterministic systems for what they’re good at (reliable execution).
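The fuzzy_match helper in the wrapper above is left undefined. One way to fill it in, as a sketch, is Python’s standard-library difflib. The device registry, the normalisation step, and the 0.8 cutoff are all assumptions chosen for illustration; a real system would tune them against its own device names.

```python
from difflib import get_close_matches
from typing import Optional

# Hypothetical registry of device IDs, for illustration only.
KNOWN_DEVICES = ["kitchen_lights", "bedroom_lights", "coffee_machine"]


def fuzzy_match(name: str, known_devices: list[str], cutoff: float = 0.8) -> Optional[str]:
    """Resolve a loosely formatted device name to a known ID, or None."""
    # Normalise case and spacing so "Kitchen Light" and "kitchen_light" compare equal.
    normalised = name.strip().lower().replace(" ", "_")
    matches = get_close_matches(normalised, known_devices, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(fuzzy_match("kitchen light", KNOWN_DEVICES))  # kitchen_lights
print(fuzzy_match("garage door", KNOWN_DEVICES))    # None
```

The matching is deterministic: the same string always resolves to the same device or to nothing. The LLM’s near-miss gets corrected by a lookup rather than by another prediction.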

2. Validate Before Executing

Never trust LLM-generated function calls without validation. Build a schema layer that rejects malformed calls:

def execute_function_call(call):
    validated = validate_against_schema(call)
    if validated.has_errors:
        # Ask LLM to retry with specific error feedback
        return retry_with_feedback(call, validated.errors)
    return execute(validated.call)
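Here is a minimal sketch of what validate_against_schema might look like. The call format (a dict with "name" and "arguments"), the SCHEMA table, and the device registry are hypothetical; they are not taken from any vendor’s function-calling API.

```python
from dataclasses import dataclass, field

# Hypothetical schema: each allowed function and its required parameters.
SCHEMA = {
    "device_on": {"device"},
    "device_off": {"device"},
}
KNOWN_DEVICES = {"kitchen_lights", "bedroom_lights", "coffee_machine"}


@dataclass
class ValidationResult:
    call: dict
    errors: list[str] = field(default_factory=list)

    @property
    def has_errors(self) -> bool:
        return bool(self.errors)


def validate_against_schema(call: dict) -> ValidationResult:
    """Collect every reason the generated call cannot be executed as-is."""
    result = ValidationResult(call=call)
    name = call.get("name")
    args = call.get("arguments", {})
    if name not in SCHEMA:
        result.errors.append(f"unknown function: {name!r}")
        return result
    if not isinstance(args, dict):
        result.errors.append("arguments must be an object")
        return result
    expected, provided = SCHEMA[name], set(args)
    for missing in sorted(expected - provided):
        result.errors.append(f"missing parameter: {missing!r}")
    for extra in sorted(provided - expected):
        result.errors.append(f"unexpected parameter: {extra!r}")
    if "device" in args:
        device = args["device"]
        if not isinstance(device, str) or device not in KNOWN_DEVICES:
            result.errors.append(f"unknown device: {device!r}")
    return result


bad_call = {"name": "device_on", "arguments": {"device": "kitchen light", "brightness": 80}}
print(validate_against_schema(bad_call).errors)
```

The error strings are specific on purpose: they can be fed back to the model on a retry or turned into a message for the user. If you do retry, cap the number of attempts so a model that keeps producing the same malformed call cannot loop forever.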

3. Make Failures Explicit

Users tolerate failures they understand. “I couldn’t find a device called ‘kitchen light’” is actionable. Silent failures or wrong actions erode trust quickly.
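One way to enforce this in code is to make every command return a result that always carries a user-facing message, so there is no code path that fails silently. CommandResult and device_not_found are hypothetical names used only for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CommandResult:
    ok: bool
    message: str  # always surfaced to the user, on success and on failure


def device_not_found(requested: str, suggestion: Optional[str] = None) -> CommandResult:
    """Build an actionable failure instead of guessing or staying silent."""
    message = f"I couldn't find a device called '{requested}'."
    if suggestion is not None:
        message += f" Did you mean '{suggestion}'?"
    return CommandResult(ok=False, message=message)


print(device_not_found("kitchen light", "kitchen_lights").message)
```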

4. Measure Reliability as a First-Class Metric

If you’re adding LLMs to a system that was previously deterministic, track the reliability regression explicitly:

Success rate for identical requests
Variance in response time
Error types and frequency

Set thresholds. If reliability drops below acceptable levels, the feature isn’t ready.
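A simple way to get the first number is to replay the same request many times and count how often the result matches the expected call. The sketch below does that against a stand-in handler that fails on a fixed schedule, so the output is reproducible. measure_reliability, make_flaky_handler, and the 0.999 threshold are hypothetical; in practice you would point the harness at your real LLM-backed handler and choose a threshold that fits your product.

```python
import statistics
import time
from collections import Counter
from typing import Callable


def measure_reliability(handler: Callable[[str], str], command: str,
                        expected: str, runs: int = 100) -> dict:
    """Replay one request and summarise outcomes and latency spread."""
    outcomes = Counter()
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        try:
            outcome = handler(command)
        except Exception as exc:  # count crashes by exception type
            outcome = f"error: {type(exc).__name__}"
        latencies.append(time.perf_counter() - start)
        outcomes[outcome] += 1
    return {
        "success_rate": outcomes[expected] / runs,
        "latency_variance": statistics.pvariance(latencies),
        "outcomes": dict(outcomes),  # error types and their frequency
    }


def make_flaky_handler(fail_every: int) -> Callable[[str], str]:
    """Stand-in for an LLM-backed handler: every Nth call returns a malformed device name."""
    calls = 0

    def handler(command: str) -> str:
        nonlocal calls
        calls += 1
        if calls % fail_every == 0:
            return 'device_on("kitchen light")'
        return 'device_on("kitchen_lights")'

    return handler


THRESHOLD = 0.999  # hypothetical release gate
report = measure_reliability(make_flaky_handler(20), "turn on kitchen lights",
                             expected='device_on("kitchen_lights")')
print(report["success_rate"], report["success_rate"] >= THRESHOLD)  # 0.95 False
```

Running a check like this in CI against a fixed set of representative commands turns “it feels flakier” into a number you can gate a release on.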

The Broader Lesson

Smart homes are just the first high-profile example of this pattern. The same trade-off will play out in:

Code assistants that sometimes misinterpret your intent
Customer service bots that occasionally give wrong information
Workflow automation that handles edge cases creatively (and incorrectly)
API gateways that use LLMs for routing decisions

The engineering challenge of the next few years isn’t making LLMs more capable—they’re already remarkably capable. It’s building systems that use LLMs where they add value while preserving the reliability guarantees that users (reasonably) expect.

For more on how models actually process information and where they fail, see “How models think”. For the security implications of these probabilistic systems, check out “AI reasoning breakthroughs and vulnerabilities”.

The companies that figure this out will build products that feel magical. The ones that don’t will ship smart homes that can’t reliably make coffee.
