Inside Modern CPUs: How Your Code Actually Executes

Modern CPUs are marvels of engineering that employ sophisticated techniques to execute our code efficiently. In this deep dive, based on Matt Godbolt’s presentation “What Every Programmer Should Know about How CPUs Work” from GOTO 2024 (January), we’ll explore how modern processors actually execute our code, revealing surprising optimizations that happen beneath the surface. Understanding these concepts can help us write more performant code and debug mysterious performance issues.

The Modern CPU Pipeline: Not Your University Textbook

The traditional “fetch-decode-execute-retire” pipeline you learned in computer architecture class? That’s ancient history. Modern CPUs employ a much more sophisticated approach:

  1. In-order Front End: Instructions are fetched, decoded, and fed into the execution pipeline sequentially as they appear in the program. This front end is responsible for instruction decoding, branch prediction, and feeding the execution units.

  2. Out-of-order Execution Engine: Once decoded, instructions enter a complex scheduling system that analyzes dependencies and executes instructions when their inputs are ready, not necessarily in program order. This allows the CPU to work around stalls by executing independent instructions that come later in the program while waiting for data or memory operations.

  3. In-order Retirement: Although execution happens out of order, results must be committed to architectural state (registers and memory) in the original program order to maintain correct behavior. This component ensures that speculative execution can be rolled back if branch predictions were wrong.

This architecture allows for remarkable performance improvements through parallel execution while maintaining the illusion of sequential processing. The genius of this design is that programmers can write straightforward sequential code while the CPU handles complex parallelization automatically. Let’s dive deeper into how this works.
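
To get a feel for what the out-of-order engine can exploit, consider a small sketch (an illustration, not an example from the talk): summing an array with a single accumulator forms one long dependency chain, while splitting the work across two independent accumulators gives the scheduler instructions it can keep in flight simultaneously.

#include <cstddef>

// One accumulator: every addition depends on the previous result, so the
// out-of-order engine has little independent work to overlap.
long sum_serial(const int* data, std::size_t n) {
    long sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += data[i];
    return sum;
}

// Two independent accumulators: the additions into s0 and s1 don't depend
// on each other, so they can be in flight at the same time.
long sum_pairwise(const int* data, std::size_t n) {
    long s0 = 0, s1 = 0;
    std::size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += data[i];
        s1 += data[i + 1];
    }
    if (i < n) s0 += data[i];  // odd leftover element
    return s0 + s1;
}

In practice an optimizing compiler will often perform this kind of transformation (and vectorize the loop) on its own, which is part of the point: sequential-looking source code still ends up exploiting the parallel hardware.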

Branch Prediction: The CPU’s Crystal Ball

Before even starting to execute your code, modern CPUs try to predict which instructions they’ll need next. Why? Because the pipeline is dozens of cycles deep, and waiting to know for sure would waste precious time. It’s like a mining conveyor belt that’s half a mile long – you can’t wait until the end to change direction.

The branch predictor uses sophisticated techniques to guess:

  • A Branch Target Buffer (BTB) acts like a hashmap of previous branch destinations
  • Two-bit counters track the history of branch decisions
  • Both local (per-branch) and global branch history inform predictions

The accuracy is remarkable – in typical code, branch predictors achieve over 90% accuracy. This matters tremendously for performance.
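
As a rough sketch of how a two-bit counter behaves (a toy model, not a description of any particular CPU), each branch gets a small saturating counter that only changes its prediction after two consecutive surprises:

// Toy two-bit saturating counter: states 0-1 predict "not taken", 2-3 predict "taken".
struct TwoBitCounter {
    int state = 2;  // start weakly "taken"

    bool predict() const { return state >= 2; }

    void update(bool taken) {
        if (taken) { if (state < 3) ++state; }
        else       { if (state > 0) --state; }
    }
};

This hysteresis is why loop-closing branches predict so well: the single misprediction on a loop's final iteration isn't enough to flip the counter, so the next time the loop runs the predictor is right again from the first iteration.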

Branch Prediction in Action: A Python Example

Here’s a counterintuitive demonstration. Consider this simple Python code that sums numbers and tracks those below 128:

total_sum = 0
sum_below_128 = 0
for num in numbers:  # numbers is a list of values 0-255
    total_sum += num
    if num < 128:
        sum_below_128 += num

In our initial, naive testing with an unoptimized Python harness, running this over 10 million random numbers took several seconds. But here's the surprising part: sorting the numbers first made it significantly faster, even though we're processing exactly the same values. After optimizing the test harness (see the benchmark table below), the effect is still clearly measurable, though the absolute times are shorter. Why does this happen?

With random data, the branch predictor fails about 50% of the time – it can’t establish a pattern for the if num < 128 check. But with sorted data, it gets into a rhythm:

  • First half: Always true (numbers < 128)
  • Second half: Always false (numbers ≥ 128)

The CPU’s branch predictor learns this pattern, leading to dramatically fewer pipeline flushes and better performance.

Try It Yourself: Step-by-Step Guide

Let’s walk through how you can reproduce this fascinating branch prediction effect on your own machine:

  1. Set up the environment. First, make sure you have Python and the Linux perf tools installed:

    # For Ubuntu/Debian
    sudo apt-get update
    sudo apt-get install linux-tools-common linux-tools-generic python3
    
    # For Fedora/RHEL
    sudo dnf install perf python3
    
  2. Create the test script. Save the following as branch_prediction.py:

import time
import argparse

def read_data(filename):
    with open(filename, 'r') as f:
        numbers = [int(x) for x in f.read().strip().split('\n')]
    return numbers

def process_numbers(data):
    # A few warm-up passes so the data is resident in cache before timing
    for _ in range(3):
        total_sum = sum(data)
    
    total_sum = 0
    sum_below_128 = 0
    start_time = time.time()
    
    for num in data:
        total_sum += num
        if num < 128:
            sum_below_128 += num
            
    end_time = time.time()
    return total_sum, sum_below_128, end_time - start_time

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Test branch prediction effects')
    parser.add_argument('--input', type=str, required=True, help='Input file with numbers')
    parser.add_argument('--sorted', action='store_true', help='Flag to indicate sorted input')
    parser.add_argument('--size', type=int, default=10_000_000, help='Number of elements')
    args = parser.parse_args()

    # Read and process data (--size truncates the input for smaller runs)
    numbers = read_data(args.input)[:args.size]
    total, below, elapsed = process_numbers(numbers)

    # Print results
    label = "sorted" if args.sorted else "unsorted"
    print(f"\nResults for {len(numbers):,} {label} elements:")
    print(f"Time:          {elapsed:.3f} seconds")
    print(f"Total sum:     {total:,}")
    print(f"Sum < 128:     {below:,}")
  3. Run the tests. First, generate the test data:

    # Generate unsorted data
    shuf -i 0-255 -rn 10000000 > numbers_unsorted.txt
    
    # Generate sorted data
    sort -n numbers_unsorted.txt > numbers_sorted.txt
    

    Then run the tests:

    # Ensure the system is in a consistent state
    sudo sh -c 'echo 1 > /proc/sys/kernel/perf_event_paranoid'
    sync
    echo 3 | sudo tee /proc/sys/vm/drop_caches
    sleep 2
    
    # Run each test multiple times
    for i in 1 2 3; do
      echo "Run $i - Unsorted data:"
      sudo perf stat -e branches,branch-misses python3 branch_prediction.py --input numbers_unsorted.txt
      sleep 2

      echo "Run $i - Sorted data:"
      sudo perf stat -e branches,branch-misses python3 branch_prediction.py --input numbers_sorted.txt --sorted
      sleep 2
    done
    
  4. Compare branch prediction stats. Here are the benchmark results comparing sorted vs unsorted data performance:

    Metric            Unsorted Data     Sorted Data       Improvement
    Execution Time    0.404s            0.352s            12.9% faster
    Total Sum         1,274,877,084     1,274,877,084     identical
    Sum < 128         317,595,222       317,595,222       identical
    Branches          4,742,990,120     4,782,560,106     similar
    Branch Misses     16,080,100        1,438,836         91.1% fewer
    Miss Rate         0.34%             0.03%             91.2% better
    Total Time        1.57s             1.48s             5.7% faster
    User Time         1.27s             1.18s             7.1% faster
    System Time       0.30s             0.29s             3.3% faster

    The results clearly demonstrate that:

    • Branch prediction is significantly more effective with sorted data
    • Miss rate dropped from 0.34% to just 0.03%
    • Overall execution time improved by 12.9%
    • The same computations (total sum and sum < 128) yield identical results

    When comparing results, look for:

    • Branch miss rate (branch-misses/branches ratio)
    • Execution time differences
    • Consistency across runs

    You should observe:

    • A dramatically lower branch miss rate with sorted data (0.34% vs 0.03% in our runs)
    • Absolute miss rates that look small overall, because the interpreter itself executes billions of mostly well-predicted branches per run; the data-dependent num < 128 check (which mispredicts close to 50% of the time on random data, as described earlier) is only a small fraction of the total
    • Significantly faster execution with sorted data
  5. Experiment further. Try different data sizes and patterns:

    # Test with different sizes
    sudo perf stat -e branches,branch-misses python3 branch_prediction.py --input numbers_sorted.txt --sorted --size 1000000
    sudo perf stat -e branches,branch-misses python3 branch_prediction.py --input numbers_unsorted.txt --size 1000000
    
    # Test with partially sorted data (you'll need to create this file first)
    # sort -n numbers_unsorted.txt | head -n 5000000 > numbers_partial.txt
    # cat numbers_unsorted.txt | tail -n 5000000 >> numbers_partial.txt
    sudo perf stat -e branches,branch-misses python3 branch_prediction.py --input numbers_partial.txt
    

Note: For the most accurate results:

  • Close other CPU-intensive applications
  • Run tests multiple times and average the results
  • Consider using CPU frequency scaling governor settings if available:
    sudo cpupower frequency-set -g performance  # If cpupower is installed
    

Compiler Magic: From Branches to CMOV

When we move to C++, something even more interesting happens. Modern compilers are incredibly sophisticated and can transform our code in surprising ways. Let’s look at the same logic in C++:

int total_sum = 0;
int sum_below_128 = 0;
for(int num : numbers) {
    total_sum += num;
    if(num < 128)
        sum_below_128 += num;
}

Using the Compiler Explorer (godbolt.org), we can see that the compiler completely eliminates the branch! Instead, it:

  1. Unconditionally calculates what sum_below_128 would be
  2. Uses a conditional move (CMOV) instruction to either keep or discard the result

This is brilliant because CMOV doesn’t require branch prediction – it executes in constant time regardless of the condition. The compiler made a bet that the cost of always doing the addition is less than the cost of potential branch mispredictions.
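
If you want to express the same idea branchlessly in the source, a sketch like the following is roughly equivalent to what the compiler generated for us (the exact instruction selection depends on the compiler, target, and optimization flags):

// Branchless formulation: always compute a candidate addend, then select.
// With optimizations enabled this typically becomes a CMOV rather than a jump.
int total_sum = 0;
int sum_below_128 = 0;
for (int num : numbers) {
    total_sum += num;
    sum_below_128 += (num < 128) ? num : 0;
}

That said, hand-writing branchless code is rarely necessary; measuring first and checking the generated assembly in Compiler Explorer is usually the better workflow.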

Memory Access Patterns: The Real Performance Killer

While branch prediction is fascinating, memory access patterns often have an even bigger impact on performance. Modern CPUs have a hierarchy of caches:

Cache Level    Typical Size    Access Time
L1 Cache       32KB            ~1ns
L2 Cache       256KB           ~4ns
L3 Cache       Several MB      ~10ns
Main Memory    Gigabytes       80-100ns

To demonstrate the impact, let’s look at traversing a linked list versus an array:

struct Node {
    int data;
    Node* next;
};

// Array traversal vs Linked List traversal
int sum_array(int* arr, size_t size) {
    int sum = 0;
    for(size_t i = 0; i < size; i++)
        sum += arr[i];
    return sum;
}

int sum_list(Node* head) {
    int sum = 0;
    while(head) {
        sum += head->data;
        head = head->next;
    }
    return sum;
}

With 10 million elements:

  • Array traversal: ~11ms
  • Sequential linked list: ~24ms
  • Random linked list: ~1.3 seconds!

The dramatic slowdown with random linked list traversal occurs because:

  1. The CPU can’t predict the next memory access location
  2. Each pointer dereference potentially requires a full memory access
  3. Cache prefetching becomes ineffective
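
One way to set up the three cases above (a sketch of the benchmark setup, not the exact code from the talk) is to allocate all nodes in one contiguous vector and then link them either in allocation order or in shuffled order, so the data is identical and only the order the pointers walk memory changes:

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Link pre-allocated nodes into a list. With shuffled == false the next
// pointers walk memory in order (prefetcher-friendly); with shuffled == true
// every hop lands at an effectively unpredictable address.
Node* build_list(std::vector<Node>& nodes, bool shuffled) {
    if (nodes.empty()) return nullptr;

    std::vector<std::size_t> order(nodes.size());
    std::iota(order.begin(), order.end(), 0);
    if (shuffled)
        std::shuffle(order.begin(), order.end(), std::mt19937{42});

    for (std::size_t i = 0; i + 1 < order.size(); ++i)
        nodes[order[i]].next = &nodes[order[i + 1]];
    nodes[order.back()].next = nullptr;
    return &nodes[order.front()];
}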

Case Study: Optimizing a Bloom Filter

A perfect example of how understanding CPU behavior can lead to better performance comes from implementing a Bloom filter. A Bloom filter is a probabilistic data structure that can tell you if an element is definitely not in a set or might be in a set. It’s commonly used as a quick rejection filter before expensive operations like disk or network access.

Here’s a key optimization in Bloom filter implementation that demonstrates CPU-level thinking:

// Version 1: Traditional approach with modulo
size_t get_bucket(size_t hash) {
    return hash % table_size;  // table_size is prime
}

// Version 2: Power-of-two optimization
size_t get_bucket(size_t hash) {
    return hash & (table_size - 1);  // table_size is power of 2
}

The first version using modulo seems natural - it’s what we typically do with hash tables. However, when we analyze the CPU instructions:

  1. The modulo operation compiles to a division instruction
  2. Division on modern CPUs takes 80-100 cycles
  3. Division operations can’t be pipelined - only one can execute at a time

The second version replaces modulo with a bitwise AND operation. The tradeoff:

  • Slightly worse distribution of values (must use power of 2 size)
  • But dramatically faster execution (1-2 cycles vs 80-100 cycles)

Using LLVM’s machine code analyzer, we can see the first version taking ~148 cycles per iteration, while the optimized version takes just ~33 cycles. That’s a 4.5x improvement from understanding one CPU characteristic!

This optimization demonstrates a key principle: sometimes accepting slightly worse theoretical properties (like hash distribution) can lead to dramatically better real-world performance.
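
To put the power-of-two trick in context, here is a minimal Bloom filter sketch built around hash & (size - 1) indexing (illustrative only; the double-hashing scheme and sizing here are assumptions, not the implementation from the talk):

#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

class BloomFilter {
public:
    // num_bits is rounded up to a power of two so indexing can use a mask.
    explicit BloomFilter(std::size_t num_bits, int num_hashes = 4)
        : bits_(round_up_pow2(num_bits), false), num_hashes_(num_hashes) {}

    void add(std::uint64_t key) {
        for (int i = 0; i < num_hashes_; ++i)
            bits_[index(key, i)] = true;
    }

    // false means "definitely not present"; true means "possibly present".
    bool might_contain(std::uint64_t key) const {
        for (int i = 0; i < num_hashes_; ++i)
            if (!bits_[index(key, i)]) return false;
        return true;
    }

private:
    static std::size_t round_up_pow2(std::size_t n) {
        std::size_t p = 1;
        while (p < n) p <<= 1;
        return p;
    }

    // Double hashing: h1 + i * h2, indexed with the cheap AND instead of modulo.
    std::size_t index(std::uint64_t key, int i) const {
        std::uint64_t h1 = std::hash<std::uint64_t>{}(key);
        std::uint64_t h2 = (h1 >> 32) | 1;  // force odd so the strides differ
        return (h1 + static_cast<std::uint64_t>(i) * h2) & (bits_.size() - 1);
    }

    std::vector<bool> bits_;
    int num_hashes_;
};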

Modern Performance Analysis Tools

Understanding CPU behavior isn’t just theoretical – modern tools let us measure exactly what’s happening:

  1. perf: Linux’s built-in performance analysis tool that can count CPU events
  2. VTune: Intel's comprehensive performance analyzer
  3. llvm-mca: LLVM Machine Code Analyzer for static analysis
  4. Compiler Explorer: Matt Godbolt’s invaluable tool for seeing compiler output

These tools can reveal:

  • Branch prediction success rates
  • Cache miss rates
  • Pipeline stalls
  • Instruction-level parallelism

Watch the Original Presentation

This article is based on Matt Godbolt's excellent presentation "What Every Programmer Should Know about How CPUs Work" from GOTO 2024. An earlier version of the talk is available online; the specific GOTO 2024 recording will be linked here when it becomes publicly available.

The key lesson? CPUs are incredibly sophisticated, but also predictable once you understand their patterns. When writing performance-critical code, remember:

  1. Profile before optimizing
  2. Consider memory access patterns
  3. Trust but verify compiler optimizations
  4. Remember that predictable code often runs faster

Tools like perf, Compiler Explorer, and VTune make it easier than ever to understand how our code actually runs on modern hardware. While we can usually trust our compilers to make good decisions, understanding these concepts helps us write better code and solve performance problems when they arise.

Further Reading

If you found this article interesting and want to learn more, here are some excellent resources: