Inside Modern CPUs: How Your Code Actually Executes

Modern CPUs are marvels of engineering that employ sophisticated techniques to execute our code efficiently. In this deep dive, based on Matt Godbolt’s presentation “What Every Programmer Should Know about How CPUs Work” from GOTO 2024 (January), we’ll explore how modern processors actually execute our code, revealing surprising optimizations that happen beneath the surface. Understanding these concepts can help us write more performant code and debug mysterious performance issues.

The Modern CPU Pipeline: Not Your University Textbook

The traditional “fetch-decode-execute-retire” pipeline you learned in computer architecture class? That’s ancient history. Modern CPUs employ a much more sophisticated approach:

  1. In-order Front End: Instructions are fetched, decoded, and fed into the execution pipeline sequentially as they appear in the program. This front end is responsible for instruction decoding, branch prediction, and feeding the execution units.

  2. Out-of-order Execution Engine: Once decoded, instructions enter a complex scheduling system that analyzes dependencies and executes instructions when their inputs are ready, not necessarily in program order. This allows the CPU to work around stalls by executing independent instructions that come later in the program while waiting for data or memory operations.

  3. In-order Retirement: Although execution happens out of order, results must be committed to architectural state (registers and memory) in the original program order to maintain correct behavior. This component ensures that speculative execution can be rolled back if branch predictions were wrong.

This architecture allows for remarkable performance improvements through parallel execution while maintaining the illusion of sequential processing. The genius of this design is that programmers can write straightforward sequential code while the CPU handles complex parallelization automatically. Let’s dive deeper into how this works.
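
To get a feel for what the out-of-order engine can exploit, consider a small sketch (an illustration, not an example from the talk): summing an array with a single accumulator forms one long dependency chain, while splitting the work across two independent accumulators gives the scheduler instructions it can keep in flight simultaneously.

#include <cstddef>

// One accumulator: every addition depends on the previous result, so the
// out-of-order engine has little independent work to overlap.
long sum_serial(const int* data, std::size_t n) {
    long sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += data[i];
    return sum;
}

// Two independent accumulators: the additions into s0 and s1 don't depend
// on each other, so they can be in flight at the same time.
long sum_pairwise(const int* data, std::size_t n) {
    long s0 = 0, s1 = 0;
    std::size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += data[i];
        s1 += data[i + 1];
    }
    if (i < n) s0 += data[i];  // odd leftover element
    return s0 + s1;
}

In practice an optimizing compiler will often perform this kind of transformation (and vectorize the loop) on its own, which is part of the point: sequential-looking source code still ends up exploiting the parallel hardware.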

Branch Prediction: The CPU’s Crystal Ball

Before even starting to execute your code, modern CPUs try to predict which instructions they’ll need next. Why? Because the pipeline is dozens of cycles deep, and waiting to know for sure would waste precious time. It’s like a mining conveyor belt that’s half a mile long – you can’t wait until the end to change direction.

The branch predictor uses sophisticated techniques to guess:

  • A Branch Target Buffer (BTB) acts like a hashmap of previous branch destinations
  • Two-bit counters track the history of branch decisions
  • Both local (per-branch) and global branch history inform predictions

The accuracy is remarkable – in typical code, branch predictors achieve over 90% accuracy. This matters tremendously for performance.
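
As a rough sketch of how a two-bit counter behaves (a toy model, not a description of any particular CPU), each branch gets a small saturating counter that only changes its prediction after two consecutive surprises:

// Toy two-bit saturating counter: states 0-1 predict "not taken", 2-3 predict "taken".
struct TwoBitCounter {
    int state = 2;  // start weakly "taken"

    bool predict() const { return state >= 2; }

    void update(bool taken) {
        if (taken) { if (state < 3) ++state; }
        else       { if (state > 0) --state; }
    }
};

This hysteresis is why loop-closing branches predict so well: the single misprediction on a loop's final iteration isn't enough to flip the counter, so the next time the loop runs the predictor is right again from the first iteration.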

Branch Prediction in Action: A Python Example

Here’s a counterintuitive demonstration. Consider this simple Python code that sums numbers and tracks those below 128:

total_sum = 0
sum_below_128 = 0
for num in numbers:  # numbers is a list of values 0-255
    total_sum += num
    if num < 128:
        sum_below_128 += num

In our initial, naive testing with an unoptimized Python harness, running this over 10 million random numbers took several seconds. But here's the surprising part: sorting the numbers first made it significantly faster, even though we're processing exactly the same values. After optimizing the test harness (see the benchmark table below), the effect is still clearly measurable, though the absolute times are shorter. Why does this happen?

With random data, the branch predictor fails about 50% of the time – it can’t establish a pattern for the if num < 128 check. But with sorted data, it gets into a rhythm:

  • First half: Always true (numbers < 128)
  • Second half: Always false (numbers ≥ 128)

The CPU’s branch predictor learns this pattern, leading to dramatically fewer pipeline flushes and better performance.

Try It Yourself: Step-by-Step Guide

Let’s walk through how you can reproduce this fascinating branch prediction effect on your own machine:

  1. Set up the environment. First, make sure you have Python and the Linux perf tools installed:

    # For Ubuntu/Debian
    sudo apt-get update
    sudo apt-get install linux-tools-common linux-tools-generic python3
    
    # For Fedora/RHEL
    sudo dnf install perf python3
    
  2. Create the test script. Save the following as branch_prediction.py:

import time
import argparse

def read_data(filename):
    with open(filename, 'r') as f:
        numbers = [int(x) for x in f.read().strip().split('\n')]
    return numbers

def process_numbers(data):
    # A few warm-up passes so the data is resident in cache before timing
    for _ in range(3):
        total_sum = sum(data)
    
    total_sum = 0
    sum_below_128 = 0
    start_time = time.time()
    
    for num in data:
        total_sum += num
        if num < 128:
            sum_below_128 += num
            
    end_time = time.time()
    return total_sum, sum_below_128, end_time - start_time

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Test branch prediction effects')
    parser.add_argument('--input', type=str, required=True, help='Input file with numbers')
    parser.add_argument('--sorted', action='store_true', help='Flag to indicate sorted input')
    parser.add_argument('--size', type=int, default=10_000_000, help='Number of elements')
    args = parser.parse_args()

    # Read and process data (--size truncates the input for smaller runs)
    numbers = read_data(args.input)[:args.size]
    total, below, elapsed = process_numbers(numbers)

    # Print results
    label = "sorted" if args.sorted else "unsorted"
    print(f"\nResults for {len(numbers):,} {label} elements:")
    print(f"Time:          {elapsed:.3f} seconds")
    print(f"Total sum:     {total:,}")
    print(f"Sum < 128:     {below:,}")
  3. Run the tests. First, generate the test data:

    # Generate unsorted data
    shuf -i 0-255 -rn 10000000 > numbers_unsorted.txt
    
    # Generate sorted data
    sort -n numbers_unsorted.txt > numbers_sorted.txt
    

    Then run the tests:

    # Ensure the system is in a consistent state
    sudo sh -c 'echo 1 > /proc/sys/kernel/perf_event_paranoid'
    sync
    echo 3 | sudo tee /proc/sys/vm/drop_caches
    sleep 2
    
    # Run each test multiple times
    for i in 1 2 3; do
      echo "Run $i - Unsorted data:"
      sudo perf stat -e branches,branch-misses python3 branch_prediction.py --input numbers_unsorted.txt
      sleep 2

      echo "Run $i - Sorted data:"
      sudo perf stat -e branches,branch-misses python3 branch_prediction.py --input numbers_sorted.txt --sorted
      sleep 2
    done
    
  4. Compare branch prediction stats. Here are the benchmark results comparing sorted vs unsorted data performance:

    Metric            Unsorted Data     Sorted Data       Improvement
    Execution Time    0.404s            0.352s            12.9% faster
    Total Sum         1,274,877,084     1,274,877,084     identical
    Sum < 128         317,595,222       317,595,222       identical
    Branches          4,742,990,120     4,782,560,106     similar
    Branch Misses     16,080,100        1,438,836         91.1% fewer
    Miss Rate         0.34%             0.03%             91.2% better
    Total Time        1.57s             1.48s             5.7% faster
    User Time         1.27s             1.18s             7.1% faster
    System Time       0.30s             0.29s             3.3% faster

    The results clearly demonstrate that:

    • Branch prediction is significantly more effective with sorted data
    • Miss rate dropped from 0.34% to just 0.03%
    • Overall execution time improved by 12.9%
    • The same computations (total sum and sum < 128) yield identical results

    When comparing results, look for:

    • Branch miss rate (branch-misses/branches ratio)
    • Execution time differences
    • Consistency across runs

    You should observe:

    • A dramatically lower branch miss rate with sorted data (0.34% vs 0.03% in our runs)
    • Absolute miss rates that look small overall, because the interpreter itself executes billions of mostly well-predicted branches per run; the data-dependent num < 128 check (which mispredicts close to 50% of the time on random data, as described earlier) is only a small fraction of the total
    • Significantly faster execution with sorted data
  5. Experiment further. Try different data sizes and patterns:

    # Test with different sizes
    sudo perf stat -e branches,branch-misses python3 branch_prediction.py --input numbers_sorted.txt --sorted --size 1000000
    sudo perf stat -e branches,branch-misses python3 branch_prediction.py --input numbers_unsorted.txt --size 1000000
    
    # Test with partially sorted data (you'll need to create this file first)
    # sort -n numbers_unsorted.txt | head -n 5000000 > numbers_partial.txt
    # cat numbers_unsorted.txt | tail -n 5000000 >> numbers_partial.txt
    sudo perf stat -e branches,branch-misses python3 branch_prediction.py --input numbers_partial.txt
    

Note: For the most accurate results:

  • Close other CPU-intensive applications
  • Run tests multiple times and average the results
  • Consider using CPU frequency scaling governor settings if available:
    sudo cpupower frequency-set -g performance  # If cpupower is installed
    

Compiler Magic: From Branches to CMOV

When we move to C++, something even more interesting happens. Modern compilers are incredibly sophisticated and can transform our code in surprising ways. Let’s look at the same logic in C++:

int total_sum = 0;
int sum_below_128 = 0;
for(int num : numbers) {
    total_sum += num;
    if(num < 128)
        sum_below_128 += num;
}

Using the Compiler Explorer (godbolt.org), we can see that the compiler completely eliminates the branch! Instead, it:

  1. Unconditionally calculates what sum_below_128 would be
  2. Uses a conditional move (CMOV) instruction to either keep or discard the result

This is brilliant because CMOV doesn’t require branch prediction – it executes in constant time regardless of the condition. The compiler made a bet that the cost of always doing the addition is less than the cost of potential branch mispredictions.
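
If you want to express the same idea branchlessly in the source, a sketch like the following is roughly equivalent to what the compiler generated for us (the exact instruction selection depends on the compiler, target, and optimization flags):

// Branchless formulation: always compute a candidate addend, then select.
// With optimizations enabled this typically becomes a CMOV rather than a jump.
int total_sum = 0;
int sum_below_128 = 0;
for (int num : numbers) {
    total_sum += num;
    sum_below_128 += (num < 128) ? num : 0;
}

That said, hand-writing branchless code is rarely necessary; measuring first and checking the generated assembly in Compiler Explorer is usually the better workflow.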

Memory Access Patterns: The Real Performance Killer

While branch prediction is fascinating, memory access patterns often have an even bigger impact on performance. Modern CPUs have a hierarchy of caches:

Cache Level    Typical Size    Access Time
L1 Cache       32KB            ~1ns
L2 Cache       256KB           ~4ns
L3 Cache       Several MB      ~10ns
Main Memory    Gigabytes       80-100ns

To demonstrate the impact, let’s look at traversing a linked list versus an array:

struct Node {
    int data;
    Node* next;
};

// Array traversal vs Linked List traversal
int sum_array(int* arr, size_t size) {
    int sum = 0;
    for(size_t i = 0; i < size; i++)
        sum += arr[i];
    return sum;
}

int sum_list(Node* head) {
    int sum = 0;
    while(head) {
        sum += head->data;
        head = head->next;
    }
    return sum;
}

With 10 million elements:

  • Array traversal: ~11ms
  • Sequential linked list: ~24ms
  • Random linked list: ~1.3 seconds!

The dramatic slowdown with random linked list traversal occurs because:

  1. The CPU can’t predict the next memory access location
  2. Each pointer dereference potentially requires a full memory access
  3. Cache prefetching becomes ineffective
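
One way to set up the three cases above (a sketch of the benchmark setup, not the exact code from the talk) is to allocate all nodes in one contiguous vector and then link them either in allocation order or in shuffled order, so the data is identical and only the order the pointers walk memory changes:

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Link pre-allocated nodes into a list. With shuffled == false the next
// pointers walk memory in order (prefetcher-friendly); with shuffled == true
// every hop lands at an effectively unpredictable address.
Node* build_list(std::vector<Node>& nodes, bool shuffled) {
    if (nodes.empty()) return nullptr;

    std::vector<std::size_t> order(nodes.size());
    std::iota(order.begin(), order.end(), 0);
    if (shuffled)
        std::shuffle(order.begin(), order.end(), std::mt19937{42});

    for (std::size_t i = 0; i + 1 < order.size(); ++i)
        nodes[order[i]].next = &nodes[order[i + 1]];
    nodes[order.back()].next = nullptr;
    return &nodes[order.front()];
}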

Case Study: Optimizing a Bloom Filter

A perfect example of how understanding CPU behavior can lead to better performance comes from implementing a Bloom filter. A Bloom filter is a probabilistic data structure that can tell you if an element is definitely not in a set or might be in a set. It’s commonly used as a quick rejection filter before expensive operations like disk or network access.

Here’s a key optimization in Bloom filter implementation that demonstrates CPU-level thinking:

// Version 1: Traditional approach with modulo
size_t get_bucket(size_t hash) {
    return hash % table_size;  // table_size is prime
}

// Version 2: Power-of-two optimization
size_t get_bucket(size_t hash) {
    return hash & (table_size - 1);  // table_size is power of 2
}

The first version using modulo seems natural - it’s what we typically do with hash tables. However, when we analyze the CPU instructions:

  1. The modulo operation compiles to a division instruction
  2. Division on modern CPUs takes 80-100 cycles
  3. Division operations can’t be pipelined - only one can execute at a time

The second version replaces modulo with a bitwise AND operation. The tradeoff:

  • Slightly worse distribution of values (must use power of 2 size)
  • But dramatically faster execution (1-2 cycles vs 80-100 cycles)

Using LLVM’s machine code analyzer, we can see the first version taking ~148 cycles per iteration, while the optimized version takes just ~33 cycles. That’s a 4.5x improvement from understanding one CPU characteristic!

This optimization demonstrates a key principle: sometimes accepting slightly worse theoretical properties (like hash distribution) can lead to dramatically better real-world performance.
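
To put the power-of-two trick in context, here is a minimal Bloom filter sketch built around hash & (size - 1) indexing (illustrative only; the double-hashing scheme and sizing here are assumptions, not the implementation from the talk):

#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

class BloomFilter {
public:
    // num_bits is rounded up to a power of two so indexing can use a mask.
    explicit BloomFilter(std::size_t num_bits, int num_hashes = 4)
        : bits_(round_up_pow2(num_bits), false), num_hashes_(num_hashes) {}

    void add(std::uint64_t key) {
        for (int i = 0; i < num_hashes_; ++i)
            bits_[index(key, i)] = true;
    }

    // false means "definitely not present"; true means "possibly present".
    bool might_contain(std::uint64_t key) const {
        for (int i = 0; i < num_hashes_; ++i)
            if (!bits_[index(key, i)]) return false;
        return true;
    }

private:
    static std::size_t round_up_pow2(std::size_t n) {
        std::size_t p = 1;
        while (p < n) p <<= 1;
        return p;
    }

    // Double hashing: h1 + i * h2, indexed with the cheap AND instead of modulo.
    std::size_t index(std::uint64_t key, int i) const {
        std::uint64_t h1 = std::hash<std::uint64_t>{}(key);
        std::uint64_t h2 = (h1 >> 32) | 1;  // force odd so the strides differ
        return (h1 + static_cast<std::uint64_t>(i) * h2) & (bits_.size() - 1);
    }

    std::vector<bool> bits_;
    int num_hashes_;
};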

Modern Performance Analysis Tools

Understanding CPU behavior isn’t just theoretical – modern tools let us measure exactly what’s happening:

  1. perf: Linux’s built-in performance analysis tool that can count CPU events
  2. VTune: Intel's comprehensive performance analyzer
  3. llvm-mca: LLVM Machine Code Analyzer for static analysis
  4. Compiler Explorer: Matt Godbolt’s invaluable tool for seeing compiler output

These tools can reveal:

  • Branch prediction success rates
  • Cache miss rates
  • Pipeline stalls
  • Instruction-level parallelism

Watch the Original Presentation

This article is based on Matt Godbolt's excellent presentation "What Every Programmer Should Know about How CPUs Work" from GOTO 2024. An earlier version of the talk is available online; the specific GOTO 2024 recording will be linked here when it becomes publicly available.

The key lesson? CPUs are incredibly sophisticated, but also predictable once you understand their patterns. When writing performance-critical code, remember:

  1. Profile before optimizing
  2. Consider memory access patterns
  3. Trust but verify compiler optimizations
  4. Remember that predictable code often runs faster

Tools like perf, Compiler Explorer, and VTune make it easier than ever to understand how our code actually runs on modern hardware. While we can usually trust our compilers to make good decisions, understanding these concepts helps us write better code and solve performance problems when they arise.

Further Reading

If you found this article interesting and want to learn more, here are some excellent resources: