Inside Modern CPUs: How Your Code Actually Executes
Modern CPUs are marvels of engineering that employ sophisticated techniques to execute our code efficiently. In this deep dive, based on Matt Godbolt’s presentation “What Every Programmer Should Know about How CPUs Work” from GOTO 2024 (January), we’ll explore how modern processors actually execute our code, revealing surprising optimizations that happen beneath the surface. Understanding these concepts can help us write more performant code and debug mysterious performance issues.
The Modern CPU Pipeline: Not Your University Textbook
The traditional “fetch-decode-execute-retire” pipeline you learned in computer architecture class? That’s ancient history. Modern CPUs employ a much more sophisticated approach:
In-order Front End: Instructions are fetched and decoded in the order they appear in the program. The front end handles instruction fetch, decoding, and branch prediction, and its job is to keep the execution units fed.
Out-of-order Execution Engine: Once decoded, instructions enter a complex scheduling system that analyzes dependencies and executes instructions when their inputs are ready, not necessarily in program order. This allows the CPU to work around stalls by executing independent instructions that come later in the program while waiting for data or memory operations.
In-order Retirement: Although execution happens out of order, results must be committed to architectural state (registers and memory) in the original program order to maintain correct behavior. This component ensures that speculative execution can be rolled back if branch predictions were wrong.
This architecture allows for remarkable performance improvements through parallel execution while maintaining the illusion of sequential processing. The genius of this design is that programmers can write straightforward sequential code while the CPU handles complex parallelization automatically. Let’s dive deeper into how this works.
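To get a feel for what the out-of-order engine buys us, consider a hedged sketch (illustrative C++, not code from the talk): summing an array with a single accumulator creates one long dependency chain, while four independent accumulators give the scheduler work it can overlap. The function names and the unroll factor of four are arbitrary choices for illustration; real speedups depend on the compiler and the hardware.

#include <cstddef>

// One long dependency chain: each addition must wait for the previous one.
long long sum_single(const int* data, size_t n) {
    long long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += data[i];
    return sum;
}

// Four independent chains: the out-of-order engine can keep several
// additions in flight at once because the partial sums don't depend on
// each other.
long long sum_unrolled(const int* data, size_t n) {
    long long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    for (; i < n; i++)  // handle any leftover elements
        s0 += data[i];
    return s0 + s1 + s2 + s3;
}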
Branch Prediction: The CPU’s Crystal Ball
Before even starting to execute your code, modern CPUs try to predict which instructions they’ll need next. Why? Because the pipeline is dozens of cycles deep, and waiting to know for sure would waste precious time. It’s like a mining conveyor belt that’s half a mile long – you can’t wait until the end to change direction.
The branch predictor uses sophisticated techniques to guess:
- A Branch Target Buffer (BTB) acts like a hashmap of previous branch destinations
- Two-bit counters track the history of branch decisions
- Both local (per-branch) and global branch history inform predictions
The accuracy is remarkable – in typical code, branch predictors guess correctly well over 90% of the time. This matters tremendously for performance.
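To build intuition for the two-bit counter, here is a toy simulation in C++ (a simplified model for illustration, not any real CPU's predictor; the struct and function names are ours). The key property is that the counter only flips its prediction after two surprises in a row, which is exactly why a branch that is almost always taken survives an occasional exception:

#include <vector>

// Toy two-bit saturating counter: states 0-1 predict "not taken",
// states 2-3 predict "taken".
struct TwoBitPredictor {
    int state = 2;  // start weakly "taken"

    bool predict() const { return state >= 2; }

    void update(bool taken) {
        if (taken)  { if (state < 3) state++; }
        else        { if (state > 0) state--; }
    }
};

// Fraction of correct predictions for a given trace of branch outcomes.
double accuracy(const std::vector<bool>& outcomes) {
    TwoBitPredictor p;
    int correct = 0;
    for (bool taken : outcomes) {
        if (p.predict() == taken) correct++;
        p.update(taken);
    }
    return outcomes.empty() ? 0.0 : static_cast<double>(correct) / outcomes.size();
}

Feed this toy predictor a random sequence of outcomes and it hovers around 50% accuracy; feed it long runs of the same outcome and it is almost always right – the same pattern we are about to observe in a real benchmark.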
Branch Prediction in Action: A Python Example
Here’s a counterintuitive demonstration. Consider this simple Python code that sums numbers and tracks those below 128:
total_sum = 0
sum_below_128 = 0
for num in numbers:  # numbers is a list of values 0-255
    total_sum += num
    if num < 128:
        sum_below_128 += num
In an initial, naive test with an unoptimized Python harness, running this over 10 million random numbers took several seconds. But here's the surprising part: sorting the numbers first made it significantly faster, even though we're processing exactly the same values. After tightening the test harness (as shown in the benchmark table below), the absolute times are shorter, but the effect is still clearly measurable. Why does this happen?
With random data, the branch predictor fails about 50% of the time – it can't establish a pattern for the num < 128 check. But with sorted data, it gets into a rhythm:
- First half: Always true (numbers < 128)
- Second half: Always false (numbers ≥ 128)
The CPU’s branch predictor learns this pattern, leading to dramatically fewer pipeline flushes and better performance.
Try It Yourself: Step-by-Step Guide
Let’s walk through how you can reproduce this fascinating branch prediction effect on your own machine:
Set up the environment
First, make sure you have Python and the Linux perf tools installed:
# For Ubuntu/Debian
sudo apt-get update
sudo apt-get install linux-tools-common linux-tools-generic python3

# For Fedora/RHEL
sudo dnf install perf python3
Create the test script
Create a file named branch_prediction.py:
import time
import argparse

def read_data(filename):
    with open(filename, 'r') as f:
        numbers = [int(x) for x in f.read().strip().split('\n')]
    return numbers

def process_numbers(data):
    # Run a few times to warm up the CPU
    for _ in range(3):
        total_sum = sum(data)

    total_sum = 0
    sum_below_128 = 0

    start_time = time.time()
    for num in data:
        total_sum += num
        if num < 128:
            sum_below_128 += num
    end_time = time.time()

    return total_sum, sum_below_128, end_time - start_time

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Test branch prediction effects')
    parser.add_argument('--input', type=str, required=True, help='Input file with numbers')
    parser.add_argument('--sorted', action='store_true', help='Flag to indicate sorted input')
    parser.add_argument('--size', type=int, default=10_000_000, help='Number of elements')
    args = parser.parse_args()

    # Read and process data
    numbers = read_data(args.input)
    total, below, elapsed = process_numbers(numbers)

    # Print results
    print(f"\nResults for {len(numbers):,} elements:")
    print(f"Time: {elapsed:.3f} seconds")
    print(f"Total sum: {total:,}")
    print(f"Sum < 128: {below:,}")
Run the tests
First, generate the test data:
# Generate unsorted data
shuf -i 0-255 -rn 10000000 > numbers_unsorted.txt

# Generate sorted data
sort -n numbers_unsorted.txt > numbers_sorted.txt
Then run the tests:
# Ensure the system is in a consistent state
sudo sh -c 'echo 1 > /proc/sys/kernel/perf_event_paranoid'
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches
sleep 2

# Run each test multiple times
for i in $(seq 1 3); do
    echo "Run $i - Unsorted data:"
    sudo perf stat -e branches,branch-misses python3 branch_prediction.py --input numbers_unsorted.txt
    sleep 2

    echo "Run $i - Sorted data:"
    sudo perf stat -e branches,branch-misses python3 branch_prediction.py --input numbers_sorted.txt
    sleep 2
done
Compare branch prediction stats
Here are the benchmark results comparing sorted vs. unsorted data performance:
Metric | Unsorted Data | Sorted Data | Improvement |
---|---|---|---|
Execution Time | 0.404s | 0.352s | 12.9% faster |
Total Sum | 1,274,877,084 | 1,274,877,084 | identical |
Sum < 128 | 317,595,222 | 317,595,222 | identical |
Branches | 4,742,990,120 | 4,782,560,106 | similar |
Branch Misses | 16,080,100 | 1,438,836 | 91.1% fewer |
Miss Rate | 0.34% | 0.03% | 91.2% better |
Total Time | 1.57s | 1.48s | 5.7% faster |
User Time | 1.27s | 1.18s | 7.1% faster |
System Time | 0.30s | 0.29s | 3.3% faster |

The results clearly demonstrate that:
- Branch prediction is significantly more effective with sorted data
- Miss rate dropped from 0.34% to just 0.03%
- Overall execution time improved by 12.9%
- The same computations (total sum and sum < 128) yield identical results
When comparing results, look for:
- Branch miss rate (branch-misses/branches ratio)
- Execution time differences
- Consistency across runs
You should observe:
- For the num < 128 branch itself, roughly a 50% miss rate with unsorted data versus only a few percent with sorted data
- A much lower overall miss rate in perf's output (as in the table above), because the counters also include every branch the Python interpreter itself executes – but still a dramatic relative difference between the two runs
- Significantly faster execution with sorted data
Experiment further
Try different data sizes and patterns:
# Test with different sizes
sudo perf stat -e branches,branch-misses python3 branch_prediction.py --input numbers_sorted.txt --sorted --size 1000000
sudo perf stat -e branches,branch-misses python3 branch_prediction.py --input numbers_unsorted.txt --size 1000000

# Test with partially sorted data (you'll need to create this file first)
# sort -n numbers_unsorted.txt | head -n 5000000 > numbers_partial.txt
# cat numbers_unsorted.txt | tail -n 5000000 >> numbers_partial.txt
python3 branch_prediction.py --input numbers_partial.txt --size 500000
Note: For the most accurate results:
- Close other CPU-intensive applications
- Run tests multiple times and average the results
- Consider using CPU frequency scaling governor settings if available:
sudo cpupower frequency-set -g performance # If cpupower is installed
Compiler Magic: From Branches to CMOV
When we move to C++, something even more interesting happens. Modern compilers are incredibly sophisticated and can transform our code in surprising ways. Let’s look at the same logic in C++:
int total_sum = 0;
int sum_below_128 = 0;
for (int num : numbers) {
    total_sum += num;
    if (num < 128)
        sum_below_128 += num;
}
Using the Compiler Explorer (godbolt.org), we can see that the compiler completely eliminates the branch! Instead, it:
- Unconditionally calculates what sum_below_128 would be
- Uses a conditional move (CMOV) instruction to either keep or discard the result
This is brilliant because CMOV doesn’t require branch prediction – it executes in constant time regardless of the condition. The compiler made a bet that the cost of always doing the addition is less than the cost of potential branch mispredictions.
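To make the transformation concrete, here is a hedged sketch of the branchless form written back out as C++: the addition is always performed, and a conditional expression selects the result, which compilers typically lower to a CMOV at -O2. The function name is ours, and the exact code generation depends on your compiler and flags – inspect the output in Compiler Explorer to confirm.

#include <utility>
#include <vector>

// Sketch of the branchless form: always do the addition, then select the
// result instead of branching around it.
std::pair<int, int> sums_branchless(const std::vector<int>& numbers) {
    int total_sum = 0;
    int sum_below_128 = 0;
    for (int num : numbers) {
        total_sum += num;
        int candidate = sum_below_128 + num;                      // unconditional add
        sum_below_128 = (num < 128) ? candidate : sum_below_128;  // typically a CMOV
    }
    return {total_sum, sum_below_128};
}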
Memory Access Patterns: The Real Performance Killer
While branch prediction is fascinating, memory access patterns often have an even bigger impact on performance. Modern CPUs have a hierarchy of caches:
Cache Level | Typical Size | Access Time |
---|---|---|
L1 Cache | 32KB | ~1ns |
L2 Cache | 256KB | ~4ns |
L3 Cache | Several MB | ~10ns |
Main Memory | Gigabytes | 80-100ns |
To demonstrate the impact, let’s look at traversing a linked list versus an array:
struct Node {
    int data;
    Node* next;
};

// Array traversal vs Linked List traversal
int sum_array(int* arr, size_t size) {
    int sum = 0;
    for (size_t i = 0; i < size; i++)
        sum += arr[i];
    return sum;
}

int sum_list(Node* head) {
    int sum = 0;
    while (head) {
        sum += head->data;
        head = head->next;
    }
    return sum;
}
With 10 million elements:
- Array traversal: ~11ms
- Sequential linked list: ~24ms
- Random linked list: ~1.3 seconds!
The dramatic slowdown with random linked list traversal (see the construction sketch after this list) occurs because:
- The CPU can’t predict the next memory access location
- Each pointer dereference potentially requires a full memory access
- Cache prefetching becomes ineffective
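If you want to reproduce the comparison, here is a hedged sketch of one way to build the two list layouts; it reuses the Node struct above, and the seed, values, and helper name are illustrative rather than the talk's actual benchmark. The nodes live in one contiguous buffer either way – only the order in which the next pointers chain them differs:

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Build a linked list whose nodes all live in one contiguous vector.
// "sequential" links nodes in memory order; "shuffled" links them in a
// random order, so each hop jumps to an unpredictable address.
Node* build_list(std::vector<Node>& storage, size_t count, bool shuffled) {
    if (count == 0) return nullptr;
    storage.resize(count);

    std::vector<size_t> order(count);
    std::iota(order.begin(), order.end(), 0);          // 0, 1, 2, ...
    if (shuffled) {
        std::mt19937 rng(42);
        std::shuffle(order.begin(), order.end(), rng); // randomize traversal order
    }

    for (size_t i = 0; i < count; i++) {
        storage[order[i]].data = 1;
        storage[order[i]].next = (i + 1 < count) ? &storage[order[i + 1]] : nullptr;
    }
    return &storage[order[0]];
}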
Case Study: Optimizing a Bloom Filter
A perfect example of how understanding CPU behavior can lead to better performance comes from implementing a Bloom filter. A Bloom filter is a probabilistic data structure that can tell you if an element is definitely not in a set or might be in a set. It’s commonly used as a quick rejection filter before expensive operations like disk or network access.
Here’s a key optimization in Bloom filter implementation that demonstrates CPU-level thinking:
// Version 1: Traditional approach with modulo
size_t get_bucket(size_t hash) {
return hash % table_size; // table_size is prime
}
// Version 2: Power-of-two optimization
size_t get_bucket(size_t hash) {
return hash & (table_size - 1); // table_size is power of 2
}
The first version using modulo seems natural - it’s what we typically do with hash tables. However, when we analyze the CPU instructions:
- The modulo operation compiles to a division instruction
- Division on modern CPUs takes 80-100 cycles
- Division operations can’t be pipelined - only one can execute at a time
The second version replaces modulo with a bitwise AND operation. The tradeoff:
- Slightly worse distribution of values (must use power of 2 size)
- But dramatically faster execution (1-2 cycles vs 80-100 cycles)
Using LLVM’s machine code analyzer, we can see the first version taking ~148 cycles per iteration, while the optimized version takes just ~33 cycles. That’s a 4.5x improvement from understanding one CPU characteristic!
This optimization demonstrates a key principle: sometimes accepting slightly worse theoretical properties (like hash distribution) can lead to dramatically better real-world performance.
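To put the mask trick in context, here is a minimal, hedged Bloom filter sketch built around a power-of-two table and the bitwise-AND bucketing shown above. The class name, the hash mixing function, and the choice of three hash functions are illustrative assumptions, not the implementation from the talk:

#include <cstddef>
#include <cstdint>
#include <vector>

class BloomFilter {
public:
    // num_bits must be a power of two so that (hash & mask) is a valid bucket.
    explicit BloomFilter(size_t num_bits) : bits_(num_bits, false), mask_(num_bits - 1) {}

    void add(uint64_t key) {
        for (int i = 0; i < kNumHashes; i++)
            bits_[hash(key, i) & mask_] = true;   // AND instead of modulo
    }

    // false => definitely not present; true => possibly present.
    bool might_contain(uint64_t key) const {
        for (int i = 0; i < kNumHashes; i++)
            if (!bits_[hash(key, i) & mask_])
                return false;
        return true;
    }

private:
    static constexpr int kNumHashes = 3;

    // Illustrative mixing function; a real implementation would use a
    // well-tested hash such as xxHash or MurmurHash.
    static uint64_t hash(uint64_t key, int i) {
        uint64_t x = key + 0x9e3779b97f4a7c15ULL * (i + 1);
        x ^= x >> 33;
        x *= 0xff51afd7ed558ccdULL;
        x ^= x >> 33;
        return x;
    }

    std::vector<bool> bits_;
    size_t mask_;
};

With a power-of-two table, the mask is computed once in the constructor and every add and lookup stays on the fast AND path, never touching the division unit.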
Modern Performance Analysis Tools
Understanding CPU behavior isn’t just theoretical – modern tools let us measure exactly what’s happening:
- perf: Linux’s built-in performance analysis tool that can count CPU events
- VTune: Intel’s comprehensive performance analyzer
- llvm-mca: LLVM Machine Code Analyzer for static analysis
- Compiler Explorer: Matt Godbolt’s invaluable tool for seeing compiler output
These tools can reveal:
- Branch prediction success rates
- Cache miss rates
- Pipeline stalls
- Instruction-level parallelism
Watch the Original Presentation
This article is based on Matt Godbolt’s excellent presentation “What Every Programmer Should Know about How CPUs Work” from GOTO 2024. You can watch an earlier version of this presentation here:
The specific GOTO 2024 recording will be linked here when it becomes publicly available.
The key lesson? CPUs are incredibly sophisticated, but also predictable once you understand their patterns. When writing performance-critical code, remember:
- Profile before optimizing
- Consider memory access patterns
- Trust but verify compiler optimizations
- Remember that predictable code often runs faster
Tools like perf, Compiler Explorer, and VTune make it easier than ever to understand how our code actually runs on modern hardware. While we can usually trust our compilers to make good decisions, understanding these concepts helps us write better code and solve performance problems when they arise.
Further Reading
If you found this article interesting and want to learn more, here are some excellent resources:
- Compiler Explorer - Matt Godbolt’s tool for exploring assembly output of compilers
- Matt Godbolt’s GitHub - Projects and repositories by the creator of Compiler Explorer
- What Every Programmer Should Know About Memory - Ulrich Drepper’s comprehensive guide to memory systems
- Intel® VTune™ Profiler - Documentation for Intel’s performance analysis tool
- LLVM Machine Code Analyzer - Documentation for static analysis of machine code
- Linux perf Examples - Brendan Gregg’s guide to using perf
- A Survey of Techniques for Improving Branch Prediction - Academic paper on branch prediction strategies