Compiler Strategies for Cache- and Memory-Friendly Design Across CPUs and GPUs
Overview
Modern compilers are increasingly memory- and cache-aware, as performance is no longer limited by raw FLOPs (floating-point operations per second) but by data movement and cache locality. CPUs and GPUs expose radically different memory hierarchies, and compiler strategies must adapt to each while still leveraging shared principles of locality, reuse, and predictability.
This note outlines:
- General cache and memory friendliness rules for modern CPUs.
- Equivalent performance rules for modern GPUs.
- A hybrid compiler design that targets both architectures from a single intermediate representation (IR).
1. Cache- and Memory-Friendliness Rules for CPUs
Modern CPUs (Central Processing Units) are latency-oriented and hierarchical. Typical cache layers include:
Level | Typical Size | Access Time | Notes |
---|---|---|---|
L1 | 32–64 KB | ~1 ns | Split into instruction (I) and data (D) caches |
L2 | 256–2048 KB | ~4 ns | Unified per core |
L3 | 4–64 MB | ~10–20 ns | Shared across cores |
DRAM | GBs | ~80–120 ns | Main memory |
1.1 Universal Compiler Principles for CPUs
Principle | Explanation | Benefit |
---|---|---|
Spatial Locality | Access contiguous memory (arrays, SoA layouts) | Fewer cache-line fills |
Temporal Locality | Reuse recently accessed data | Higher cache hit rate |
Hot Code Compactness | Keep hot loops < 8–16 KB to fit in L1I (Instruction Cache) | Fewer ICACHE misses |
Predictable Branching | Favor likely branches as fallthrough | Improves branch predictor accuracy |
Prefetch-Friendly Access | Stride-1 loops and predictable indexing | Enables hardware prefetchers |
Loop Tiling / Blocking | Partition loops so working sets fit in cache | Reduces cache thrash |
Data Structure Layout | Use Structure of Arrays (SoA) for vectorized ops | Improves SIMD (Single Instruction Multiple Data) efficiency |
Quantization and Packing | Compress frequently accessed values (e.g., thresholds) | Better cache density |
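As a small sketch of the SoA layout rule, the two reductions below walk the same `thr` field; the struct and function names are illustrative and not taken from the text:

```cuda
#include <cstddef>

// Illustrative: the same field reduction over an array-of-structures layout
// and a structure-of-arrays layout. With SoA the hot field `thr` is
// contiguous, so every 64-byte cache line delivers 16 useful thresholds
// (no interleaved fid/leaf bytes) and the loop vectorizes cleanly.
struct NodeAoS  { int fid; float thr; float leaf; };                        // interleaved fields
struct NodesSoA { const int* fid; const float* thr; const float* leaf; };   // one array per field

float sum_thresholds_aos(const NodeAoS* nodes, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += nodes[i].thr;   // strided: skips over fid/leaf
    return s;
}

float sum_thresholds_soa(const NodesSoA& nodes, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += nodes.thr[i];   // stride-1, prefetch-friendly
    return s;
}
```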
Compiler Techniques:
- Polyhedral loop optimization and cache tiling (e.g., MLIR affine dialect)
- Function inlining guided by instruction cache budget
- Profile-guided reordering of hot functions
- Prefetch insertion (software prefetch or hardware hinting)
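As a minimal sketch of loop tiling, the blocked matrix multiply below keeps each small sub-problem resident in cache; the `TILE` constant and function name are illustrative and would normally come from autotuning (section 3.4):

```cuda
#include <algorithm>
#include <cstddef>

constexpr std::size_t TILE = 64;  // illustrative; real tile sizes come from tuning

// Hypothetical cache-blocking sketch: the i/k/j loops are partitioned so the
// working set of each (TILE x TILE) block of A, B, and C fits in L1/L2.
// C is assumed zero-initialized by the caller.
void matmul_tiled(const float* A, const float* B, float* C, std::size_t n) {
    for (std::size_t ii = 0; ii < n; ii += TILE)
        for (std::size_t kk = 0; kk < n; kk += TILE)
            for (std::size_t jj = 0; jj < n; jj += TILE)
                // Inner loops touch only three small tiles.
                for (std::size_t i = ii; i < std::min(ii + TILE, n); ++i)
                    for (std::size_t k = kk; k < std::min(kk + TILE, n); ++k) {
                        const float a = A[i * n + k];          // reused across the j loop
                        for (std::size_t j = jj; j < std::min(jj + TILE, n); ++j)
                            C[i * n + j] += a * B[k * n + j];  // stride-1 accesses
                    }
}
```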
References:
- Drepper, What Every Programmer Should Know About Memory (2007)
- Fog, Instruction Tables (2018)
- Intel, Optimization Reference Manual (2024)
2. Memory Hierarchies and Rules for GPUs
GPUs (Graphics Processing Units) are throughput-oriented, designed for massive parallelism and bandwidth utilization rather than latency minimization. A simplified hierarchy:
Memory Type | Latency | Scope | Access | Notes |
---|---|---|---|---|
Registers | ~1 cycle | Per-thread | Fastest | Holds scalars and temporaries |
Shared Memory (L1) | 20–30 ns | Per-block | Programmable SRAM | Manually managed cache |
L2 Cache | ~100 ns | All SMs | Automatic | Shared across chip |
Global Memory (VRAM/HBM) | 300–800 ns | Device | Coalesced access required | Bandwidth-optimized DRAM
Host Memory (CPU RAM) | ~1 µs | Host | Via PCIe/NVLink transfers | Very high latency
2.1 Universal GPU Compiler Rules
Principle | Explanation | Benefit |
---|---|---|
Memory Coalescing | Adjacent threads in a warp access consecutive addresses | Full memory transaction utilization |
Shared Memory Reuse | Tile reusable data into shared memory | Reduces global memory traffic |
Warp Uniformity | All threads in a warp follow same control path | Prevents divergence serialization |
Predication over Branching | Replace if/else with arithmetic masks | Keeps SIMD lanes active |
High Occupancy | Keep per-thread register and shared-memory usage low | Hides latency by overlapping more warps
Compute–Memory Balance | Increase arithmetic intensity (FLOPs/byte) | Hides memory latency |
Explicit Tiling | Use kernel-level loop tiling & block synchronization | Improves data reuse and scheduling |
Compiler Techniques:
- Kernel fusion and loop tiling (e.g., MLIR’s GPU dialect)
- Thread-block scheduling with register pressure control
- Warp-level predication lowering
- Autotuning of launch geometry (grid, block, warp)
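A minimal CUDA sketch of the first two rules, coalesced global loads staged through shared-memory tiles, assuming a TILE × TILE thread block; the kernel is illustrative, not taken from any particular library:

```cuda
#define TILE 16  // illustrative tile / block edge; launch with blockDim = (TILE, TILE)

// Coalesced loads: adjacent threadIdx.x values read adjacent global addresses.
// Shared-memory reuse: each staged element is read TILE times by the block.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n; t += TILE) {
        // Stage one tile of A and one of B into shared memory (coalesced reads).
        As[threadIdx.y][threadIdx.x] = (row < n && t + threadIdx.x < n)
                                     ? A[row * n + t + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (col < n && t + threadIdx.y < n)
                                     ? B[(t + threadIdx.y) * n + col] : 0.0f;
        __syncthreads();
        for (int k = 0; k < TILE; ++k)       // each shared element reused TILE times
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < n && col < n) C[row * n + col] = acc;
}
```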
References:
- NVIDIA CUDA Best Practices Guide (2025)
- AMD ROCm Optimization Guide (2024)
- Volkov, Understanding Latency Hiding on GPUs (2016)
3. Hybrid Compiler Design: Unified IR for CPUs and GPUs
The goal: a single intermediate representation (IR) that expresses control flow, memory access, and parallel structure so it can be lowered to both CPU and GPU backends efficiently.
3.1 Core Design
A unified IR should model computation as structured regions with explicit data dependencies:
```
forest.apply %X {
  tree.begin %t0
    %n0 = node.split.num fid=3, thr=0.72
    br_if %n0, label %L1, %L2
  %L1: node.leaf val=0.13
  %L2: node.leaf val=-0.02
  tree.end
}
```
This high-level control flow can be specialized differently per backend.
3.2 Lowering Strategies
Phase | CPU Lowering | GPU Lowering |
---|---|---|
Control Flow | Emit branchy code; flatten hot paths | Use predicated arithmetic; avoid divergence |
Parallelism | Vectorize (AVX512/SVE) across samples | Map threads to (sample, tree) grid |
Memory Layout | SoA (Structure of Arrays) | Coalesced global layout + shared tiles |
Quantization | i16 thresholds for cache density | Same quantization reused for bandwidth saving |
Runtime API (Application Programming Interface) | void predict_cpu(X, Y) | __global__ void predict_gpu(X, Y) |
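To make the table concrete, here is a hand-written sketch of how the single split node from section 3.1 might lower on each path. It assumes a feature-major (SoA) layout for X and treats the first branch target as the "feature < threshold" leaf; the function bodies are illustrative, not the output of an actual backend:

```cuda
// Hypothetical lowered forms of the one-split tree from section 3.1.

// CPU path: scalar code over a feature-major (SoA) matrix; the sample loop is
// a natural candidate for auto-vectorization.
void predict_cpu(const float* X, float* Y, int n_samples) {
    for (int i = 0; i < n_samples; ++i) {
        float f3 = X[3 * n_samples + i];        // feature 3, sample i (SoA)
        Y[i] = (f3 < 0.72f) ? 0.13f : -0.02f;   // split -> leaf values %L1 / %L2
    }
}

// GPU path: one thread per sample; the split becomes a branch-free mask so all
// lanes in a warp stay active, and consecutive i gives coalesced loads.
__global__ void predict_gpu(const float* X, float* Y, int n_samples) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_samples) return;
    float f3   = X[3 * n_samples + i];            // coalesced across the warp
    float mask = (f3 < 0.72f) ? 1.0f : 0.0f;      // predicate as arithmetic mask
    Y[i] = mask * 0.13f + (1.0f - mask) * (-0.02f);
}
```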
3.3 Scheduling Knobs
Parameter | Description | Target Effect |
---|---|---|
tile_size | Loop tile size | Cache fit (CPU) / Shared mem fit (GPU) |
packet_width | SIMD (Single Instruction Multiple Data) width | Vector utilization |
grid/block | Launch geometry | Occupancy, latency hiding |
quant_bits | Quantization level (8/12/16) | Cache or bandwidth tradeoff |
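One way to thread these knobs through both backends is a plain schedule descriptor; the struct below is an illustrative sketch, not an existing API:

```cuda
#include <cstdint>

// Illustrative schedule descriptor shared by the CPU and GPU lowering paths.
struct Schedule {
    int     tile_size;     // loop tile: L1/L2 fit on CPU, shared-memory fit on GPU
    int     packet_width;  // SIMD lanes per packet on the CPU backend
    dim3    grid;          // GPU launch geometry (ignored by the CPU backend)
    dim3    block;
    uint8_t quant_bits;    // 8, 12, or 16-bit thresholds
};

// Example starting point a tuner might refine for a mid-range GPU:
// Schedule s{ /*tile*/ 64, /*packet*/ 8, dim3(256), dim3(128), /*bits*/ 16 };
```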
3.4 Autotuning Layer
An autotuner can then search for the optimal (tile, packet, block, quant) configuration given a hardware profile:
\[
S^* = \arg\max_{S \in \text{ScheduleSpace}} \text{Throughput}(S, \text{arch})
\]
Similar frameworks exist in TVM, IREE, and Halide.
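A minimal exhaustive-search sketch of that argmax, assuming a caller-supplied `benchmark_throughput` callback that runs the lowered code under a candidate schedule on the target architecture (the callback and the candidate grids are hypothetical):

```cuda
#include <functional>
#include <vector>

// Illustrative schedule search: enumerate a small candidate space and keep the
// schedule with the highest measured throughput. Production tuners (TVM, IREE,
// Halide) prune this space with cost models and learned predictors.
struct Candidate { int tile; int packet; int block; int quant_bits; };

Candidate autotune(const std::function<double(const Candidate&)>& benchmark_throughput) {
    const std::vector<int> tiles   = {32, 64, 128};
    const std::vector<int> packets = {4, 8, 16};
    const std::vector<int> blocks  = {64, 128, 256};
    const std::vector<int> bits    = {8, 12, 16};

    Candidate best{};
    double best_tp = -1.0;
    for (int t : tiles)
        for (int p : packets)
            for (int b : blocks)
                for (int q : bits) {
                    Candidate c{t, p, b, q};
                    double tp = benchmark_throughput(c);  // measured on the target
                    if (tp > best_tp) { best_tp = tp; best = c; }
                }
    return best;
}
```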
4. Converging Trends
- Hybrid IRs (Intermediate Representations) — MLIR’s multi-dialect model enables CPU/GPU lowering from a shared graph.
- Hardware-intent inference — compilers infer data reuse and control-flow divergence statically.
- Autotuning as compilation — schedule search is now part of the compiler pipeline.
- Quantization-unified optimization — same quantized model runs efficiently across heterogeneous targets.
References
- Ulrich Drepper, What Every Programmer Should Know About Memory, Red Hat, 2007.
- Intel Corporation, Optimization Reference Manual, 2024.
- Agner Fog, Instruction Tables: Lists of Instruction Latencies, Throughputs and Micro-operation Counts, 2018.
- Vasily Volkov, Understanding Latency Hiding on GPUs, UC Berkeley, 2016.
- NVIDIA Corporation, CUDA Best Practices Guide, 2025.
- AMD ROCm Developer Tools, ROCm Optimization Guide, 2024.
- MLIR Project, Affine and GPU Dialects, LLVM.org, 2025.
- Chen et al., TVM: An Automated End-to-End Optimizing Compiler for Deep Learning, OSDI 2018.
- Ragan-Kelley et al., Halide: Decoupling Algorithms from Schedules, CACM 2017.
- Google IREE Team, Unified Compiler Infrastructure for ML on CPUs, GPUs, and Accelerators, 2023.