Compiler Strategies for Cache- and Memory-Friendly Design Across CPUs and GPUs

Oct 18, 2025

Overview

Modern compilers are increasingly memory- and cache-aware: performance is now bounded less by raw FLOPs (floating-point operations per second) than by data movement and cache locality. CPUs and GPUs expose radically different memory hierarchies, and compiler strategies must adapt to each while still leveraging shared principles of locality, reuse, and predictability.

This note outlines:

  1. General cache and memory friendliness rules for modern CPUs.
  2. Equivalent performance rules for modern GPUs.
  3. A hybrid compiler design that targets both architectures from a single intermediate representation (IR).

1. Cache- and Memory-Friendliness Rules for CPUs

Modern CPUs (Central Processing Units) are latency-oriented and hierarchical. Typical cache layers include:

| Level | Typical Size | Access Time | Notes |
|---|---|---|---|
| L1 | 32–64 KB | ~1 ns | Split into instruction (I) and data (D) caches |
| L2 | 256–2048 KB | ~4 ns | Unified, per core |
| L3 | 4–64 MB | ~10–20 ns | Shared across cores |
| DRAM | GBs | ~80–120 ns | Main memory |

1.1 Universal Compiler Principles for CPUs

| Principle | Explanation | Benefit |
|---|---|---|
| Spatial Locality | Access contiguous memory (arrays, SoA layouts) | Fewer cache-line fills |
| Temporal Locality | Reuse recently accessed data | Higher cache hit rate |
| Hot Code Compactness | Keep hot loops < 8–16 KB to fit in L1I (instruction cache) | Fewer I-cache misses |
| Predictable Branching | Favor likely branches as fallthrough | Improves branch-predictor accuracy |
| Prefetch-Friendly Access | Stride-1 loops and predictable indexing | Enables hardware prefetchers |
| Loop Tiling / Blocking | Partition loops so working sets fit in cache | Reduces cache thrashing |
| Data Structure Layout | Use Structure of Arrays (SoA) for vectorized ops | Improves SIMD (Single Instruction, Multiple Data) efficiency |
| Quantization and Packing | Compress frequently accessed values (e.g., thresholds) | Better cache density |
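
To make the layout rules concrete, here is a minimal sketch contrasting an Array-of-Structures layout with a Structure-of-Arrays layout for the same field update. The `ParticleAoS`/`ParticlesSoA` types are illustrative assumptions, not part of this note.

```cpp
#include <vector>
#include <cstddef>

// Hypothetical example: the same per-field update expressed over an
// Array-of-Structures vs. a Structure-of-Arrays layout.

struct ParticleAoS { float x, y, z, mass; };   // interleaved fields

struct ParticlesSoA {                          // one contiguous array per field
    std::vector<float> x, y, z, mass;
};

// AoS: each iteration touches a 16-byte struct, so every cache line carries
// fields the loop never reads; vectorization needs gathers or shuffles.
void scale_x_aos(std::vector<ParticleAoS>& p, float s) {
    for (std::size_t i = 0; i < p.size(); ++i) p[i].x *= s;
}

// SoA: stride-1 access over a single field; cache lines are fully used and
// the loop maps directly onto SIMD lanes.
void scale_x_soa(ParticlesSoA& p, float s) {
    for (std::size_t i = 0; i < p.x.size(); ++i) p.x[i] *= s;
}
```

With the SoA version, one 64-byte cache line serves 16 consecutive 4-byte float iterations, which is exactly the spatial-locality and prefetch-friendliness the table describes.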

Compiler Techniques:

  • Polyhedral loop optimization and cache tiling (e.g., MLIR affine dialect)
  • Function inlining guided by instruction cache budget
  • Profile-guided reordering of hot functions
  • Prefetch insertion (software prefetch or hardware hinting)
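
As a rough picture of what the tiling-oriented passes aim to produce, the following is a hand-written cache-blocking sketch (not compiler output). The tile size `T` is an assumed constant; in practice it would be derived from the target's cache sizes.

```cpp
#include <cstddef>

// Minimal cache-blocking sketch: C += A * B for square N x N row-major
// matrices. T is chosen so roughly three T x T tiles fit in the target
// cache level (assumed constant here).
constexpr std::size_t T = 64;

void matmul_tiled(const float* A, const float* B, float* C, std::size_t N) {
    for (std::size_t ii = 0; ii < N; ii += T)
        for (std::size_t kk = 0; kk < N; kk += T)
            for (std::size_t jj = 0; jj < N; jj += T)
                // Within one tile, the touched rows of B and C stay
                // resident in cache across the inner iterations.
                for (std::size_t i = ii; i < ii + T && i < N; ++i)
                    for (std::size_t k = kk; k < kk + T && k < N; ++k) {
                        const float a = A[i * N + k];
                        for (std::size_t j = jj; j < jj + T && j < N; ++j)
                            C[i * N + j] += a * B[k * N + j];  // stride-1 over j
                    }
}
```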

References:

  • Drepper, What Every Programmer Should Know About Memory (2007)
  • Fog, Instruction Tables (2018)
  • Intel Optimization Manual, vol. 1–3 (2024)

2. Memory Hierarchies and Rules for GPUs

GPUs (Graphics Processing Units) are throughput-oriented, designed for massive parallelism and bandwidth utilization rather than latency minimization. A simplified hierarchy:

| Memory Type | Latency | Scope | Access | Notes |
|---|---|---|---|---|
| Registers | ~1 cycle | Per-thread | Fastest | Holds scalars and temporaries |
| Shared Memory (L1) | 20–30 ns | Per-block | Programmable SRAM | Manually managed cache |
| L2 Cache | ~100 ns | All SMs | Automatic | Shared across chip |
| Global Memory (VRAM/HBM) | 300–800 ns | Device | Coalesced access needed | |
| Host Memory (CPU RAM) | µs | Host | Across PCIe/NVLink | Very high latency |

2.1 Universal GPU Compiler Rules

| Principle | Explanation | Benefit |
|---|---|---|
| Memory Coalescing | Adjacent threads in a warp access consecutive addresses | Full memory-transaction utilization |
| Shared Memory Reuse | Tile reusable data into shared memory | Reduces global memory traffic |
| Warp Uniformity | All threads in a warp follow the same control path | Prevents divergence serialization |
| Predication over Branching | Replace if/else with arithmetic masks | Keeps SIMD lanes active |
| High Occupancy | Use few registers and little shared memory per thread | Hides latency by overlapping warps |
| Compute–Memory Balance | Increase arithmetic intensity (FLOPs/byte) | Hides memory latency |
| Explicit Tiling | Use kernel-level loop tiling and block synchronization | Improves data reuse and scheduling |
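
A minimal CUDA sketch of the first two rows, coalescing and shared-memory reuse, is given below; the kernel names and `TILE` size are assumptions for illustration.

```cuda
// Illustrative sketch of coalesced access and shared-memory tiling.
// Launch block_sum with blockDim.x == TILE (a power of two).
#define TILE 256

__global__ void scale_coalesced(const float* __restrict__ in,
                                float* __restrict__ out, int n, float s) {
    // Adjacent threads read adjacent addresses: one coalesced transaction
    // per warp instead of 32 scattered ones.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = s * in[i];
}

__global__ void block_sum(const float* __restrict__ in, float* out, int n) {
    __shared__ float tile[TILE];                 // manually managed SRAM
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // one coalesced global load
    __syncthreads();

    // All further reuse hits shared memory instead of global memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = tile[0];  // one partial sum per block
}
```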

Compiler Techniques:

  • Kernel fusion and loop tiling (e.g., MLIR’s GPU dialect)
  • Thread-block scheduling with register pressure control
  • Warp-level predication lowering
  • Autotuning for launch geometry (grid, block, warp)
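
The predication rule can be illustrated with a hand-written example. For a branch this small the backend would typically predicate it automatically; the sketch only makes the transformation visible.

```cuda
// Illustrative sketch of predication lowering: a data-dependent branch
// rewritten as a branch-free select so all SIMD lanes stay active.
__global__ void relu_branchy(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (x[i] > 0.0f) y[i] = x[i];   // threads in a warp may diverge here
        else             y[i] = 0.0f;
    }
}

__global__ void relu_predicated(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        y[i] = fmaxf(v, 0.0f);          // arithmetic select, no divergence
    }
}
```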

References:

  • NVIDIA CUDA Best Practices Guide (2025)
  • AMD ROCm Optimization Guide (2024)
  • Volkov, Understanding Latency Hiding on GPUs (2016)

3. Hybrid Compiler Design: Unified IR for CPUs and GPUs

The goal: a single intermediate representation (IR) that expresses control flow, memory access, and parallel structure so it can be lowered to both CPU and GPU backends efficiently.

3.1 Core Design

A unified IR should model computation as structured regions with explicit data dependencies:

```
forest.apply %X {
  tree.begin %t0
    %n0 = node.split.num  fid=3, thr=0.72
    br_if %n0, label %L1, %L2
  %L1: node.leaf  val=0.13
  %L2: node.leaf  val=-0.02
  tree.end
}
```

This high-level control flow can be specialized differently per backend.

3.2 Lowering Strategies

| Phase | CPU Lowering | GPU Lowering |
|---|---|---|
| Control Flow | Emit branchy code; flatten hot paths | Use predicated arithmetic; avoid divergence |
| Parallelism | Vectorize (AVX-512/SVE) across samples | Map threads to a (sample, tree) grid |
| Memory Layout | SoA (Structure of Arrays) | Coalesced global layout + shared-memory tiles |
| Quantization | i16 thresholds for cache density | Same quantization reused to save bandwidth |
| Runtime API (Application Programming Interface) | `void predict_cpu(X, Y)` | `__global__ void predict_gpu(X, Y)` |
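
To make the Control Flow, Memory Layout, and Quantization rows concrete, here is a hand-written sketch (not actual compiler output) of the split node from Section 3.1 after each lowering. The i16 quantization, the `<=` split direction, and the function signatures are illustrative assumptions.

```cuda
// CPU lowering: branchy scalar code, likely path on the fallthrough,
// thresholds stored as quantized i16 for cache density.
inline float predict_cpu_node(const short* xq /* one sample, quantized */,
                              short thr_q /* quantized 0.72 */) {
    if (xq[3] <= thr_q)      // fid=3
        return 0.13f;        // %L1
    return -0.02f;           // %L2
}

// GPU lowering: one thread per sample, feature-major layout so adjacent
// threads read adjacent addresses (coalesced), and the branch lowered to a
// select so the warp never diverges.
__global__ void predict_gpu(const short* __restrict__ Xq,  // [n_features][n_samples]
                            float* __restrict__ Y,
                            int n_samples, short thr_q) {
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s < n_samples) {
        short v = Xq[3 * n_samples + s];        // fid=3, coalesced load
        Y[s] = (v <= thr_q) ? 0.13f : -0.02f;   // predicated select
    }
}
```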

3.3 Scheduling Knobs

| Parameter | Description | Target Effect |
|---|---|---|
| `tile_size` | Loop tile size | Cache fit (CPU) / shared-memory fit (GPU) |
| `packet_width` | SIMD (Single Instruction, Multiple Data) width | Vector utilization |
| `grid`/`block` | Launch geometry | Occupancy, latency hiding |
| `quant_bits` | Quantization level (8/12/16) | Cache density vs. bandwidth tradeoff |
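
One plausible way to surface these knobs is a plain schedule descriptor consumed by both backends. The `Schedule` struct below is an assumption for illustration, not an existing API.

```cpp
// Hypothetical schedule descriptor mirroring the knobs above.
struct Schedule {
    int tile_size;      // loop tile: cache fit on CPU, shared-memory fit on GPU
    int packet_width;   // SIMD width, e.g. 8 or 16 lanes
    int grid, block;    // GPU launch geometry (ignored by the CPU backend)
    int quant_bits;     // 8-, 12-, or 16-bit thresholds
};

// Example starting points a backend or autotuner might use.
constexpr Schedule cpu_default{/*tile*/ 256, /*packet*/ 16, /*grid*/ 0,    /*block*/ 0,   /*quant*/ 16};
constexpr Schedule gpu_default{/*tile*/ 128, /*packet*/ 32, /*grid*/ 1024, /*block*/ 256, /*quant*/ 8};
```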

3.4 Autotuning Layer

An autotuner can then search for the optimal schedule (tile, packet, block, quant) for a given hardware profile:

\[
S^* = \arg\max_{S \in \text{ScheduleSpace}} \text{Throughput}(S, \text{arch})
\]

Similar frameworks exist in TVM, IREE, and Halide.
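
Below is a minimal sketch of that argmax as an exhaustive search, assuming the hypothetical `Schedule` struct from Section 3.3 and a caller-supplied throughput benchmark. Production systems such as TVM, IREE, and Halide replace brute force with cost models and learned search.

```cpp
#include <functional>

// Minimal exhaustive autotuner sketch: evaluate a caller-supplied
// throughput benchmark over a small schedule space and return the argmax.
Schedule autotune(const std::function<double(const Schedule&)>& throughput) {
    Schedule best{};
    double best_tp = -1.0;
    for (int tile : {64, 128, 256, 512})
        for (int packet : {8, 16, 32})
            for (int block : {128, 256, 512})
                for (int quant : {8, 12, 16}) {
                    Schedule s{tile, packet, /*grid*/ 0, block, quant};
                    double tp = throughput(s);   // run or model the candidate
                    if (tp > best_tp) { best_tp = tp; best = s; }
                }
    return best;
}
```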


4. Converging Trends

  1. Hybrid IRs (Intermediate Representations) — MLIR’s multi-dialect model enables CPU/GPU lowering from a shared graph.
  2. Hardware-intent inference — compilers infer data reuse and control-flow divergence statically.
  3. Autotuning as compilation — schedule search is now part of the compiler pipeline.
  4. Quantization-unified optimization — same quantized model runs efficiently across heterogeneous targets.

References

  1. Ulrich Drepper, What Every Programmer Should Know About Memory, Red Hat, 2007.
  2. Intel Corporation, Optimization Reference Manual, 2024.
  3. Agner Fog, Instruction Tables: Lists of Instruction Latencies, Throughputs and Micro-operation Counts, 2018.
  4. Vasily Volkov, Understanding Latency Hiding on GPUs, UC Berkeley, 2016.
  5. NVIDIA Corporation, CUDA Best Practices Guide, 2025.
  6. AMD ROCm Developer Tools, ROCm Optimization Guide, 2024.
  7. MLIR Project, Affine and GPU Dialects, LLVM.org, 2025.
  8. Chen et al., TVM: End-to-End Optimization Stack for Deep Learning, OSDI 2018.
  9. Ragan-Kelley et al., Halide: Decoupling Algorithms from Schedules, CACM 2017.
  10. Google IREE Team, Unified Compiler Infrastructure for ML on CPUs, GPUs, and Accelerators, 2023.