Speculative Duplication and Cost Modeling in MLIR
1. Overview
Speculative duplication duplicates part of a computation—intentionally—to unlock a more profitable transformation (e.g., fusion or vectorization) on a hot path, guided by a cost model. In MLIR (Multi-Level Intermediate Representation), a canonical example is enabling FMA (Fused Multiply-Add) contraction when a producer operation has multiple consumers.
2. Running Example (Floating-Point)
Baseline IR (Intermediate Representation)
%y = arith.mulf %x, %c1 : f32
%tmp = arith.subf %y, %c3 : f32
%z = arith.addf %y, %c2 : f32
We would like to contract the multiply %y = %x * %c1 and the addf into a single math.fma %x, %c1, %c2 for %z (the scalar FMA op lives in MLIR's math dialect; arith has no fma op).
However, %y has multiple uses (%tmp and %z), so rewriting the definition of %y into an fma would silently change the value %tmp sees; under SSA (Static Single Assignment) we must instead give %z its own fused computation.
Speculative Duplication Plan
%y = arith.mulf %x, %c1 : f32      // kept for %tmp
%z = math.fma %x, %c1, %c2 : f32   // duplicate multiply folded into the fma
%tmp = arith.subf %y, %c3 : f32
- Correctness is preserved (each consumer gets the right value).
- %z's path benefits from the fma's shorter dependency chain.
3. Legality Preconditions (Floating-Point)
- Contraction allowed: the operation's fastmath flags must include contract; otherwise rewriting mulf + addf into an fma is not IEEE (Institute of Electrical and Electronics Engineers) 754-legal (two roundings versus one).
- No side-effecting operations between producer and consumer.
- The target must support a real FMA instruction; otherwise the rewrite is lowered back to mul + add and regresses.
Example (in the arith dialect, fastmath flags are attached to individual operations):
%z = arith.addf %y, %c2 fastmath<contract> : f32
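As a sketch of how a pass might test this precondition in C++ (accessor names follow upstream arith conventions; treat this as illustrative rather than a drop-in helper):

#include "mlir/Dialect/Arith/IR/Arith.h"

// Returns true when the add carries the contract fastmath flag,
// i.e., mulf + addf -> fma contraction is permitted on this op.
static bool contractionAllowed(mlir::arith::AddFOp addOp) {
  using mlir::arith::FastMathFlags;
  FastMathFlags flags = addOp.getFastmath();
  return (flags & FastMathFlags::contract) == FastMathFlags::contract;
}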
4. Why the Naive “Total Work” View Fails
A naive comparison sums operation costs. Both versions execute the mulf (kept for %tmp) and the subf, so those terms cancel, and profitability under total work would require
$$ C_{add} > C_{fma} + P_{dup}, $$
which seems impossible: on most hardware $C_{fma} \ge C_{add}$.
That view assumes all work is serialized, counting every instruction as if it blocked all the others.
What Actually Matters
Only the critical-path latency to produce %z affects performance. After duplication, the multiply kept for %tmp is independent of %z's computation; it runs on another path or execution unit, so its cost should not be counted serially.
5. Path-Aware Model
Let $L_{op}$ denote latency (or an effective throughput proxy) for an operation on the %z path.
Baseline (no duplication): the critical path to %z is mulf → addf,
$$ L_{baseline} = L_{mul} + L_{add}. $$
With speculative duplication: the critical path to %z is the fma plus any path penalty,
$$ L_{dup} = L_{fma} + P_{dup,path}. $$
Profitability condition (latency-bound): $$ \boxed{L_{mul} + L_{add} > L_{fma} + P_{dup, path}} $$
- $P_{dup,path}$ = the penalty that actually lands on %z's schedule (e.g., a spill caused by higher register pressure).
- The multiply for %tmp exists in both cases and is off %z's path; it should not be added to %z's serial latency.
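As a minimal sketch, the decision rule can be coded directly; the latency values and the penalty estimate below are illustrative placeholders, not numbers from any real MLIR target model:

#include <cstdio>

// Per-op latencies (in cycles) on the %z path; illustrative only.
struct PathLatencies {
  double mul, add, fma;
};

// pDupPath: the part of the duplication penalty that actually lands
// on %z's schedule (spills, scheduler noise), estimated elsewhere.
static bool isDuplicationProfitable(const PathLatencies &L, double pDupPath) {
  double baseline = L.mul + L.add; // serial mulf -> addf chain
  double dup = L.fma + pDupPath;   // fma plus path penalty
  return baseline > dup;
}

int main() {
  PathLatencies avx512{4.0, 4.0, 4.0}; // rough FP32 numbers (Section 7)
  std::printf("profitable: %s\n",
              isDuplicationProfitable(avx512, 1.0) ? "yes" : "no");
  return 0;
}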
6. Visual Intuition for Path Latency
(a) Baseline — sequential dependency chain
c1 c2
│ │
x ───▶ mulf ───▶ addf ───▶ z
- %z depends on both mulf and addf sequentially.
- Latency to produce %z: L_mul + L_add (fully serial).
(b) After speculative duplication — independent paths
      ┌──▶ mulf ───▶ subf ───▶ tmp
x ────┤
      └──▶ fma ───▶ z
- %z now depends only on the fma.
- %tmp depends on the old mulf (it runs in parallel).
- Latency to produce %z: L_fma + P_dup,path, typically much shorter.
(c) Timing timeline
Baseline:
Time → [ mulf ][ addf ] → z ready at cycle 8
With duplication:
Time → [ mulf ][ subf ] → tmp ready at cycle 8
        [ fma ]          → z ready at cycle 4
- More total instructions executed overall.
- But %z finishes earlier, shortening the critical path.
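The timeline arithmetic is easy to reproduce; a tiny sketch, assuming the uniform 4-cycle latency used above:

#include <cstdio>

int main() {
  const int L = 4;       // illustrative uniform op latency (cycles)
  int zBaseline = L + L; // mulf then addf, fully serial
  int zDup = L;          // z depends only on the fma
  int tmpDup = L + L;    // mulf then subf, on the parallel path
  std::printf("baseline z=%d, dup z=%d, dup tmp=%d\n",
              zBaseline, zDup, tmpDup);
  return 0;
}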
(d) Concept summary
View | Measures | %z ready time | Misleading?
---|---|---|---
Total work | Counts all instructions | Same or higher | Yes (ignores parallelism)
Path latency | Longest dependency chain to %z | Lower | No (the correct view)
7. Typical Numbers (FP32)
- x86 AVX-512: $L_{mul} \approx 3\text{–}4$, $L_{add} \approx 3\text{–}4$, $L_{fma} \approx 4$. The inequality holds if $P_{dup,path} < 2$ cycles.
- NVIDIA GPU (FP32 core): $L_{mul} \approx 4\text{–}6$, $L_{add} \approx 4\text{–}6$, $L_{fma} \approx 4\text{–}6$. Often equal; still beneficial when shortening dependency chains aids scheduling.
- ARM/Apple: $L_{fma} \approx L_{mul}$ and both are small; the inequality often holds for modest $P_{dup,path}$.
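Plugging the conservative AVX-512 lower bounds into the decision rule makes the 2-cycle budget explicit:
$$ 3 + 3 > 4 + P_{dup,path} \quad\Longleftrightarrow\quad P_{dup,path} < 2 \ \text{cycles}. $$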
8. Implementation Sketch
Matching
- Match a mulf feeding an addf that produces %z.
- %y has multiple uses.
- fastmath includes contract.
- The target supports FMA.
Rewriting
Build the fma for %z from the matched operands, keep the original mulf in place for its remaining uses, and replace only %z's defining addf; if the mulf later becomes dead, DCE cleans it up.
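A minimal sketch as an MLIR C++ RewritePattern, assuming the math dialect's math.fma op and upstream arith accessor names; the cost-model query, the commuted (mulf on the right) case, and fastmath-flag propagation to the new op are elided:

#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/Dialect/Math/IR/Math.h"
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

struct SpeculativeFmaContraction : OpRewritePattern<arith::AddFOp> {
  using OpRewritePattern::OpRewritePattern;

  LogicalResult matchAndRewrite(arith::AddFOp addOp,
                                PatternRewriter &rewriter) const override {
    // Match: %z = arith.addf (arith.mulf %x, %c1), %c2.
    auto mulOp = addOp.getLhs().getDefiningOp<arith::MulFOp>();
    if (!mulOp)
      return failure();
    // Legality: the add must permit contraction.
    auto flags = addOp.getFastmath();
    if ((flags & arith::FastMathFlags::contract) !=
        arith::FastMathFlags::contract)
      return failure();
    // Speculative duplication: the multiply is re-materialized inside
    // the fma; the original mulf stays behind for its other uses
    // (%tmp) and is removed by DCE if it ends up dead.
    rewriter.replaceOpWithNewOp<math::FmaOp>(
        addOp, mulOp.getLhs(), mulOp.getRhs(), addOp.getRhs());
    return success();
  }
};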
9. When It’s Worth It
- Latency-bound regions: critical-path shortening dominates extra work.
- Hardware FMA available: FMA throughput ≈ MUL throughput.
- Moderate register pressure: duplication penalty small.
- Fast-math with contraction: legally allowed transformation.
When It Isn’t
- Throughput-bound code with wide ILP (Instruction Level Parallelism).
- High register pressure or spilling (large $P_{dup, path}$).
- Targets that lower FMA back into mul + add.
- Strict IEEE 754 mode without contract.
10. Summary
- Goal: Improve latency on a hot path by fusing multiply and add via duplication.
- Mechanism: duplicate the multiply so one consumer can form an fma.
- Decision rule: apply if $L_{mul} + L_{add} > L_{fma} + P_{dup,path}$.
- Key idea: judge by critical-path latency, not total instruction count.
- Legality: requires the contract fastmath flag and hardware FMA.
11. Takeaway
Speculative duplication in MLIR is not about doing less work; it is about restructuring dependency graphs to shorten critical latency paths. The compiler duplicates an operation only when a path-aware cost model predicts that the shorter dependency chain outweighs any overhead from added instructions, register pressure, or scheduling noise.