(Or: How fast should my model be running?)
Marshall Ward
NOAA-GFDL
March 1, 2022
\[W_\text{FLOPS} = W_\text{core} \times N_\text{cores} \times f_\text{cycles}\]
Platform | Theory (GF/s) | Observed |
---|---|---|
x86 (Cascade Lake) | 2150.4 | 2145.3 |
Nvidia (Volta) | 7065.6 | 7007.3 |
But are these numbers achievable?
24 cores × 8 (AVX-512) × 2 (FMA) × 2 ports × 2.8 GHz = 2150.4 GF/s
80 SMs × 32 FP64 cores × 2 (FMA) × 1.38 GHz = 7065.6 GF/s
Platform | Observed (GF/s) | Year | Price | TDP (W) |
---|---|---|---|---|
Cascade Lake | 2145.3 | 2019 | $6432 | 240 |
Volta (GPU) | 7007.3 | 2018 | $8999 | 250 |
Intel(R) Xeon(R) Platinum 8274 CPU @ 3.20GHz
Nvidia Volta GV100 (1.38GHz)
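A rough way to read this table, taking the list price and TDP at face value:

\[\frac{2145.3}{240\ \text{W}} \approx 8.9\ \text{GF/s/W} \qquad \frac{7007.3}{250\ \text{W}} \approx 28\ \text{GF/s/W}\]
\[\frac{2145.3}{\$6432} \approx 0.33\ \text{GF/s/\$} \qquad \frac{7007.3}{\$8999} \approx 0.78\ \text{GF/s/\$}\]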
Operation | FLOPs/byte | AI |
\(\text{C}_{ij} = \sum_{k} A_{ik} B_{kj}\) | \(\frac{2n^3}{3n^2 \times \texttt{sizeof}(A)}\) | \(\frac{1}{12}n\) |
\(\phi_{i} = \phi_{i} + \Delta t \ \mathcal{F}_{i}\) | \(\frac{2n}{3n \times \texttt{sizeof} (\phi)}\) | \(\frac{1}{12}\) |
Array computation is proportional to bandwidth
Most solvers are hyperbolic and elliptic, e.g.:
\[\frac{\partial \mathbf{u}}{\partial t} = - \mathbf{u} \cdot \nabla \mathbf{u} - \nabla p + \mathcal{F}\]
\[\nabla^2 p = -\nabla \cdot \left( \mathbf{u} \cdot \nabla \mathbf{u} \right) + \nabla \cdot \mathcal{F}\]
Pressure (Montgomery potential, PV, ...) solver is a (sparse) matrix operation.
Hydrostatic balance eliminates elliptic step:
\[\frac{\partial \mathbf{u}}{\partial t} = - \mathbf{u} \cdot \nabla \mathbf{u} - \nabla p + \mathcal{F}\]
\[\frac{\partial p}{\partial z} = -g \rho\]
\[w \sim \frac{\partial \rho}{\partial t} = ...\]
Many (but not all) climate ocean models are hyperbolic
\(W_\text{FLOPS} \leq I \times B_\text{mem}\)
```c
int main() {
    /* ... */
    start(timer);
    for (r = 0; r < r_max; r++) {
        for (int i = 0; i < n; i++)
            kernel(i, a, b, x, y);

        // An impossible branch to block loop interchange
        if (y[0] < 0.) dummy(a, b, x, y);
    }
    stop(timer);
}

void kernel(int i, double a, double b, double *x, double *y) {
    y[i] = a * x[i] + y[i];
}
```
Apply an artificially high iteration count, but let the compiler construct the kernel.
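The `start`/`stop` timer above is left abstract; a minimal sketch of one possible implementation (assuming a POSIX `clock_gettime` timer, not necessarily what was used for these measurements):

```c
#include <time.h>

/* Hypothetical stand-ins for the elided start(timer)/stop(timer) calls. */
static struct timespec t0, t1;

static void start_timer(void) { clock_gettime(CLOCK_MONOTONIC, &t0); }

/* Returns elapsed seconds since start_timer(). */
static double stop_timer(void) {
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
}
```

The FLOP rate then follows from dividing the kernel's operation count (2 × n × r_max for the kernel above) by the elapsed time.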
\[W = \text{min}(W_\text{max}, I \times B_\text{mem})\]
For our codes, often \(W = I \times B_\text{mem}\)
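Evaluated in C, the bound is trivial; the "Theory" columns in the tables below are this expression with the appropriate \(I\) and \(B_\text{mem}\) (a sketch, with illustrative names):

```c
/* Roofline bound: attainable FLOP rate is capped either by peak compute
 * (w_max, GF/s) or by memory traffic (intensity in FLOPs/byte times
 * bandwidth in GB/s), whichever is smaller. */
double roofline(double w_max, double intensity, double bandwidth) {
    double w_mem = intensity * bandwidth;
    return (w_mem < w_max) ? w_mem : w_max;
}
```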
Platform | BW (GB/s) |
---|---|
x86 (cache) | 8600 |
x86 (RAM) | 140 |
Volta (RAM) | 900 |
Cache BW is across all cores!
Platform | Speed (GHz) | Width (B) | BW (GB/s) |
---|---|---|---|
x86 cache | 2.8* | 128* | 8602 |
x86 RAM | 2.933 | 8 × 6 | 141 |
Volta RAM | 0.877 | 1024 | 898 |
Three 64 B + AGU ports, but only two AGUs are usable
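Spelling out the products behind this table (the cache row also carries the 24-core factor):

\[24 \times 2.8\ \text{GHz} \times 128\ \text{B} \approx 8602\ \text{GB/s}\]
\[2.933\ \text{GT/s} \times 6 \times 8\ \text{B} \approx 141\ \text{GB/s}\]
\[0.877\ \text{GHz} \times 1024\ \text{B} \approx 898\ \text{GB/s}\]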
\[\phi^{n+1}_i = \phi^n_i + \Delta t \left( \frac{\partial \phi}{\partial t} \right)_i \qquad I = \frac{2}{3\times 8} = \frac{1}{12}\]
y[i] = y[i] + a * x[i]
Platform | Theory (GF/s) | Observed | % |
---|---|---|---|
x86 (Cache) | 716.8 | 715.1 | 99.7 |
x86 (RAM) | 11.7 | 9.5 | 80.8 |
GPU | 74.8 | 67.3 | 90.0 |
RAM efficiencies of 80-90% are common
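The theory column is simply \(I \times B_\text{mem}\) with \(I = \tfrac{1}{12}\):

\[\tfrac{1}{12} \times 8602 \approx 716.8 \qquad \tfrac{1}{12} \times 141 \approx 11.7 \qquad \tfrac{1}{12} \times 898 \approx 74.8 \quad \text{(GF/s)}\]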
\[\phi_{i+\tfrac{1}{2}} = \frac{1}{2} \left( \phi_i + \phi_{i+1} \right) \qquad I = \frac{2}{2\times 8} = \frac{1}{8}\]
y[i] = a * (x[i] + x[i+1])
Platform | Theory (GF/s) | Observed | % |
---|---|---|---|
x86 (Cache) | 1075.3 | 781.4 | 72.7 |
x86 (RAM) | 17.6 | 9.4 | 53.4 |
GPU | 112.3 | 72.4 | 64.5 |
Why do the observed rates fall so far short of the estimates?
Unlike the Euler step, the target array is not on the RHS:
y[i] = a * x[i] + a * x[i+1]
Platform | Theory (GF/s) | Observed | % |
---|---|---|---|
x86 (Cache) | 1075.3 | 781.4 | 72.7 |
x86 (RAM) | 11.7 | 9.4 | 80.1 |
GPU | 74.8 | 72.4 | 96.8 |
For RAM writes, one implicit read: \(I_\text{RAM} = \frac{1}{12}\)
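With that implicit read included, the memory-side bound drops to \(\tfrac{1}{12} B_\text{mem}\), which reproduces the theory values above:

\[\tfrac{1}{12} \times 141 \approx 11.7 \qquad \tfrac{1}{12} \times 898 \approx 74.8 \quad \text{(GF/s)}\]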
\[\phi^{n+1}_i = \phi^n_i + \frac{\Delta t}{\Delta x^2} \left(\phi^n_{i+1} - 2 \phi^n_i + \phi^n_{i-1} \right)\]
y[i] = a * x[i-1] + b * x[i] + a * x[i+1]
x[i] = y[i]
4 FLOPs (amortized) per 4×8 bytes: \(I = \frac{1}{8}\)
RAM-bound: \(I_\text{RAM} = \frac{1}{10}\)
Is it four or five operations?
\[\phi^{n+1}_i = \phi^n_i + \frac{\Delta t}{\Delta x^2} \left(\phi^n_{i+1} - 2 \phi^n_i + \phi^n_{i-1} \right)\]
y[i] = x[i] + a * (x[i-1] - 2 * x[i] + x[i+1])
\[\phi^{n+1}_i = \left( 1 - 2 \frac{\Delta t}{\Delta x^2} \right) \phi^n_i + \frac{\Delta t}{\Delta x^2} \left(\phi^n_{i+1} + \phi^n_{i-1} \right)\]
y[i] = a * x[i] + b * (x[i-1] + x[i+1])
Do we follow parentheses? Ask the compiler!
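Counting literally, with no FMA contraction, the first form is five operations (the multiply by 2, two adds inside the parentheses, the multiply by \(a\), and the outer add), while the rearranged form is four (two multiplies and two adds, i.e. two FMAs); the \(I = \tfrac{1}{8}\) estimate above uses the four-FLOP count.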
Platform | +1 (GF/s) | +8 (GF/s) | Theory (GF/s) |
---|---|---|---|
x86 (Cache) | 499.9 | 690.7 | 1075.2 |
x86 (RAM) | 10.8 | 10.8 | 17.6 |
GPU | 81.1 | 81.1 | 112.3 |
The single-element (+1) displacement has a performance penalty
Platform | Theory (GF/s) | Observed | % |
---|---|---|---|
x86 (Cache) | 1075.2 | 690.7 | 64.2 |
x86 (RAM) | 14.1 | 10.8 | 76.7 |
GPU | 89.9 | 81.1 | 90.2 |
Significant drop in cache performance
Stage | Observed | Theory | % |
---|---|---|---|
\(y \leftarrow x + \alpha \nabla^2 x\) | 1085 GF/s | 1434 GF/s | 75.7 |
\(x \leftarrow y\) | 9145 GB/s | 9216* GB/s | 99.2 |
S1 + S2 | 691 GF/s | 1075 GF/s | 64.2 |
Restarting on a new array incurs a performance penalty
NOTE: L1⇋RAM transfer obscures this cost!
Why 9216 GB/s?
High FLOPs don't necessarily indicate efficiency
Can we measure the subroutine intensity?
Estimate the arithmetic intensity from the ratio of instruction counts:
\[I \approx \frac{N_\text{FLOPs}} {8 \times \left( N_\text{iLoad} + N_\text{iStore} \right) \times \alpha_\text{vec}}\]
where \(\alpha_\text{vec} = N_\text{FLOPs} / N_\text{iFLOPs}\)
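As C (a sketch of the estimate itself; the counter names are placeholders for whatever the profiling tool reports):

```c
/* Estimate arithmetic intensity from instruction counts:
 *   I ~ N_FLOPs / (8 * (N_iLoad + N_iStore) * alpha_vec)
 * where alpha_vec = N_FLOPs / N_iFLOPs accounts for vector width (and
 * FMA), assuming loads and stores are vectorized to the same width. */
double est_intensity(double n_flops, double n_iflops,
                     double n_iload, double n_istore) {
    double alpha_vec = n_flops / n_iflops;
    return n_flops / (8.0 * (n_iload + n_istore) * alpha_vec);
}
```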
Subroutine | Runtime | Speedup |
---|---|---|
hor_visc | 1.69s → 1.01s | 1.67x |
corAdCalc | 1.11s → 0.64s | 1.74x |
Highest points correspond to recent optimization efforts
(NOTE: Timings are from a different config)
Further Work