{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": "# Day 16 - Roofline Measurements\n\nHands-on companion for GPU architecture, arithmetic intensity, and roofline estimates. The core checks use NumPy and run anywhere; CUDA timing is optional.\n"
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": "import math, time\nimport numpy as np\nnp.set_printoptions(precision=4, suppress=True)\nprint('numpy', np.__version__)\n",
      "outputs": [],
      "execution_count": null
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": "## 1. Ridge Point\n\nThe ridge is `peak FLOP/s / memory bandwidth`. Left of the ridge, memory bandwidth limits throughput.\n"
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": "def ridge_point(peak_tflops, bandwidth_tb_s):\n    return peak_tflops / bandwidth_tb_s\n\nfor name, peak, bw in [('A100 FP16 dense', 312, 2.0), ('H100 FP16-ish dense', 990, 3.35)]:\n    print(f'{name:18s}: ridge = {ridge_point(peak, bw):.1f} FLOPs/byte')\n\nassert abs(ridge_point(312, 2.0) - 156.0) < 1e-9\n",
      "outputs": [],
      "execution_count": null
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": "## 2. Arithmetic Intensity Calculators\n"
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": "def ai_decode_attention(d, dtype_bytes=2):\n    flops = 2 * d\n    bytes_moved = 2 * d * dtype_bytes\n    return flops / bytes_moved\n\ndef ai_ffn_batch(batch):\n    return batch / 2\n\ndef classify(ai, ridge=156):\n    return 'compute-bound' if ai >= ridge else 'memory-bound'\n\nexamples = [\n    ('decode attention d=4096', ai_decode_attention(4096)),\n    ('FFN batch=1', ai_ffn_batch(1)),\n    ('FFN batch=64', ai_ffn_batch(64)),\n    ('prefill attention rough d/4', 4096 / 4),\n]\nfor name, ai in examples:\n    print(f'{name:30s} AI={ai:8.2f} -> {classify(ai)}')\n",
      "outputs": [],
      "execution_count": null
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": "## 3. Roofline Bound\n\nPredicted throughput is `min(peak, AI * bandwidth)`.\n"
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": "def roofline_tflops(ai, peak_tflops=312, bandwidth_tb_s=2.0):\n    return min(peak_tflops, ai * bandwidth_tb_s)\n\nfor name, ai in examples:\n    print(f'{name:30s} predicted <= {roofline_tflops(ai):8.2f} TFLOP/s')\n",
      "outputs": [],
      "execution_count": null
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": "## 4. Optional PyTorch Timing\n\nThis runs if PyTorch is installed. On CUDA, replace wall-clock timers with `torch.cuda.Event` for cleaner measurements.\n"
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": "try:\n    import torch\n    device = 'cuda' if torch.cuda.is_available() else ('mps' if torch.backends.mps.is_available() else 'cpu')\n    print('torch', torch.__version__, 'device', device)\n    n = 512\n    a = torch.randn(n, n, device=device)\n    b = torch.randn(n, n, device=device)\n    # warmup\n    for _ in range(3):\n        c = a @ b\n    if device == 'cuda':\n        torch.cuda.synchronize()\n    elif device == 'mps':\n        torch.mps.synchronize()\n    t0 = time.perf_counter()\n    c = a @ b\n    if device == 'cuda':\n        torch.cuda.synchronize()\n    elif device == 'mps':\n        torch.mps.synchronize()\n    elapsed = time.perf_counter() - t0\n    flops = 2 * n**3\n    print(f'{n}x{n} matmul: {elapsed*1e3:.2f} ms, {flops/elapsed/1e12:.3f} TFLOP/s')\nexcept Exception as e:\n    print('Skipping PyTorch timing:', e)\n",
      "outputs": [],
      "execution_count": null
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": "## 5. Exercise Check\n"
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": "answers = {\n    'decode_d4096': ai_decode_attention(4096),\n    'ffn_b1': ai_ffn_batch(1),\n    'ffn_b64': ai_ffn_batch(64),\n}\nprint(answers)\nassert answers['decode_d4096'] == 0.5\nassert answers['ffn_b1'] == 0.5\nassert answers['ffn_b64'] == 32.0\n",
      "outputs": [],
      "execution_count": null
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "name": "python",
      "pygments_lexer": "ipython3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}
