PyTorch 推理工程（07）：Profiling、Benchmark 与瓶颈定位

1. 本节定位

性能问题需要可复现的测量与分解：区分主机预处理与设备计算、算子时间与同步开销、冷启动与稳态、batch 规模与硬件利用率（算力或带宽约束）。本篇整理 PyTorch 侧计时、benchmark 与 Profiler 的基本用法及常见误判来源（尤其未同步的 CUDA 计时）。

2. 性能分析流程

性能分析里最常见的错误：

先改一堆东西
发现稍微快了点
完全不知道是哪一步起作用的

正确流程：

第 1 步：定义待测指标
    ↓ 单次延迟？平均吞吐？某个模块耗时？显存峰值？
第 2 步：保证计时方法正确
    ↓ CUDA 异步执行，不做同步就测 = 测了个寂寞
第 3 步：benchmark 做对比
    ↓ 对照组 vs 实验组，公平比较
第 4 步：profiler 找热点
    ↓ 哪个 op 最贵？CPU 还是 GPU？哪行代码触发的？
第 5 步：针对性优化
    ↓ 仅修改已定位瓶颈相关实现，避免无依据改动

这个流程比"先 half 一下、compile 一下试试"专业得多。

3. 为什么 `time.time()` 在 CUDA 场景下不靠谱

很多人测推理时间，会这样写：

import time

start = time.time()
y = model(x)
end = time.time()
print(f"耗时: {(end - start) * 1000:.2f} ms")

在 CPU 场景里还算凑合，但在 CUDA 推理里，这经常是错的。

原因：CUDA 操作是异步发起的

GPU 为异步执行设备。当执行 y = model(x) 时：

CPU 把计算指令发给 GPU 就返回了
GPU 还在后台执行
这时 time.time() 记录的是"CPU 发完指令的时间"，不是"GPU 真正执行完的时间"

1 2	CPU: [发指令] → [调用 time.time()] ← 常见打点在主机侧 GPU: [还在跑...]

该方式测得的间隔可能显著短于 GPU 实际执行时间。

同步后再计时：更接近真实

import time
import torch

# 先确保 GPU 把之前的工作做完
torch.cuda.synchronize()
start = time.time()

y = model(x)

# 再等 GPU 把这次的工作做完
torch.cuda.synchronize()
end = time.time()

print(f"GPU 推理耗时: {(end - start) * 1000:.2f} ms")

torch.cuda.synchronize() 的作用：等待 GPU 完成当前所有已发起的操作，把 CPU 和 GPU 的时间线"对齐"。

记住这个规则：CUDA 计时不做同步，很多结论都是虚的。

对比实验：感受差距

import torch
import torch.nn as nn
import time

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512)).to(device).eval()
x = torch.randn(64, 512, device=device)

with torch.inference_mode():
    # 错误方式
    start = time.time()
    y = model(x)
    end = time.time()
    print(f"[未同步] 计时: {(end - start) * 1000:.3f} ms")

    # 正确方式
    torch.cuda.synchronize()
    start = time.time()
    y = model(x)
    torch.cuda.synchronize()
    end = time.time()
    print(f"[已同步] 计时: {(end - start) * 1000:.3f} ms")

未同步与同步后的读数可相差数量级（例如 0.01 ms 量级 vs 1–5 ms）。

4. `torch.utils.benchmark`：比手写循环更专业的 benchmark 工具

虽然同步后的 time.time() 比不同步好，但还有个问题：一次计时的随机性太大。

torch.utils.benchmark.Timer 是 PyTorch 专为这类对比实验设计的工具，它比手写 for + time.time() 好在：

自动 warmup：首次运行有 CUDA 上下文初始化、kernel 选择等额外开销，会自动跳过
自动重复：运行足够多次，统计稳定的均值、中位数、IQR
知道 GPU 是异步的：在必要时自动处理同步
更公平的线程环境：控制 CPU 线程池，避免并发干扰对比结果

最小示例

import torch
import torch.nn as nn
import torch.utils.benchmark as benchmark

device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 64),
).to(device).eval()

x = torch.randn(32, 128, device=device)

# 用 inference_mode 包住 stmt（在 globals 里设置好上下文）
t = benchmark.Timer(
    stmt="with torch.inference_mode(): model(x)",
    globals={"model": model, "x": x, "torch": torch},
    label="MLP forward",
    sub_label="batch=32",
)

result = t.blocked_autorange(min_run_time=1.0)
print(result)

理解输出

MLP forward: batch=32
  Median: 0.12 ms
  IQR:    0.01 ms (0.12 to 0.13)
  337 measurements, 100 runs per measurement, 1 thread

字段	含义
`Median`	中位数延迟，比均值更抗噪声
`IQR`	四分位距（第 75 百分位 - 第 25 百分位），越小说明越稳定
`measurements`	跑了多少组
`runs per measurement`	每组跑多少次（自动决定）

比较两个实现

benchmark 真正强大的地方是对比实验：

results = []

# FP32 基线
model_fp32 = nn.Linear(512, 512).to(device).eval()
x_fp32 = torch.randn(64, 512, device=device)

t1 = benchmark.Timer(
    stmt="with torch.inference_mode(): model(x)",
    globals={"model": model_fp32, "x": x_fp32, "torch": torch},
    label="Linear 512x512",
    sub_label="FP32",
)
results.append(t1.blocked_autorange(min_run_time=1.0))

# FP16 autocast
t2 = benchmark.Timer(
    stmt="""
with torch.inference_mode():
    with torch.autocast(device_type='cuda', dtype=torch.float16):
        model(x)
""",
    globals={"model": model_fp32, "x": x_fp32, "torch": torch},
    label="Linear 512x512",
    sub_label="FP16 autocast",
)
results.append(t2.blocked_autorange(min_run_time=1.0))

# 打印对比表
compare = benchmark.Compare(results)
compare.print()

输出为各实现耗时的对比表及统计量。

5. `torch.profiler`：找到"是什么慢"而不只是"多慢"

benchmark 侧重整体耗时；profiler 将耗时分解到算子与调用栈。

profiler 能回答的问题

哪个 op 耗时最多？（CPU 时间 vs CUDA 时间）
某个 op 被调用了多少次？
某个 op 处理的输入 shape 是多少？
这段热点代码在模型的哪一行触发的？
是哪类 op 占主导：matmul、LayerNorm、Softmax？
显存分配和释放有没有异常？

最基础的 profiler 写法

import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 64),
).to(device).eval()

x = torch.randn(32, 128, device=device)

with torch.inference_mode():
    with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ],
        record_shapes=True,      # 记录每个 op 的输入 shape
        with_stack=True,         # 记录代码调用位置（文件 + 行号）
    ) as prof:
        y = model(x)

# 按 CUDA 耗时降序，打印前 20 个热点
print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=20))

6. 读懂 profiler 输出表

这是初学者最容易卡住的地方。下面是一个典型的输出（已简化）：

---------------------------------  -------  -------  -------  -------  --------
Name                               CPU %    CPU us   CUDA %   CUDA us  Calls
---------------------------------  -------  -------  -------  -------  --------
aten::linear                        5.2%      120      62.1%    2450      2
aten::addmm                         2.1%       48      58.3%    2300      2
aten::relu                          1.0%       23       7.2%     285      1
aten::t                             0.8%       18       2.1%      83      2
ProfilerStep*                      91.0%     2100       0.0%       0      1
---------------------------------  -------  -------  -------  -------  --------

列名	含义
`Name`	操作名称（aten op）
`CPU us`	CPU 侧累计花费的时间（微秒）
`CUDA us`	GPU 侧实际执行花费的时间（微秒）
`Calls`	被调用了多少次

关键：self_cuda_time_total vs cuda_time_total

cuda_time_total：这个 op 及其所有子调用的 GPU 时间总和（“包含子操作”）
self_cuda_time_total：这个 op 自身的 GPU 时间，不含子调用

宜优先关注 self_cuda_time_total：aten::linear 等会调用子算子（如 aten::addmm），若仅看 cuda_time_total 易重复计入子调用；self_cuda_time_total 更贴近该算子自身开销。

# 看 CUDA 侧自身耗时排名
print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=20))

# 看 CPU 侧自身耗时排名（排查 Python 开销）
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))

7. profiler 的各个参数：分别有什么用

`activities`：记录哪些设备的活动

activities=[
    torch.profiler.ProfilerActivity.CPU,    # Python 端、算子调度
    torch.profiler.ProfilerActivity.CUDA,   # GPU kernel 执行
]

两个都开才能看到完整的 CPU+GPU 时间分布。

`record_shapes=True`：记录每个 op 的输入形状

1
2
3

热点在 aten::addmm，输入 shape 是 [32, 128] @ [128, 256]
                                           ↑
                           这是 batch=32 下的 Linear(128, 256)

有了 shape 信息，才能判断"这个 op 贵，是因为它本来就贵，还是因为这个 shape 让它特别贵？"

`with_stack=True`：记录热点来自代码哪一行

1
2
3

aten::addmm 热点
  来自：model.py, line 45, forward()
             └── transformer_block.py, line 12, attention()

若已知热点为 matmul 但需映射到具体层，可启用 with_stack=True 获取 Python 调用栈。

`profile_memory=True`：跟踪显存分配和释放

with torch.profiler.profile(
    activities=[...],
    profile_memory=True,   # 会大幅增加 profiler 自身开销
) as prof:
    y = model(x)

适合排查：某段前向是不是在频繁分配临时 Tensor，或者显存峰值为什么突然高了。

⚠️ 注意：profile_memory=True 会显著增加 profiler 本身的开销，不要在追求准确吞吐测量时同时开启。

`with_flops=True`：估算部分 op 的计算量

对 matmul、conv2d 等 op 提供 FLOPs 估算，帮助判断"是计算密集还是带宽密集"。

FLOPs 不等于性能。同样的 FLOPs，如果数据访问模式不好（比如非连续内存），实际执行可能很慢。

8. 导出 Chrome trace：用时间线看执行结构

key_averages().table() 只能看热点排名，但不能看"执行结构"——CPU 和 GPU 是否重叠、哪里有空洞、kernel 是不是碎片化。

这时要导出 Chrome trace：

with torch.inference_mode():
    with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ],
        record_shapes=True,
    ) as prof:
        y = model(x)

# 导出 trace 文件
prof.export_chrome_trace("trace.json")

然后在浏览器打开 chrome://tracing，把 trace.json 拖入，就能看到完整的时间线视图：

1
2
3

CPU:  [dispatch] [dispatch] [dispatch]
GPU:                 [kernel A] [kernel B] [kernel C]
      ←────── 重叠执行，好 ──────→

1
2
3

CPU:  [dispatch] [等待] [dispatch] [等待] [dispatch]
GPU:             [A]              [B]              [C]
      ← 串行等待，坏，说明有同步点 →

时间线能发现的问题：

CPU/GPU 长时间没有重叠 → 有不必要的同步点
GPU 时间线里有大量零散小 kernel → 可能需要 kernel fusion
GPU 有大段空白 → 数据供给不足（CPU 预处理太慢或搬运是瓶颈）

9. schedule：长推理循环里如何按需 profiling

对长循环推理若全程全量记录，profiler 自身开销与 trace 体积均会扭曲结果或难以分析。

schedule 用于限定在指定步区间内开启采集，以控制开销与 trace 规模：

import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(128, 64).to(device).eval()
x = torch.randn(32, 128, device=device)

with torch.inference_mode():
    with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ],
        schedule=torch.profiler.schedule(
            skip_first=2,   # 跳过前 2 轮（CUDA 上下文初始化）
            wait=1,         # 等待 1 轮（不记录，让系统稳定）
            warmup=1,       # warmup 1 轮（记录但不纳入分析）
            active=3,       # 只分析这 3 轮
            repeat=1,       # 只做一次 skip+wait+warmup+active 循环
        ),
        on_trace_ready=torch.profiler.tensorboard_trace_handler("./log"),
    ) as prof:
        for i in range(10):
            y = model(x)
            prof.step()   # 告诉 profiler：这一轮结束了

各阶段含义：

轮次:  0    1    2    3    4    5    6    7    8    9
阶段:  skip skip wait warm act  act  act  done done done
                 ↑              ↑
          不记录，稳定       这 3 轮被完整记录

prof.step() 是关键——它告诉 profiler “当前轮结束”，profiler 才知道要切换到下一个阶段。

10. profiler vs benchmark：分工清晰

很多人会把两者混用，这里说清楚：

工具	核心问题	输出形式
`torch.profiler`	哪里慢？热点在哪？	op 表格 + 时间线
`torch.utils.benchmark.Timer`	哪个实现更快？快了多少？	中位数 + IQR + 对比表

标准工作流：

1 2	先用 benchmark 确认："A 比 B 快了 30%" 再用 profiler 解释："因为 A 的 attention QK matmul 用了 float16，节省了 2x CUDA 时间"

仅有 benchmark 无量化归因 → 难以解释加速来源、难以迁移到其他配置
仅有 profiler 无前后对比 → 难以量化优化收益

11. 冷启动与稳态须分别讨论

这是容易被忽视的问题。

import torch
import torch.nn as nn
import time

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(512, 512).to(device).eval()
x = torch.randn(64, 512, device=device)

with torch.inference_mode():
    for i in range(6):
        torch.cuda.synchronize()
        t0 = time.time()
        y = model(x)
        torch.cuda.synchronize()
        t1 = time.time()
        print(f"第 {i+1} 次: {(t1 - t0) * 1000:.2f} ms")

典型日志类似：

第 1 次: 45.23 ms  ← 首次：CUDA 上下文初始化 + kernel 选择
第 2 次:  1.12 ms  ← 稳态开始
第 3 次:  1.09 ms
第 4 次:  1.11 ms
第 5 次:  1.08 ms
第 6 次:  1.10 ms

第 1 次慢了 40 倍，但这不是"模型慢"，而是冷启动开销。

首轮慢的常见原因	稳态慢的常见原因
CUDA 上下文初始化	模型算子本身开销
cuDNN/cuBLAS kernel 选择	batch 配置不合理
`torch.compile()` 编译	内存带宽瓶颈
GPU 内存分配首次开销	数据搬运频繁
Python lazy initialization	同步点过多

结论：benchmark 前宜 warmup，并将首轮与稳态分开报告。

12. 一个完整的性能分析案例

可将下列步骤串为一次完整诊断流程：

import torch
import torch.nn as nn
import torch.utils.benchmark as benchmark
import time

device = "cuda" if torch.cuda.is_available() else "cpu"

# 待分析的模型
model = nn.Sequential(
    nn.Linear(512, 2048),
    nn.ReLU(),
    nn.Linear(2048, 2048),
    nn.ReLU(),
    nn.Linear(2048, 512)
).to(device).eval()

x = torch.randn(64, 512, device=device)

# ─── 第 1 步：正确计时，观察首轮慢 ───────────────────────────────────────
print("=== 第 1 步：观察首轮 vs 稳态 ===")
with torch.inference_mode():
    for i in range(5):
        if device == "cuda":
            torch.cuda.synchronize()
        t0 = time.time()
        y = model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        t1 = time.time()
        print(f"  第 {i+1} 次: {(t1-t0)*1000:.2f} ms")

# ─── 第 2 步：benchmark 稳态性能 ─────────────────────────────────────────
print("\n=== 第 2 步：benchmark ===")
t = benchmark.Timer(
    stmt="with torch.inference_mode(): model(x)",
    globals={"model": model, "x": x, "torch": torch},
    label="MLP",
    sub_label="FP32, batch=64",
)
result = t.blocked_autorange(min_run_time=2.0)
print(result)

# ─── 第 3 步：profiler 找热点 ────────────────────────────────────────────
print("\n=== 第 3 步：profiler 热点分析 ===")
with torch.inference_mode():
    with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ],
        record_shapes=True,
        with_stack=True,
    ) as prof:
        # 先 warmup 几次
        for _ in range(3):
            y = model(x)
        # 再 profile 一次
        y = model(x)

# 打印 CUDA 热点
print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=15))

# 导出时间线（可选）
prof.export_chrome_trace("mlp_trace.json")
print("trace 已导出到 mlp_trace.json，用 chrome://tracing 打开查看")

13. 推理代码里的"隐形同步坑"：用 profiler 意识排查

即使没开 profiler，也要带着这种意识看代码：

# 反例: 每次循环都 .item()：触发 GPU→CPU 同步
for batch in dataloader:
    output = model(batch.to(device))
    loss = output.sum().item()   # ← 每次都同步，GPU 等一次，大幅降低吞吐

# 示例: 推理结束后统一处理
outputs = []
with torch.inference_mode():
    for batch in dataloader:
        y = model(batch.to(device))
        outputs.append(y)   # 保留在 GPU 上
# 循环结束后一次性取结果
all_outputs = torch.cat(outputs, dim=0)
total_sum = all_outputs.sum().item()   # 只同步一次

# 反例: 打印中间结果：可能触发同步
for i in range(100):
    y = model(x)
    print(f"step {i}: output sum = {y.sum()}")   # 每次打印都要等 GPU

# 示例: 只在需要时打印
for i in range(100):
    y = model(x)
    if i % 20 == 0 and i > 0:
        torch.cuda.synchronize()
        print(f"step {i}: 检查点")

14. 看 profiler 输出时，应该怎么想

阅读 profiler 汇总表时，除排名第一项外，宜结合占比分布与调用关系综合判断。

系统性阅读步骤：

① 先看总 CPU vs CUDA 时间比例

1	CPU total: 500 μs CUDA total: 2000 μs

CUDA 占大头 → GPU 是主要执行者，符合预期
CPU 占大头 → 可能有大量 Python 开销或同步等待

② 看是少数大 op 还是大量小 op

1 2	情况 A：matmul 占 80% CUDA 时间，其余分散 → 单一瓶颈，集中优化情况 B：300 个小 op 各占 0.3% CUDA 时间 → 碎片化，考虑 kernel fusion

③ 看 Calls 列

某个 op 单次不贵（10μs），但被调用了 500 次 → 累计 5ms，可能值得优化

④ 看 shape（需要 `record_shapes=True`）

热点 op 的 shape 是不是很"尴尬"（比如 matmul [1, 4096] @ [4096, 512]，batch=1 没法吃满 GPU）

⑤ 看 stack（需要 `with_stack=True`）

热点映射到 attention.py:35 还是 preprocess.py:12 等自有代码？

15. 性能分析的常见误区

误区	正确认识
看一次 time.time() 就下结论	CUDA 异步，必须同步后计时，而且要多次取中位数
首次运行慢就以为模型慢	冷启动开销，warmup 之后才是稳态
FLOPs 高的 op 一定慢	实际性能取决于是否能吃满算力，带宽和访存也是关键
profiler 全开所有选项才准	profiler 本身有开销，`profile_memory=True` 会显著影响时序
第一个热点就是唯一瓶颈	可能有多个同等量级的热点，改一个另一个会浮出来
快了就行，不需要解释	benchmark 对比 + profiler 解释 = 可推广的结论

16. 面试常见问题

Q：为什么 CUDA 场景下直接用 time.time() 包住前向不准确？
A：CUDA 操作是异步发起的，model(x) 返回时 GPU 可能还在执行，time.time() 测到的只是"CPU 发完指令"的时间，不是 GPU 执行完成的时间。必须用 torch.cuda.synchronize() 确保 GPU 完成后再打时间戳。

Q：torch.profiler 可提供哪些观测项？
A：可以看哪些 op 最贵（CPU time / CUDA time）、op 的调用次数、输入 shape、代码调用栈位置，以及导出时间线查看 CPU/GPU 重叠情况和 kernel 碎片化。

Q：self_cuda_time_total 和 cuda_time_total 有什么区别？
A：cuda_time_total 包含子 op 的时间（比如 aten::linear 会包含它内部调用的 aten::addmm），self_cuda_time_total 只看这个 op 自身的时间。分析热点时优先看 self_cuda_time_total，避免重复计算。

Q：torch.utils.benchmark.Timer 比手写循环计时好在哪里？
A：自动处理 warmup（跳过 CUDA 冷启动）、控制线程池保证对比公平、处理 GPU 异步同步、统计中位数和 IQR 而不只是均值，整体结果更可靠，适合对比两个实现的性能差距。

Q：profiler 和 benchmark 分别用来干什么？
A：benchmark 回答"谁更快、快了多少"；profiler 回答"慢在哪里、为什么"。标准工作流：先 benchmark 确认收益，再 profiler 定位原因。

Q：为什么推理分析中要区分冷启动和稳态？
A：首轮有 CUDA 上下文初始化、cuBLAS kernel 选择、torch.compile 编译等额外开销，远慢于稳态。如果不 warmup 就计时，结论会严重偏向冷启动而夸大延迟。

17. 思考题

练习 1：感受 CUDA 计时的差距

import torch, torch.nn as nn, time

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(1024, 1024).to(device).eval()
x = torch.randn(64, 1024, device=device)

with torch.inference_mode():
    # 不同步计时
    t0 = time.time(); y = model(x); t1 = time.time()
    print(f"不同步: {(t1-t0)*1000:.3f} ms")

    # 同步计时
    torch.cuda.synchronize()
    t0 = time.time(); y = model(x); torch.cuda.synchronize(); t1 = time.time()
    print(f"同步:   {(t1-t0)*1000:.3f} ms")

思考：两次结果差了多少？不同步时测到的"时间"为什么这么短？

练习 2：观察冷启动

1	# 对同一个模型连续调用 6 次，记录每次时间，观察第 1 次和后续的差异

思考：第 1 次慢的原因是什么？第 2 次之后时间为什么稳定了？

练习 3：最小 profiler 实验

对一个两层 MLP 运行 profiler，然后：

打印 sort_by="self_cuda_time_total" 的热点表
找出排名第一的 op，思考它是什么操作

进阶：启用 record_shapes=True，核对热点算子的输入 shape 与模型定义是否一致

练习 4：用 benchmark 做对比实验

选一组对比（任选其一）：

eager vs torch.compile()
FP32 forward vs autocast float16 forward
batch=1 vs batch=64

用 benchmark.Timer + blocked_autorange 测出中位数，打印对比结果。

思考：IQR 大说明什么？中位数和均值哪个更能代表"典型性能"？

练习 5：导出 Chrome trace 并查看时间线

在练习 3 的基础上，加一行：

1	prof.export_chrome_trace("my_trace.json")

用 chrome://tracing 打开 trace，找到：

CPU 调度层
CUDA kernel 层
两者是否有重叠

思考：时间线里有没有 CPU 和 GPU 都空着的"空洞"？空洞说明什么？

18. 本节要点与自检

知道 CUDA 计时为什么必须同步，会正确写同步计时代码
理解冷启动和稳态的区别，做 benchmark 时会自动 warmup
能用 torch.profiler 找到推理中的热点 op，读懂 self_cuda_time_total 等列的含义
理解 record_shapes、with_stack、profile_memory 各自解决什么问题
能用 torch.utils.benchmark.Timer 做公平的性能对比实验
知道 profiler 偏"诊断"，benchmark 偏"比较"，会配合使用
面对"这个模型慢"时，能给出系统的分析流程而不是拍脑袋猜

19. 小结

真正的推理优化不是"多试几个开关"，而是"先用正确计时排除噪声，用 benchmark 量化收益，用 profiler 定位热点，最后才针对性优化"。

系列导航

上一篇：PyTorch推理工程：06 torch.compile、导出与 ONNX
下一篇：PyTorch推理工程：08 批处理、KV Cache 与 Serving 视角
系列索引（00 导读）