mini-infer Systems in Practice-24-Grouped Execution: After Passing the Gate, What Matters More Is Making the Gains and Costs Clear

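Judged by the outcome alone, Phase 21 is easy to summarize in one smooth sentence:

I moved MoE EP local expert execution from per-expert scatter/gather to grouped execution, and the final benchmark passed the gate.

That sentence is not wrong, but it is too flat.
What is really worth writing down is not "grouped execution got built" but three questions from this round that sound more like what an inference team at a large company would press on in a review:

Why did the bottleneck naturally shift to local expert execution after Phase 20?
Why did the first version, grouped contiguous slices, fail the formal performance gate even though the direction was right?
Why does the version that finally passed have to account for the new runtime resident cache as well?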

The averaged results of the formal Phase 21 benchmark are:

dense: 21878.76 tok/s
ep_padded: 42765.91 tok/s
ep_packed: 51121.99 tok/s
ep_grouped: 54696.73 tok/s
EP grouped / dense = 2.500x
EP grouped / EP packed = 1.070x
max_abs_diff_grouped = 0.000000
shard_ratio = 0.5002

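All of these numbers come from the formal benchmark for this phase; none are estimates.

But the most important conclusion of this round is not "it got a bit faster again". It is:

The grouped path did move the local expert hot path from per-expert scatter/gather to batched matmul. However, it only passed the gate because of an explicitly visible resident local gate/up cache. That is a real engineering cost, not a free optimization.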

Background: Phase 20 already showed that the control plane is not the main bottleneck

After Phase 20, the synthetic MoE EP path had a very clear set of conclusions:

ep_packed_bytes_per_layer = ep_ideal_bytes_per_layer = 262144
EP packed / dense = 2.310x
EP packed / EP padded = 1.204x
ep_packed_control_plane_share ≈ 1.94%
max_abs_diff_packed = 0.000000

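In other words:

payload bytes are already exact
true expert sharding has not regressed
the control plane has been measured down to a very small share

So continuing to fuss over the split-size helper itself at this point would drift away from the main problem.
The remaining hotspot looks more like the way local expert compute is organized in moe_layer.py.

Before Phase 21, local expert execution was essentially a fairly plain per-expert path:

torch.where(...) finds the tokens for the current expert
index_select(...) gathers the inputs
run the expert FFN
index_copy_(...) writes the outputs back

The problem with this path is not that it "cannot run". When tokens are spread across multiple experts, it leaves a very visible scatter/gather overhead.
With communication, packed bytes, and the control plane already under control, this local hot path naturally became the most worthwhile target for the next phase.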
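
The four steps above can be sketched in a few lines of PyTorch. This is only an illustration, not the project's `_apply_experts_naive()`: the function name `apply_experts_naive`, the stacked weight layout, top-1 routing, and the absence of biases are all assumptions made to keep the example self-contained.

```python
import torch
import torch.nn.functional as F


def apply_experts_naive(x, expert_ids, gate_w, up_w, down_w):
    """Per-expert gather -> FFN -> scatter (hypothetical sketch).

    x:           [num_tokens, hidden]
    expert_ids:  [num_tokens], local expert index per token (top-1 assumed)
    gate_w/up_w: [num_experts, inter, hidden]
    down_w:      [num_experts, hidden, inter]
    """
    out = torch.zeros_like(x)
    for e in range(gate_w.shape[0]):
        idx = torch.where(expert_ids == e)[0]      # tokens routed to expert e
        if idx.numel() == 0:
            continue
        xe = x.index_select(0, idx)                # gather this expert's inputs
        hidden = F.silu(xe @ gate_w[e].T) * (xe @ up_w[e].T)
        out.index_copy_(0, idx, hidden @ down_w[e].T)  # scatter outputs back
    return out


torch.manual_seed(0)
x = torch.randn(8, 4)
expert_ids = torch.tensor([1, 0, 1, 2, 0, 2, 1, 0])
gate_w, up_w = torch.randn(3, 6, 4), torch.randn(3, 6, 4)
down_w = torch.randn(3, 4, 6)
y = apply_experts_naive(x, expert_ids, gate_w, up_w, down_w)
```

Each loop iteration issues its own index lookup, gather, small matmuls, and scatter, so the number of kernel launches grows with the number of experts rather than with the amount of useful compute.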

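Problem definition: the goal is not to "wire in a grouped mode" but to make grouped deliver a reproducible gain over packed

The acceptance criteria for Phase 21 are not an abstract "supports grouped execution" but two hard performance lines: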

ep_grouped >= ep_packed × 1.05
ep_grouped >= dense × 2.35

At the same time, three things that were already done right must not regress:

ep_grouped_bytes_per_layer == ep_ideal_bytes_per_layer
shard_ratio <= 0.52
max_abs_diff_grouped < 1e-4

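This means grouped cannot "fake success" in any of the following ways:

quietly breaking the bytes accounting
copying an extra full set of weights without counting it in the benchmark
moving preprocessing outside the timing window
trading numerical drift for throughput

So the really hard part of this round is not hooking up the expert_exec_mode="grouped" interface. It is making gains, correctness, and memory accounting all hold up at the same time.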
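
As a rough illustration, the five gates can be written as one explicit check. The container `Phase21Result` and the function `check_gates` are hypothetical names, not part of the project; the thresholds are the ones listed above, and the sample values are the final averaged numbers reported for this phase.

```python
from dataclasses import dataclass


@dataclass
class Phase21Result:
    """Hypothetical container; fields mirror the benchmark output keys."""
    dense_tok_s: float
    ep_packed_tok_s: float
    ep_grouped_tok_s: float
    ep_grouped_bytes_per_layer: int
    ep_ideal_bytes_per_layer: int
    shard_ratio: float
    max_abs_diff_grouped: float


def check_gates(r: Phase21Result) -> dict:
    """Return one pass/fail flag per gate so a failure names the gate that broke."""
    return {
        "grouped_vs_packed": r.ep_grouped_tok_s >= r.ep_packed_tok_s * 1.05,
        "grouped_vs_dense": r.ep_grouped_tok_s >= r.dense_tok_s * 2.35,
        "exact_bytes": r.ep_grouped_bytes_per_layer == r.ep_ideal_bytes_per_layer,
        "shard_ratio": r.shard_ratio <= 0.52,
        "numerics": r.max_abs_diff_grouped < 1e-4,
    }


final = Phase21Result(
    dense_tok_s=21878.76,
    ep_packed_tok_s=51121.99,
    ep_grouped_tok_s=54696.73,
    ep_grouped_bytes_per_layer=262144,
    ep_ideal_bytes_per_layer=262144,
    shard_ratio=0.5002,
    max_abs_diff_grouped=0.0,
)
gates = check_gates(final)
print(gates)
```

Keeping the gates as separate named flags, rather than one boolean, makes it harder for a throughput win to mask a regression in bytes, sharding, or numerics.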

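First version: the direction was right, but formal performance still fell short

The idea behind the very first grouped execution version was quite natural:

first build contiguous slices per local expert
use grouped metadata to describe the contiguous range that belongs to each expert
replace the original where/index_select/index_copy_ with grouped execution over contiguous slices

The idea itself was sound, and it quickly achieved:

exact bytes did not regress
true expert sharding did not regress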
max_abs_diff_grouped = 0.000000
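
The "contiguous slices plus grouped metadata" idea can be sketched in plain Python as follows. This is an assumption-laden illustration rather than the project's code: `ExpertRun`, `build_runs`, and the `start`/`end` fields are hypothetical, and only the `runs` / `local_expert_id` naming follows identifiers that appear in the article's own code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ExpertRun:
    local_expert_id: int
    start: int  # inclusive offset into the expert-sorted token order
    end: int    # exclusive offset


def build_runs(expert_ids: List[int]) -> Tuple[List[int], List[ExpertRun]]:
    """Return (perm, runs): perm sorts tokens by expert, runs are contiguous slices."""
    # sorted() is stable, so tokens keep their original order inside each slice.
    perm = sorted(range(len(expert_ids)), key=lambda i: expert_ids[i])
    runs: List[ExpertRun] = []
    for pos, tok in enumerate(perm):
        e = expert_ids[tok]
        if runs and runs[-1].local_expert_id == e:
            runs[-1].end = pos + 1          # extend the current expert's slice
        else:
            runs.append(ExpertRun(e, pos, pos + 1))  # open a new slice
    return perm, runs


perm, runs = build_runs([1, 0, 1, 2, 0, 2, 1, 0])
print(perm)
for r in runs:
    print(r)
```

Once tokens are permuted this way, each expert reads and writes one contiguous range, so the per-expert index lookups can be replaced by plain slicing.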

But the formal benchmark results for the first version did not look good. At one point that version only reached:

EP grouped / dense ≈ 2.323x
EP grouped / EP packed ≈ 1.011x

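In other words, the hotspot was hit, but the gain was not enough.

Results like this are typical in systems optimization:
you have removed the obvious bad smell, but if the actual compute organization is not strong enough, you end up "slightly better than the baseline" rather than past the gate.

Evidence from the profiler: the hotspot was indeed hit

Phase 21 did not look only at throughput. It also added a set of equivalent profiling runs focused on the expert compute hot path of the source-rank local shard.

On the same local workload, the comparison was between: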

_apply_experts_naive()
_apply_experts_grouped()

The result is clear:

| Metric | naive local expert | grouped local expert |
| --- | --- | --- |
| total CUDA self time | `1.5430 ms` | `0.5590 ms` |
| top-1 op | `cutlass fp16 gemm 0.6110 ms (39.60%)` | `cutlass fp16 gemm 0.2930 ms (52.42%)` |
| top-2 op | `indexSelectSmallIndex 0.2400 ms (15.55%)` | `ampere_fp16_s16816gemm 0.2000 ms (35.78%)` |
| top-3 op | `index_copy kernel 0.1440 ms (9.33%)` | `CatArrayBatchedCopy 0.0260 ms (4.65%)` |
| `Memcpy DtoH` share | `0.0800 ms (5.18%)` | `0.0000 ms (0.00%)` |

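This data shows:

in the naive path, indexSelectSmallIndex + index_copy together account for about 24.9% of CUDA self time
in the grouped path, these scatter/gather hotspots are no longer among the top ops
the organization of local expert compute did change from "per-expert gather/scatter" to "dominated by batched matmul"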
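
The shares quoted above follow directly from the profiling table. A minimal sketch of the arithmetic, with the millisecond values copied from that table and the helper name `share_pct` chosen only for illustration:

```python
# CUDA self times in milliseconds, copied from the profiling table.
NAIVE_TOTAL_MS = 1.5430
GROUPED_TOTAL_MS = 0.5590
NAIVE_SCATTER_GATHER_MS = {
    "indexSelectSmallIndex": 0.2400,
    "index_copy kernel": 0.1440,
}


def share_pct(op_ms: float, total_ms: float) -> float:
    """Share of total CUDA self time, in percent."""
    return 100.0 * op_ms / total_ms


scatter_gather_pct = share_pct(sum(NAIVE_SCATTER_GATHER_MS.values()), NAIVE_TOTAL_MS)
local_speedup = NAIVE_TOTAL_MS / GROUPED_TOTAL_MS

print(f"scatter/gather share in naive path: {scatter_gather_pct:.1f}%")
print(f"naive / grouped CUDA self time: {local_speedup:.2f}x")
```

On these numbers the local hot path shrinks by roughly 2.76x in CUDA self time, which is far more than the end-to-end gain over packed. One plausible reading is that local expert compute is only one part of the per-layer time, so a large local improvement translates into a much smaller throughput ratio.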

So the question is no longer "was the hotspot hit?" but:

Why, with the hotspot already hit, was formal throughput still just short of the gate?

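Cause: the first grouped version had not yet tightened local compute enough

The root cause eventually converged on a very concrete fact:

the grouped version had already removed most of the where/index_select/index_copy_ cost
but down_proj still stayed on the per-expert path
and the gate/up packing was still being repeated inside every forward

In terms of implementation, the changes that actually mattered in Phase 21 are concentrated in moe_layer.py.

In the final implementation of this round, EPMoELayer gained a resident grouped cache: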

self._grouped_resident_batched_params: Optional[_GroupedBatchedExpertParams] = None
self._grouped_resident_batched_params_dtype: Optional[torch.dtype] = None

and, through:

_build_grouped_resident_batched_params(...)
_get_grouped_batched_params(...)
grouped_resident_gateup_cache_bytes()

packs the batched local gate/up parameters into a single resident cache.

The key code path is in _apply_experts_grouped_batched():

params = self._get_grouped_batched_params(
    [run.local_expert_id for run in metadata.runs],
    compute_dtype=compute_dtype,
)

gate = torch.bmm(padded_inputs, params.gate_weight_t)
up = torch.bmm(padded_inputs, params.up_weight_t)
hidden = F.silu(gate) * up

Then down_proj no longer goes through the nn.Linear module call; it is computed directly:

expert_out = torch.mm(expert_hidden, expert.down_proj.weight.transpose(0, 1))
if expert.down_proj.bias is not None:
    expert_out = expert_out + expert.down_proj.bias

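The logic here deserves to be spelled out:

gate/up now go through the batched path with the help of the resident cache
down_proj is still per-expert
so this is not yet the "final grouped GEMM / kernel-level implementation"
but it is already enough to push throughput above the Phase 21 gate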
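
To make the shape of this design concrete, here is a minimal sketch of a resident gate/up cache feeding a batched path, with down_proj left per-expert. It is not the project's implementation: `GroupedGateUpCache`, `apply_experts_grouped`, the tuple form of `runs`, the zero-padding loop, and the bias-free weights are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


class GroupedGateUpCache:
    """Hypothetical resident cache: gate/up weights stacked and pre-transposed
    once, then reused by every forward that runs at the same dtype."""

    def __init__(self, gate_w, up_w):
        self._gate_w, self._up_w = gate_w, up_w    # each [E, inter, hidden]
        self._params, self._dtype = None, None

    def get(self, dtype):
        if self._params is None or self._dtype != dtype:
            # Built once per dtype; these copies are the extra resident memory.
            gate_t = self._gate_w.to(dtype).transpose(1, 2).contiguous()
            up_t = self._up_w.to(dtype).transpose(1, 2).contiguous()
            self._params, self._dtype = (gate_t, up_t), dtype
        return self._params

    def nbytes(self):
        if self._params is None:
            return 0
        return sum(t.numel() * t.element_size() for t in self._params)


def apply_experts_grouped(x_sorted, runs, cache, down_w):
    """x_sorted: tokens sorted by expert; runs: (expert_id, start, end) slices."""
    ids = [e for e, _, _ in runs]
    max_len = max(end - start for _, start, end in runs)
    padded = x_sorted.new_zeros(len(runs), max_len, x_sorted.shape[1])
    for i, (_, start, end) in enumerate(runs):
        padded[i, : end - start] = x_sorted[start:end]
    gate_t, up_t = cache.get(x_sorted.dtype)
    if ids != list(range(gate_t.shape[0])):        # only some experts got tokens
        gate_t, up_t = gate_t[ids], up_t[ids]
    # Two batched matmuls replace the per-expert gate/up projections.
    act = F.silu(torch.bmm(padded, gate_t)) * torch.bmm(padded, up_t)
    out = torch.empty_like(x_sorted)
    for i, (e, start, end) in enumerate(runs):
        # down_proj stays per-expert; padded rows are never read.
        out[start:end] = torch.mm(act[i, : end - start], down_w[e].transpose(0, 1))
    return out


torch.manual_seed(0)
gate_w, up_w = torch.randn(3, 6, 4), torch.randn(3, 6, 4)
down_w = torch.randn(3, 4, 6)
x_sorted = torch.randn(8, 4)
runs = [(0, 0, 3), (1, 3, 6), (2, 6, 8)]
cache = GroupedGateUpCache(gate_w, up_w)
y = apply_experts_grouped(x_sorted, runs, cache, down_w)
print(y.shape, cache.nbytes())
```

The cache trades memory for time: the transposed, contiguous copies are built once instead of on every forward, and `nbytes()` plays the role of a counter that a benchmark can report alongside throughput.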

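The real trade-off here: the throughput is not free

If the article stopped here, readers could easily get the wrong impression:

Great, just build grouped contiguous slices + a resident cache and performance naturally goes up.

But that is exactly why I do not want to write this up as a "success story".
This grouped version passes the gate not at zero cost but in exchange for a new runtime resident cache.

The Phase 21 benchmark now explicitly outputs: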

dense_runtime_param_bytes = 25174016
ep_rank_runtime_param_bytes = 12591104
ep_grouped_runtime_gateup_cache_bytes = 8388608
ep_grouped_runtime_resident_bytes = 20979712
ep_grouped_runtime_resident_ratio = 0.8334

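These fields are computed explicitly in benchmark_moe.py, instead of telling the story through state_dict accounting alone.

What this data means is straightforward:

the shard_ratio = 0.5002 from Phase 18 still holds, because it describes the parameter accounting of the rank-local shard
but in grouped mode, an additional local gate/up packed-weight cache stays resident at runtime
so the gain from grouped is built on higher runtime resident bytes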
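
The relationship between these fields can be checked with a few lines of arithmetic. The byte counts are the ones printed by the benchmark above; the formulas are an assumption about how the fields relate, chosen because they reproduce the reported values.

```python
# Byte counts copied from the benchmark output above.
dense_runtime_param_bytes = 25174016
ep_rank_runtime_param_bytes = 12591104
ep_grouped_runtime_gateup_cache_bytes = 8388608

# Assumed relationships between the fields (they match the reported numbers).
shard_ratio = ep_rank_runtime_param_bytes / dense_runtime_param_bytes
resident_bytes = ep_rank_runtime_param_bytes + ep_grouped_runtime_gateup_cache_bytes
resident_ratio = resident_bytes / dense_runtime_param_bytes
cache_overhead = ep_grouped_runtime_gateup_cache_bytes / ep_rank_runtime_param_bytes

print(f"shard_ratio    = {shard_ratio:.4f}")
print(f"resident_bytes = {resident_bytes}")
print(f"resident_ratio = {resident_ratio:.4f}")
print(f"cache adds {cache_overhead:.1%} on top of the rank-local parameters")
```

Read this way, the cache adds roughly two thirds on top of the rank-local parameter bytes, which is why the resident ratio sits near 0.83 even though the shard ratio stays near 0.50.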

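The most important engineering attitude in this round is:

Do not hide this resident cache in implementation details; put it explicitly in the benchmark output.

If you want to take this project to an interview with a core inference team, this point is critical.
The real follow-up questions will not be "did you build grouped execution?" but:

What did the faster throughput cost?
Is that cost memory?
Is that memory cost honestly recorded by the benchmark?

Phase 21 can now at least answer these questions head-on.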

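Formal results: the gate is passed, and the boundaries are clearer too

With the current final implementation, the average of two serial runs of the official workload is: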

dense 21878.76 tok/s
ep_padded 42765.91 tok/s
ep_packed 51121.99 tok/s
ep_grouped 54696.73 tok/s
EP grouped / dense = 2.500x
EP grouped / EP packed = 1.070x
max_abs_diff_grouped = 0.000000
shard_ratio = 0.5002
ep_grouped_bytes_per_layer = ep_ideal_bytes_per_layer = 262144

This means every original hard gate of Phase 21 has passed:

ep_grouped_bytes_per_layer == ep_ideal_bytes_per_layer
shard_ratio <= 0.52
ep_grouped >= ep_packed × 1.05
ep_grouped >= dense × 2.35
max_abs_diff_grouped < 1e-4

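But that does not mean the problem is finished.
Phase 21 also measured the boundary of the next phase more clearly:

grouped now passes the gate with a clear dependence on the resident local cache
down_proj is still not a full grouped GEMM
the current benchmark is still a synthetic layer benchmark, not a full generation pipeline

So as an engineering judgment, this round is better described as:

The gains and costs of PyTorch-level grouped execution have both been made clear.

rather than:

MoE local compute is completely done.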

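What I think is most worth presenting in an interview from this round

If I took this phase to an interview, I would not focus on "I gained a few more percent".
I would emphasize these three points:

1. Identify the main bottleneck first, then choose the theme of the phase

Phase 20 had already measured the control plane at about 1.94%, so Phase 21 did not keep wrestling with the split-size helper and instead turned to the local expert execution hot path.
This shows that the phases are not features piled on arbitrarily; they are driven by the benchmark.

2. A first version that misses the gate is not a disaster; what matters is whether you can prove the direction is right

The first version, grouped contiguous slices, was headed in the right direction, but throughput only reached: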

grouped / dense ≈ 2.323x
grouped / packed ≈ 1.011x

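Judged by the benchmark alone, this round looks like a "near failure".
But profiling had already shown:

the index_select / index_copy_ hotspots were indeed eliminated
batched matmul had become the main path

That means the next step should keep pushing on local compute organization rather than going back to doubt the Phase 20 communication path.

3. Writing the trade-off explicitly into the benchmark is itself an engineering skill

Many optimization projects cut corners here:

write the conclusion as soon as throughput improves
leave the memory cost unmentioned
let reviewers dig through the code to find out whether there is a cache

I do not want to do that.
In Phase 21, ep_grouped_runtime_resident_ratio = 0.8334 is explicitly part of the benchmark output, which makes this more like an infra project worth presenting than a plain "it got a bit faster again".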

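Summary

What Phase 21 really accomplished is not just "adding a grouped mode" but moving MoE EP local expert execution forward by a whole layer:

from per-expert scatter/gather to grouped contiguous slices
from the naive local hot path to batched gate/up compute
getting the formal benchmark past the gate while preserving exact bytes, true expert sharding, and numerical equivalence
while explicitly exposing the real cost of the resident local cache

So the most accurate summary of this round is not:

I built grouped execution.

but:

I built grouped execution so that it passes the benchmark, and I can explain where the gain comes from, what it costs, and why it is not yet the final grouped GEMM.

For an inference engineering project, that matters more than simply gaining a few more percentage points.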

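Series navigation

Introduction: mini-infer Systems in Practice-00-Introduction: The Project Roadmap from a Minimal Inference Pipeline to MoE Expert Parallel
Previous: mini-infer Systems in Practice-23-Control Plane: The Hardest Part Is Not Producing the Statistics but Measuring Correctly First
Next: none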