Memory-Bound, Not Bandwidth-Bound: The Physical AI Inference Gap in Batch-1 LLM Decode

Abstract

Physical AI systems, including robots, autonomous vehicles, embodied agents and edge copilots, often run a different inference workload from cloud LLM serving: single-stream, batch-1 autoregressive decode, where one robot, camera feed or user session waits on the next token. This workload is usually described as memory-bandwidth-bound: each decode step streams the model weights and the active KV cache, so latency should scale inversely with peak memory bandwidth. We show that this account is true but incomplete. We measure batch-1 decode for three 7B- to 8B-class GQA transformers across four NVIDIA GPUs: H100 SXM5, A100-80GB SXM4, L40S and L4. We evaluate context lengths from 2048 to 16384, producing 44 valid cells under a controlled bf16 SDPA (scaled dot-product attention) setup. The achieved fraction of peak memory bandwidth falls as peak bandwidth rises. On the headline Qwen-2.5-7B ctx=2048 cell, an L4 achieves roughly 81 percent of the throughput implied by its analytic memory floor, while an H100 achieves only 27 percent. Physical-AI decode is memory-dominated, but faster memory does not translate into proportional latency gains. We test the missing term with a CUDA Graphs A/B experiment. On H100 at ctx=2048, CUDA Graphs speeds up decode by 1.259x across N=10 fresh sessions, with a 95 percent bootstrap confidence interval of 1.253 to 1.267. On L4, the same intervention gives only 1.028x. This isolates a launch-side overhead that becomes visible on fast GPUs but remains mostly hidden on slower, bandwidth-bound GPUs. The deployment implication is that memory savings matter only when the runtime realises them. On L4, bf16 decode sits close to the memory floor, but common quantised paths do not recover the expected 4x weight-traffic reduction: from a 62.32 ms/step bf16 baseline, bnb-nf4 reaches only 59.36 ms/step and AutoAWQ+Marlin reaches 45.24 ms/step. GPTQ+ExLlamaV2, with Ada-tuned int4 kernels, reaches 17.36 ms/step.
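
The memory-floor argument can be illustrated with a short back-of-the-envelope calculation. The sketch below is not the accounting used for the reported results. It assumes a simple floor of (weight bytes + KV-cache bytes) divided by peak bandwidth, and it takes the Qwen-2.5-7B shape (about 7.61B parameters, 28 layers, 4 KV heads, head dimension 128) and the L4 peak bandwidth (300 GB/s) from public model and vendor documentation. All names in the code are hypothetical. Only the measured latencies are taken from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    """Rough model shape; every value here is an assumption, not a measurement."""
    n_params: float        # total parameter count
    n_layers: int          # number of transformer layers
    n_kv_heads: int        # KV heads under grouped-query attention
    head_dim: int          # dimension of each attention head
    bytes_per_elem: int = 2  # bf16


def decode_bytes_per_step(model: ModelSpec, ctx_len: int) -> float:
    """Bytes streamed per batch-1 decode step: all weights plus the active KV cache."""
    weight_bytes = model.n_params * model.bytes_per_elem
    # Keys and values (factor 2) for every layer, KV head and cached position.
    kv_bytes = (2 * model.n_layers * model.n_kv_heads * model.head_dim
                * ctx_len * model.bytes_per_elem)
    return weight_bytes + kv_bytes


def memory_floor_ms(model: ModelSpec, ctx_len: int, peak_gb_per_s: float) -> float:
    """Lower bound on step latency if memory traffic ran at peak bandwidth."""
    return decode_bytes_per_step(model, ctx_len) / (peak_gb_per_s * 1e9) * 1e3


# Assumed values (public model card and vendor datasheet, not from this work).
QWEN25_7B = ModelSpec(n_params=7.61e9, n_layers=28, n_kv_heads=4, head_dim=128)
L4_PEAK_GB_PER_S = 300.0

# Measured L4 latencies in ms/step, as reported in the abstract.
BF16_MS = 62.32
QUANTISED_MS = {
    "bnb-nf4": 59.36,
    "AutoAWQ+Marlin": 45.24,
    "GPTQ+ExLlamaV2": 17.36,
}

floor_ms = memory_floor_ms(QWEN25_7B, ctx_len=2048, peak_gb_per_s=L4_PEAK_GB_PER_S)
fraction_of_floor = floor_ms / BF16_MS

# Speedup of each quantised path over bf16; ideal int4 weight traffic suggests about 4x.
speedups = {name: BF16_MS / ms for name, ms in QUANTISED_MS.items()}

if __name__ == "__main__":
    print(f"L4 memory floor: {floor_ms:.2f} ms/step")
    print(f"bf16 achieved fraction of floor: {fraction_of_floor:.2f}")
    for name, s in speedups.items():
        print(f"{name}: {s:.2f}x over bf16")
```

Under these assumptions the L4 floor comes out near 51 ms/step, so the 62.32 ms/step bf16 measurement sits at about 0.82 of the floor, in line with the "roughly 81 percent" figure. The same arithmetic shows the quantised paths reaching about 1.05x, 1.38x and 3.59x over bf16, against the roughly 4x that int4 weight traffic alone would suggest.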