Memory-Bound but Not Bandwidth-Bound: The Physical-AI Inference Gap in Batch-1 LLM Decode

Abstract

Physical AI systems, including robots, autonomous vehicles, embodied agents and edge copilots, often run a different inference workload from cloud LLM serving: single-stream, batch-1 autoregressive decode, in which one robot, camera feed or user session waits on the next token. This workload is usually described as memory-bandwidth-bound: each decode step streams the model weights and the active KV cache, so per-token latency should scale inversely with peak memory bandwidth. We show that this account is true but incomplete. We measure batch-1 decode for three 7B to 8B-class GQA transformers on four NVIDIA GPUs (H100 SXM5, A100-80GB SXM4, L40S and L4) at context lengths from 2048 to 16384, producing 44 valid cells under a controlled bf16 SDPA setup. The achieved fraction of peak memory bandwidth falls as peak bandwidth rises. On the headline Qwen-2.5-7B ctx=2048 cell, an L4 reaches roughly 81 percent of its analytic memory floor, while an H100 reaches only 27 percent. Physical-AI decode is memory-dominated, but faster memory does not translate into proportional latency gains. We test the missing term with a CUDA Graphs A/B experiment. On H100 at ctx=2048, CUDA Graphs speeds up decode by 1.259x across N=10 fresh sessions (95 percent bootstrap confidence interval 1.253 to 1.267). On L4, the same intervention gives only 1.028x. This isolates a launch-side overhead that becomes visible on fast GPUs but remains mostly hidden on slower, bandwidth-bound GPUs. The deployment implication is that memory savings matter only when the runtime realises them. On L4, bf16 decode sits close to the memory floor, but common quantised paths do not recover the expected 4x weight-traffic reduction: from a 62.32 ms/step bf16 baseline, bnb-nf4 reaches 59.36 ms/step and AutoAWQ+Marlin reaches 45.24 ms/step. By contrast, GPTQ+ExLlamaV2, with Ada-tuned int4 kernels, reaches 17.36 ms/step.
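
The memory-floor argument and the quantisation comparison can be checked with a little arithmetic. The sketch below is only a back-of-envelope illustration, not the paper's actual floor model: the parameter count (about 7.6e9), the attention shape (28 layers, 4 KV heads, head dimension 128) and the L4 peak bandwidth (300 GB/s) are assumptions taken from public model and hardware specifications rather than from the text above, and the function names are hypothetical. Only the ms/step latencies come from the abstract.

```python
# Back-of-envelope sketch. Model and hardware constants are ASSUMPTIONS
# (public specs), not values stated in the abstract; only the ms/step
# latencies are taken from the text.

BF16_BYTES = 2


def kv_cache_bytes(ctx_len, n_layers, n_kv_heads, head_dim, dtype_bytes=BF16_BYTES):
    """Bytes of K and V read per decode step for a single sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * ctx_len


def memory_floor_ms(weight_bytes, kv_bytes, peak_bw_gb_s):
    """Step time if weights and KV cache are streamed once at peak bandwidth."""
    return (weight_bytes + kv_bytes) / (peak_bw_gb_s * 1e9) * 1e3


def achieved_fraction(floor_ms, measured_ms):
    """Fraction of the analytic floor that the measured step time realises."""
    return floor_ms / measured_ms


def realised_speedup(baseline_ms, variant_ms):
    """Latency ratio of a baseline step to a variant step."""
    return baseline_ms / variant_ms


# Assumed Qwen-2.5-7B shape and assumed L4 peak bandwidth.
weights = 7.6e9 * BF16_BYTES
kv = kv_cache_bytes(ctx_len=2048, n_layers=28, n_kv_heads=4, head_dim=128)
l4_floor = memory_floor_ms(weights, kv, peak_bw_gb_s=300.0)
l4_fraction = achieved_fraction(l4_floor, measured_ms=62.32)
print(f"L4 floor ~{l4_floor:.1f} ms, achieved fraction ~{l4_fraction:.2f}")

# Quantised paths on L4 (latencies from the abstract, ms/step).
BF16_BASELINE_MS = 62.32
quantised_ms = {
    "bnb-nf4": 59.36,
    "AutoAWQ+Marlin": 45.24,
    "GPTQ+ExLlamaV2": 17.36,
}
speedups = {name: realised_speedup(BF16_BASELINE_MS, ms) for name, ms in quantised_ms.items()}
for name, s in speedups.items():
    print(f"{name}: {s:.2f}x realised vs. 4x expected weight-traffic reduction")
```

Under these assumed constants the L4 fraction comes out at about 0.82, in the same range as the roughly 81 percent quoted above. The realised speedups are about 1.05x, 1.38x and 3.59x, so only the GPTQ+ExLlamaV2 path approaches the nominal 4x.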