Memory-Bound but Not Bandwidth-Bound: The Physical-AI Inference Gap in Batch-1 LLM Decode

Abstract

Physical AI systems, including robots, autonomous vehicles, embodied agents and edge copilots, often run a different inference workload from cloud LLM serving: single-stream, batch-1 autoregressive decode, where one robot, camera feed or user session waits on the next token. This workload is usually described as memory-bandwidth-bound: each decode step streams the model weights and the active KV cache, so latency should scale inversely with peak memory bandwidth. We show that this account is true but incomplete. We measure batch-1 decode for three 7B to 8B-class GQA transformers across four NVIDIA GPUs: H100 SXM5, A100-80GB SXM4, L40S and L4. We evaluate context lengths from 2048 to 16384, producing 44 valid cells under a controlled bf16 SDPA setup. The achieved fraction of peak memory bandwidth falls as peak bandwidth rises. On the headline Qwen-2.5-7B ctx=2048 cell, an L4 reaches roughly 81 percent of its analytic memory floor, while an H100 reaches only 27 percent. Physical-AI decode is memory-dominated, but faster memory does not translate into proportional latency gains. We test the missing term with a CUDA Graphs A/B experiment. On H100 at ctx=2048, CUDA Graphs improves decode latency by 1.259x across N=10 fresh sessions, with a 95 percent bootstrap confidence interval of 1.253 to 1.267. On L4, the same intervention gives only 1.028x. This isolates a launch-side overhead that becomes visible on fast GPUs but remains mostly hidden on slower, bandwidth-bound GPUs. The deployment implication is that memory savings matter only when the runtime realises them. On L4, bf16 decode sits close to the memory floor, but common quantised paths do not recover the expected 4x weight-traffic reduction: against a 62.32 ms/step bf16 baseline, bnb-nf4 reaches 59.36 ms/step and AutoAWQ+Marlin reaches 45.24 ms/step. By contrast, GPTQ+ExLlamaV2, with Ada-tuned int4 kernels, reaches 17.36 ms/step.
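
The two pieces of arithmetic the abstract leans on, the analytic memory floor and the bootstrap interval on a speedup, can be sketched in a few lines of Python. This is a rough illustration rather than the accounting used in the measurements. The parameter count (about 7.6e9), the layer and KV-head shape (28 layers, 4 KV heads, head dimension 128), the 300 GB/s L4 peak, and the choice of a percentile bootstrap over the ratio of session means are all assumptions made here. The per-session latencies fed to the bootstrap are invented solely to exercise the function.

```python
import random
import statistics


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """Bytes of K and V read per decode step for one sequence (GQA layout)."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem


def memory_floor_ms(weight_bytes, kv_bytes, peak_bw_bytes_per_s):
    """Step latency if every byte streamed exactly once at peak bandwidth."""
    return (weight_bytes + kv_bytes) / peak_bw_bytes_per_s * 1e3


def bootstrap_speedup_ci(baseline_ms, treated_ms, n_resamples=10_000,
                         alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(baseline) / mean(treated)."""
    rng = random.Random(seed)
    ratios = sorted(
        statistics.fmean(rng.choices(baseline_ms, k=len(baseline_ms)))
        / statistics.fmean(rng.choices(treated_ms, k=len(treated_ms)))
        for _ in range(n_resamples)
    )
    lo = ratios[int(alpha / 2 * n_resamples)]
    hi = ratios[int((1 - alpha / 2) * n_resamples) - 1]
    point = statistics.fmean(baseline_ms) / statistics.fmean(treated_ms)
    return point, lo, hi


# Assumed inputs (not taken from the paper): bf16 weights for ~7.6e9
# parameters, a 28-layer model with 4 KV heads of dimension 128, and a
# 300 GB/s peak memory bandwidth for the L4.
weights = 7.6e9 * 2
kv = kv_cache_bytes(n_layers=28, n_kv_heads=4, head_dim=128, ctx_len=2048)
floor_ms = memory_floor_ms(weights, kv, peak_bw_bytes_per_s=300e9)
measured_ms = 62.32  # bf16 L4 baseline quoted in the abstract
fraction_of_floor = floor_ms / measured_ms

# Hypothetical per-session step latencies in ms (invented for illustration).
eager = [10.02, 9.98, 10.05, 9.97, 10.01, 10.03, 9.99, 10.00, 10.04, 9.96]
graphs = [7.96, 8.01, 7.98, 8.03, 7.99, 8.02, 7.97, 8.00, 8.04, 7.95]
speedup, ci_lo, ci_hi = bootstrap_speedup_ci(eager, graphs)

print(f"floor {floor_ms:.2f} ms, fraction of floor {fraction_of_floor:.2f}")
print(f"speedup {speedup:.3f}x, 95% CI [{ci_lo:.3f}, {ci_hi:.3f}]")
```

Under these assumed inputs the fraction of the floor comes out a little above 0.8, in the neighbourhood of the roughly 81 percent quoted for the L4. The exact value depends on which weights and cache bytes are counted. The same floor shrinks by roughly an order of magnitude on a GPU with ten times the bandwidth, which is why any fixed per-step overhead becomes a larger share of the measured latency there.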