Memory-Bound but Not Bandwidth-Limited: The Physical-AI Inference Gap in Batch-1 LLM Decoding

Abstract

Physical AI systems, including robots, autonomous vehicles, embodied agents and edge copilots, often run a different inference workload from cloud LLM serving: single-stream, batch-1 autoregressive decode, where one robot, camera feed or user session waits on the next token. This workload is usually described as memory-bandwidth-bound: each decode step streams the model weights and the active KV cache, so latency should scale with peak memory bandwidth. We show that this account is true but incomplete. We measure batch-1 decode for three 7B to 8B-class GQA transformers across four NVIDIA GPUs (H100 SXM5, A100-80GB SXM4, L40S and L4) at context lengths from 2048 to 16384, producing 44 valid cells under a controlled bf16 SDPA setup. The achieved fraction of peak memory bandwidth falls as peak bandwidth rises. On the headline Qwen-2.5-7B ctx=2048 cell, an L4 reaches roughly 81 percent of its analytic memory floor, while an H100 reaches only 27 percent. Physical-AI decode is memory-dominated, but faster memory does not translate into proportional latency gains. We test the missing term with a CUDA Graphs A/B experiment. On H100 at ctx=2048, CUDA Graphs improves decode latency by 1.259x across N=10 fresh sessions, with a 95 percent bootstrap confidence interval of 1.253 to 1.267. On L4, the same intervention gives only 1.028x. This isolates a launch-side overhead that becomes visible on fast GPUs but remains mostly hidden on slower, bandwidth-bound GPUs. The deployment implication is that memory savings matter only when the runtime realises them. On L4, bf16 decode sits close to the memory floor, but common quantised paths do not recover the expected 4x weight-traffic reduction: against a 62.32 ms/step bf16 baseline, bnb-nf4 reaches 59.36 ms/step and AutoAWQ+Marlin reaches 45.24 ms/step. GPTQ+ExLlamaV2, with Ada-tuned int4 kernels, reaches 17.36 ms/step (about 3.6x).
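The arithmetic behind these figures can be sketched in a few lines. The following is a minimal back-of-envelope illustration, not the authors' measurement code. It assumes vendor spec-sheet peak bandwidths (300 GB/s for L4, 3350 GB/s for H100 SXM5) and a Qwen-2.5-7B shape taken from its public model card (7.61B parameters, 28 layers, 4 KV heads, head dimension 128); none of these values appear in the abstract, and the exact definition of the analytic memory floor used in the paper may differ (for example, in which weights count as streamed). The latency and speedup figures are quoted from the abstract.

```python
# Back-of-envelope checks for the abstract's figures (illustrative sketch only).

# ASSUMPTIONS, not stated in the abstract: spec-sheet peak memory bandwidth
# and the Qwen-2.5-7B shape from its public model card.
PEAK_BW_GBPS = {"L4": 300.0, "H100 SXM5": 3350.0}  # assumed, GB/s
N_PARAMS = 7.61e9   # assumed total parameter count
N_LAYERS = 28       # assumed
N_KV_HEADS = 4      # assumed (GQA)
HEAD_DIM = 128      # assumed
BYTES_BF16 = 2


def memory_floor_ms(ctx_len: int, peak_bw_gbps: float) -> float:
    """Time to stream all weights plus the active KV cache once at peak bandwidth."""
    weight_bytes = N_PARAMS * BYTES_BF16
    # K and V tensors, per layer, per KV head, per cached token.
    kv_bytes = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_BF16 * ctx_len
    return (weight_bytes + kv_bytes) / (peak_bw_gbps * 1e9) * 1e3


def step_fraction_removed(speedup: float) -> float:
    """Share of the original step time eliminated by a given speedup."""
    return 1.0 - 1.0 / speedup


# Figures below are quoted from the abstract.
BF16_L4_MS = 62.32
QUANT_L4_MS = {"bnb-nf4": 59.36, "AutoAWQ+Marlin": 45.24, "GPTQ+ExLlamaV2": 17.36}
GRAPH_SPEEDUP = {"H100 SXM5": 1.259, "L4": 1.028}

# 1) Analytic floor versus the measured bf16 step on L4 at ctx=2048.
floor_l4_ms = memory_floor_ms(2048, PEAK_BW_GBPS["L4"])
floor_ratio_l4 = floor_l4_ms / BF16_L4_MS

# 2) Share of step time that CUDA Graphs removed on each GPU.
graph_removed = {gpu: step_fraction_removed(s) for gpu, s in GRAPH_SPEEDUP.items()}

# 3) Realised speedup of each quantised path versus the ideal 4x.
realised = {name: BF16_L4_MS / ms for name, ms in QUANT_L4_MS.items()}

print(f"L4 floor: {floor_l4_ms:.1f} ms, floor/measured: {floor_ratio_l4:.2f}")
for gpu, frac in graph_removed.items():
    print(f"{gpu}: CUDA Graphs removed {frac:.1%} of step time")
for name, s in realised.items():
    print(f"{name}: {s:.2f}x realised ({s / 4.0:.0%} of the ideal 4x)")
```

Under these assumed inputs the L4 floor comes out near 51 ms, which is in the same range as the roughly 81 percent figure reported for the 62.32 ms bf16 step. A 1.259x speedup corresponds to removing about 21 percent of the step time on H100, against about 3 percent for 1.028x on L4. The three quantised paths realise about 1.05x, 1.38x and 3.59x respectively, so only the last comes close to the 4x that a pure weight-traffic argument would predict.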