Memory-Bound but Not Bandwidth-Limited: The Physical AI Inference Gap in Batch-1 LLM Decode

Abstract

Physical AI systems, including robots, autonomous vehicles, embodied agents and edge copilots, often run a different inference workload from cloud LLM serving: single-stream, batch-1 autoregressive decode, in which one robot, camera feed or user session waits on the next token. This workload is usually described as memory-bandwidth-bound: each decode step streams the model weights and the active KV cache, so latency should scale inversely with peak memory bandwidth. We show that this account is true but incomplete. We measure batch-1 decode for three 7B to 8B-class GQA transformers on four NVIDIA GPUs: H100 SXM5, A100-80GB SXM4, L40S and L4. We evaluate context lengths from 2048 to 16384, which yields 44 valid cells under a controlled bf16 SDPA setup. The achieved fraction of peak memory bandwidth falls as peak bandwidth rises. On the headline cell, Qwen-2.5-7B at ctx=2048, an L4 reaches roughly 81 percent of the speed implied by its analytic memory floor, while an H100 reaches only 27 percent. Physical-AI decode is memory-dominated, but faster memory does not translate into proportional latency gains. We test for the missing term with a CUDA Graphs A/B experiment. On H100 at ctx=2048, CUDA Graphs speeds up decode by 1.259x across N=10 fresh sessions, with a 95 percent bootstrap confidence interval of 1.253 to 1.267. On L4, the same intervention gives only 1.028x. This isolates a launch-side overhead that becomes visible on fast GPUs but stays mostly hidden on slower, bandwidth-bound GPUs. The deployment implication is that memory savings matter only when the runtime realises them. On L4, bf16 decode sits close to the memory floor, yet common quantised paths do not recover the expected 4x reduction in weight traffic: from a bf16 baseline of 62.32 ms/step, bnb-nf4 reaches 59.36 ms/step and AutoAWQ+Marlin reaches 45.24 ms/step. By contrast, GPTQ+ExLlamaV2, with Ada-tuned int4 kernels, reaches 17.36 ms/step.
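The analytic memory floor mentioned above can be illustrated with a minimal sketch: bytes streamed per decode step (weights plus KV cache) divided by peak memory bandwidth. Everything numeric here is an assumption rather than a value from the paper: the parameter count and layer shape are taken from the public Qwen-2.5-7B model card, the peak bandwidths are vendor datasheet figures, and the helper names (`DecodeCell`, `memory_floor_ms`, `achieved_fraction`) are hypothetical. The paper's own floor may count bytes differently, for example by excluding the embedding table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodeCell:
    """Inputs for one batch-1 decode cell. All values are assumptions."""
    n_params: float        # parameters streamed per decode step
    weight_bytes: float    # bytes per weight (2 for bf16)
    n_layers: int
    n_kv_heads: int        # GQA key/value heads
    head_dim: int
    ctx_len: int           # tokens currently held in the KV cache
    kv_bytes: float = 2.0  # bytes per cached element (bf16)


def bytes_per_step(cell: DecodeCell) -> float:
    """Weights plus K and V caches read once per generated token."""
    weights = cell.n_params * cell.weight_bytes
    kv_cache = (2 * cell.n_layers * cell.n_kv_heads * cell.head_dim
                * cell.ctx_len * cell.kv_bytes)
    return weights + kv_cache


def memory_floor_ms(cell: DecodeCell, peak_gb_per_s: float) -> float:
    """Lower bound on step latency if memory ran at peak bandwidth."""
    return bytes_per_step(cell) / (peak_gb_per_s * 1e9) * 1e3


def achieved_fraction(floor_ms: float, measured_ms: float) -> float:
    """Share of peak bandwidth implied by a measured step latency."""
    return floor_ms / measured_ms


# Assumed model shape (public Qwen-2.5-7B model card) at ctx=2048.
QWEN_2048 = DecodeCell(n_params=7.61e9, weight_bytes=2.0, n_layers=28,
                       n_kv_heads=4, head_dim=128, ctx_len=2048)

# Assumed peak memory bandwidth in GB/s (vendor datasheet figures).
PEAK_GB_PER_S = {"L4": 300.0, "L40S": 864.0,
                 "A100-80GB SXM4": 2039.0, "H100 SXM5": 3350.0}

floors = {gpu: memory_floor_ms(QWEN_2048, bw)
          for gpu, bw in PEAK_GB_PER_S.items()}

if __name__ == "__main__":
    for gpu, ms in floors.items():
        print(f"{gpu:>15}: floor {ms:6.2f} ms/step")
```

Under these assumptions the L4 floor lands near 51 ms/step and the H100 floor near 4.6 ms/step. Dividing a floor by a measured step latency with `achieved_fraction` gives the achieved share of peak bandwidth for that cell.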
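The reported ratios can be put on a common footing with a few lines of arithmetic. The sketch below uses only the figures quoted in the abstract; the helper names are hypothetical. It makes two simplifying assumptions: that the 4x figure is an upper bound on end-to-end speedup (it applies to weight traffic only, not to the KV cache or launch overhead), and that the CUDA Graphs gain can be read as removed per-step overhead with everything else unchanged.

```python
# Figures quoted in the abstract (L4, ms per decode step).
BF16_BASELINE_MS = 62.32
QUANTISED_MS = {
    "bnb-nf4": 59.36,
    "AutoAWQ+Marlin": 45.24,
    "GPTQ+ExLlamaV2": 17.36,
}
# bf16 -> int4 weights; treated here as an upper bound on speedup.
IDEAL_WEIGHT_TRAFFIC_REDUCTION = 4.0

# CUDA Graphs A/B speedups quoted in the abstract (ctx=2048).
CUDA_GRAPHS_SPEEDUP = {"H100": 1.259, "L4": 1.028}


def speedup(baseline_ms: float, new_ms: float) -> float:
    """Latency ratio: values above 1 mean the new path is faster."""
    return baseline_ms / new_ms


def fraction_removed(speedup_ratio: float) -> float:
    """Share of the original step latency that the speedup eliminated."""
    return 1.0 - 1.0 / speedup_ratio


quant_speedup = {name: speedup(BF16_BASELINE_MS, ms)
                 for name, ms in QUANTISED_MS.items()}
share_of_ideal = {name: s / IDEAL_WEIGHT_TRAFFIC_REDUCTION
                  for name, s in quant_speedup.items()}
launch_share = {gpu: fraction_removed(s)
                for gpu, s in CUDA_GRAPHS_SPEEDUP.items()}

if __name__ == "__main__":
    for name, s in quant_speedup.items():
        print(f"{name:>15}: {s:.2f}x ({share_of_ideal[name]:.0%} of 4x)")
    for gpu, f in launch_share.items():
        print(f"{gpu:>15}: CUDA Graphs removes {f:.1%} of step latency")
```

Read this way, bnb-nf4 and AutoAWQ+Marlin deliver about 1.05x and 1.38x, GPTQ+ExLlamaV2 about 3.59x, and CUDA Graphs removes roughly a fifth of the H100 step latency against under 3 percent on L4.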