The Price of Anarchy in Disaggregated Inference

Abstract

Disaggregated inference architectures physically separate the prefill and decode phases onto distinct GPU pools, creating competing "agents" that share a fixed hardware budget. We provide, to our knowledge, the first formal game-theoretic analysis of this architecture, using NVIDIA Dynamo as a concrete case study. We model disaggregated serving as three coupled games: a two-player resource game between the prefill and decode pools, a selfish caching game over the hierarchical KV cache, and a congestion game with positive externalities for request routing. We validate the latter two empirically; the P/D resource game is treated analytically (Section 9.2). We characterize how GPU saturation induces regime transitions that shift the game's payoff structure: below saturation, selfish behavior has a bounded Price of Anarchy (PoA); at saturation, superlinear latency and cache externalities drive our empirical estimator PoA-hat (defined in Section 6.4) upward. Based on this analysis, we design an adaptive controller that detects saturation transitions in real time and adjusts routing parameters accordingly, shifting from cache-affinity exploitation to load-balanced congestion avoidance. We instantiate our framework on a 3-node NVIDIA B200 cluster running Dynamo with two models, Nemotron-4-340B (TP=8, full-node workers with cross-InfiniBand KV transfers) and Llama-3.1-70B (TP=4), and find the same three-regime PoA-hat structure, with the same first post-knee grid point (C=128), on both models. Adaptive routing shifts each model to a better operating point. Our strongest result is on the 70B 1P/5D topology, where PoA-hat drops 3.1x (from 66.4 to 21.5) in the saturated phase at a 13% throughput cost. On the 70B 1P/2D topology, PoA-hat drops 2.2x and TTFT P99 drops 7.6x (see Section 8.5).
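
The controller idea in the abstract (favor cache affinity below saturation, favor load balance at saturation) can be illustrated with a minimal Python sketch. This is not the paper's controller and does not use the Dynamo API: the `Worker` fields, the latency-ratio saturation test, the threshold of 2.0, and the weight pairs are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    cache_overlap: float  # hypothetical: fraction of the request prefix already cached, in [0, 1]
    queue_depth: int      # hypothetical: number of outstanding requests on this worker


def is_saturated(baseline_latency: float, current_latency: float,
                 threshold: float = 2.0) -> bool:
    """Assumed detector: flag saturation when latency exceeds threshold x baseline."""
    return current_latency > threshold * baseline_latency


def routing_weights(saturated: bool) -> tuple[float, float]:
    """Return (cache_weight, load_weight); the values are illustrative only."""
    return (0.2, 1.0) if saturated else (1.0, 0.2)


def pick_worker(workers: list[Worker], saturated: bool) -> Worker:
    """Pick the worker with the lowest cost: weighted load minus weighted cache benefit."""
    cache_w, load_w = routing_weights(saturated)
    # Normalize queue depth to [0, 1]; fall back to 1 when every queue is empty.
    max_queue = max(w.queue_depth for w in workers) or 1

    def cost(w: Worker) -> float:
        return load_w * (w.queue_depth / max_queue) - cache_w * w.cache_overlap

    return min(workers, key=cost)


workers = [Worker("decode-0", 0.9, 8), Worker("decode-1", 0.1, 1)]
print(pick_worker(workers, is_saturated(100.0, 120.0)).name)  # decode-0
print(pick_worker(workers, is_saturated(100.0, 450.0)).name)  # decode-1
```

Below the assumed threshold, the worker with the warm cache wins despite its longer queue; once the detector fires, the same request goes to the lightly loaded worker. A real controller would need a more robust saturation signal than a single latency ratio.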