The Price of Anarchy in Disaggregated Inference

June 11, 2026
Author: Athos Georgiou
cs.AI

Abstract

Disaggregated inference architectures physically separate the prefill and decode phases onto distinct GPU pools, creating competing "agents" that share a fixed hardware budget. To our knowledge, we provide the first formal game-theoretic analysis of this architecture, using NVIDIA Dynamo as a concrete case study. We model disaggregated serving as three coupled games: a two-player resource game between the prefill and decode pools, a selfish caching game over the hierarchical KV cache, and a congestion game with positive externalities for request routing. We empirically validate the latter two; the P/D resource game is treated analytically (Section 9.2). We characterize how GPU saturation induces regime transitions that shift the game's payoff structure: below saturation, selfish behavior has a bounded Price of Anarchy (PoA); at saturation, superlinear latency and cache externalities drive our empirical estimator PoA-hat (defined in Section 6.4) upward. Based on this analysis, we design an adaptive controller that detects saturation transitions in real time and adjusts routing parameters accordingly, shifting from cache-affinity exploitation to load-balanced congestion avoidance. We instantiate our framework on a 3-node NVIDIA B200 cluster running Dynamo with two models, Nemotron-4-340B (TP=8, full-node workers with cross-InfiniBand KV transfers) and Llama-3.1-70B (TP=4), and find the same three-regime PoA-hat structure, with the same first post-knee grid point (C=128), on both models. Adaptive routing shifts each model to a better operating point. Our strongest result is on the 70B 1P/5D topology, where PoA-hat drops 3.1x (from 66.4 to 21.5) in the saturated phase at a 13% throughput cost. On the 70B 1P/2D topology, PoA-hat drops 2.2x and TTFT P99 drops 7.6x (see Section 8.5).
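
For readers unfamiliar with the Price of Anarchy, the following is a minimal sketch of the textbook quantity (social cost at the selfish equilibrium divided by the optimal social cost) on Pigou's classic two-link congestion example. It is not the paper's estimator PoA-hat, which is defined in Section 6.4 and not reproduced here; the function names and the grid search are illustrative assumptions. The last line only re-checks the arithmetic of the reported 66.4 to 21.5 drop.

```python
# Pigou's example (textbook, NOT the paper's PoA-hat estimator):
# one unit of traffic chooses between a constant-latency link (latency 1)
# and a load-dependent link (latency equal to its load x).

def social_cost(x: float) -> float:
    """Total latency when a fraction x uses the load-dependent link."""
    return x * x + (1.0 - x) * 1.0


def nash_flow() -> float:
    """Selfish equilibrium: the load-dependent link never costs more than 1,
    so every user picks it and the whole unit of traffic ends up there."""
    return 1.0


def optimal_flow(steps: int = 10_000) -> float:
    """Socially optimal split, found by a simple grid search (illustrative)."""
    return min((i / steps for i in range(steps + 1)), key=social_cost)


# Price of Anarchy = equilibrium cost / optimal cost.
poa = social_cost(nash_flow()) / social_cost(optimal_flow())

# Arithmetic check of the reduction reported in the abstract (66.4 to 21.5).
reported_drop = 66.4 / 21.5

if __name__ == "__main__":
    print(f"Pigou PoA: {poa:.4f}")
    print(f"Reported PoA-hat reduction factor: {reported_drop:.2f}")
```

In this toy network the ratio is 4/3, the known worst case for linear latency functions; with steeper (superlinear) latency functions the ratio grows, which matches the intuition behind the saturated regime described above.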