The Price of Anarchy in Disaggregated Inference

Abstract

Disaggregated inference architectures physically separate the prefill and decode phases onto distinct GPU pools, creating competing "agents" that share a fixed hardware budget. To our knowledge, we provide the first formal game-theoretic analysis of this architecture, using NVIDIA Dynamo as a concrete case study. We model disaggregated serving as three coupled games: a two-player resource game between the prefill and decode (P/D) pools, a selfish caching game over the hierarchical KV cache, and a congestion game with positive externalities for request routing. We validate the latter two empirically; the P/D resource game is treated analytically (Section 9.2). We characterize how GPU saturation induces regime transitions that shift the game's payoff structure: below saturation, selfish behavior has a bounded Price of Anarchy (PoA); at saturation, superlinear latency and cache externalities drive our empirical estimator PoA-hat (defined in Section 6.4) upward. Based on this analysis, we design an adaptive controller that detects saturation transitions in real time and adjusts routing parameters accordingly, shifting from cache-affinity exploitation to load-balanced congestion avoidance. We instantiate our framework on a 3-node NVIDIA B200 cluster running Dynamo with two models, Nemotron-4-340B (TP=8, full-node workers with KV transfers across InfiniBand) and Llama-3.1-70B (TP=4). Both models show the same three-regime PoA-hat structure, with the same first post-knee grid point (C=128). Adaptive routing shifts each model to a better operating point. Our strongest result is on the 70B 1P/5D topology, where PoA-hat drops 3.1x (from 66.4 to 21.5) in the saturated phase at a 13% throughput cost. On the 70B 1P/2D topology, PoA-hat drops 2.2x and P99 time to first token (TTFT) drops 7.6x (see Section 8.5).
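For readers unfamiliar with the Price of Anarchy, the following minimal sketch illustrates the general idea that steeper (superlinear) latency curves inflate the ratio between selfish and optimal cost. It uses the textbook Pigou two-link routing example, not the games or the PoA-hat estimator defined in this paper; the function name `pigou_poa` and the latency exponent `p` are assumptions introduced purely for illustration.

```python
def pigou_poa(p: float) -> float:
    """Price of Anarchy for Pigou's two-link network (illustrative only).

    One unit of traffic chooses between link 1 with latency x**p
    (x is the share of traffic on that link) and link 2 with constant
    latency 1. This is a textbook toy model, not the paper's model.
    """
    # Selfish equilibrium: link 1 never costs more than link 2, so all
    # traffic takes link 1 and the total cost is 1 * 1**p = 1.
    equilibrium_cost = 1.0

    # Social optimum: minimise x**(p + 1) + (1 - x) over x in [0, 1].
    # The first-order condition (p + 1) * x**p = 1 gives the optimal share.
    x_opt = (p + 1.0) ** (-1.0 / p)
    optimal_cost = x_opt ** (p + 1.0) + (1.0 - x_opt)

    return equilibrium_cost / optimal_cost


if __name__ == "__main__":
    # Larger p means a more sharply superlinear latency curve.
    for p in (1, 2, 4, 8):
        print(f"p={p}: PoA={pigou_poa(p):.3f}")
```

With linear latency (`p = 1`) the sketch recovers the classical bound of 4/3, and the ratio grows as `p` increases. This mirrors, in a much simpler setting, the qualitative claim above that the cost of selfish behavior is bounded below saturation and rises once latency becomes superlinear.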