The Price of Anarchy in Disaggregated Inference

June 11, 2026
Author: Athos Georgiou
cs.AI

Abstract

Disaggregated inference architectures physically separate the prefill and decode phases onto distinct GPU pools, creating competing "agents" that share a fixed hardware budget. We provide, to our knowledge, the first formal game-theoretic analysis of this architecture, using NVIDIA Dynamo as a concrete case study. We model disaggregated serving as three coupled games: a two-player resource game between the prefill and decode pools, a selfish caching game over the hierarchical KV cache, and a congestion game with positive externalities for request routing. We empirically validate the latter two; the prefill/decode (P/D) resource game is treated analytically (Section 9.2). We characterize how GPU saturation induces regime transitions that shift the game's payoff structure: below saturation, selfish behavior has a bounded Price of Anarchy (PoA); at saturation, superlinear latency and cache externalities drive our empirical estimator PoA-hat (defined in Section 6.4) upward. Based on this analysis, we design an adaptive controller that detects saturation transitions in real time and adjusts routing parameters accordingly, shifting from cache-affinity exploitation to load-balanced congestion avoidance. We instantiate our framework on a 3-node NVIDIA B200 cluster running Dynamo with two models, Nemotron-4-340B (TP=8, full-node workers with cross-InfiniBand KV transfers) and Llama-3.1-70B (TP=4), and find the same three-regime PoA-hat structure, with the same first post-knee grid point (C=128), on both models. Adaptive routing shifts each model to a better operating point. Our strongest result is on the 70B 1P/5D topology, where PoA-hat drops 3.1x (66.4 to 21.5) in the saturated phase at a 13% throughput cost. On the 70B 1P/2D topology, PoA-hat drops 2.2x and time-to-first-token (TTFT) P99 drops 7.6x (see Section 8.5).
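To make the adaptive-routing idea concrete, the following is a minimal Python sketch of a two-mode controller that favors cache affinity below saturation and load balancing once saturation is detected. It is an illustration only, not the paper's controller and not the Dynamo API: the class and function names, the utilization signal, the hysteresis thresholds (0.90 and 0.75), and the routing weights are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class RouterWeights:
    """Hypothetical scoring weights for choosing a decode worker."""
    cache_affinity: float  # reward for expected KV-cache reuse
    load_balance: float    # penalty for current worker load


class SaturationAwareController:
    """Toy two-mode controller with hysteresis.

    All thresholds and weights are assumptions for illustration;
    the abstract does not specify them.
    """

    def __init__(self, enter: float = 0.90, leave: float = 0.75):
        if not leave < enter:
            raise ValueError("leave threshold must be below enter threshold")
        self.enter = enter      # utilization at which we declare saturation
        self.leave = leave      # utilization at which we clear saturation
        self.saturated = False

    def update(self, utilization: float) -> RouterWeights:
        """Consume one utilization sample in [0, 1] and return routing weights."""
        if not self.saturated and utilization >= self.enter:
            self.saturated = True
        elif self.saturated and utilization <= self.leave:
            self.saturated = False
        if self.saturated:
            # Saturated regime: prioritize congestion avoidance.
            return RouterWeights(cache_affinity=0.2, load_balance=0.8)
        # Unsaturated regime: exploit cache affinity.
        return RouterWeights(cache_affinity=0.8, load_balance=0.2)


def pick_worker(weights: RouterWeights, workers):
    """Pick the worker with the best score.

    `workers` is a list of (name, cache_hit_fraction, load_fraction) tuples,
    a made-up representation for this sketch.
    """
    def score(w):
        _, cache_hit, load = w
        return weights.cache_affinity * cache_hit - weights.load_balance * load
    return max(workers, key=score)[0]


if __name__ == "__main__":
    ctrl = SaturationAwareController()
    workers = [("warm-but-busy", 0.9, 0.95), ("cold-but-idle", 0.1, 0.2)]
    for util in (0.50, 0.95, 0.80, 0.70):
        weights = ctrl.update(util)
        print(util, ctrl.saturated, pick_worker(weights, workers))
```

The gap between the two thresholds is a design choice to avoid flapping between modes when utilization hovers near the knee; a real controller would presumably use a richer saturation signal than a single utilization number.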