The Price of Anarchy in Disaggregated Inference

Abstract

Disaggregated inference architectures physically separate the prefill and decode phases onto distinct GPU pools, creating competing "agents" that share a fixed hardware budget. We provide, to our knowledge, the first formal game-theoretic analysis of this architecture, using NVIDIA Dynamo as a concrete case study. We model disaggregated serving as three coupled games: a two-player resource game between the prefill and decode pools, a selfish caching game over the hierarchical KV cache, and a congestion game with positive externalities for request routing. We validate the latter two empirically; the prefill/decode (P/D) resource game is treated analytically (Section 9.2). We characterize how GPU saturation induces regime transitions that change the games' payoff structure: below the saturation threshold, selfish behavior has a bounded Price of Anarchy (PoA); at saturation, superlinear latency and cache externalities drive our empirical estimator PoA-hat (defined in Section 6.4) upward. Based on this analysis, we design an adaptive controller that detects saturation transitions in real time and adjusts routing parameters accordingly, shifting from cache-affinity exploitation to load-balanced congestion avoidance. We instantiate the framework on a 3-node NVIDIA B200 cluster running Dynamo with two models, Nemotron-4-340B (TP=8, full-node workers with cross-InfiniBand KV transfers) and Llama-3.1-70B (TP=4). Both models show the same three-regime PoA-hat structure, with the same first post-knee grid point (C=128). Adaptive routing shifts each model to a better operating point. Our strongest result is on the 70B 1P/5D topology, where PoA-hat drops 3.1x (from 66.4 to 21.5) in the saturated phase at a 13% throughput cost. On the 70B 1P/2D topology, PoA-hat drops 2.2x and TTFT P99 drops 7.6x (see Section 8.5).