

DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference

February 25, 2026
Authors: Yongtong Wu, Shaoyuan Chen, Yinmin Zhong, Rilin Huang, Yixuan Tan, Wentao Zhang, Liyue Zhang, Shangyan Zhou, Yuxuan Liu, Shunfeng Zhou, Mingxing Zhang, Xin Jin, Panpan Huang
cs.AI

Abstract

The performance of multi-turn, agentic LLM inference is increasingly dominated by KV-Cache storage I/O rather than computation. In prevalent disaggregated architectures, loading the massive KV-Cache from external storage creates a fundamental imbalance: storage NICs on prefill engines become bandwidth-saturated, while those on decoding engines remain idle. This asymmetry severely constrains overall system throughput. We present DualPath, an inference system that breaks this bottleneck by introducing dual-path KV-Cache loading. Beyond the traditional storage-to-prefill path, DualPath enables a novel storage-to-decode path, in which the KV-Cache is loaded into decoding engines and then efficiently transferred to prefill engines via RDMA over the compute network. DualPath combines this optimized data path -- which inherently avoids network congestion and interference with latency-critical model-execution communication -- with a global scheduler that dynamically balances load across prefill and decode engines. Our evaluation on three models with production agentic workloads demonstrates that DualPath improves offline inference throughput by up to 1.87× on our in-house inference system. It also improves online serving throughput by 1.96× on average without violating SLOs.
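To make the load-balancing idea in the abstract concrete, here is a minimal sketch of how a global scheduler might pick between the two loading paths; all names, the cost model, and the RDMA-hop penalty are illustrative assumptions, not the paper's actual interface.

```python
# Hypothetical sketch of DualPath's routing decision: send each
# KV-Cache load over whichever path has the less-saturated storage NIC.
from dataclasses import dataclass


@dataclass
class EngineStats:
    # Fraction of the engine's storage-NIC bandwidth in use (0.0 - 1.0).
    storage_nic_util: float


def choose_load_path(prefill: EngineStats, decode: EngineStats,
                     rdma_overhead: float = 0.1) -> str:
    """Pick the cheaper path for one KV-Cache load.

    storage->prefill: cost is the prefill engine's storage-NIC utilization.
    storage->decode:  cost is the decode engine's storage-NIC utilization
                      plus a small penalty for the extra RDMA hop back to
                      the prefill engine over the compute network.
    """
    direct_cost = prefill.storage_nic_util
    indirect_cost = decode.storage_nic_util + rdma_overhead
    return "storage->prefill" if direct_cost <= indirect_cost else "storage->decode"


# When the prefill storage NIC is saturated but the decode one is idle,
# the load is offloaded to the storage->decode path.
print(choose_load_path(EngineStats(0.95), EngineStats(0.10)))  # storage->decode
print(choose_load_path(EngineStats(0.20), EngineStats(0.60)))  # storage->prefill
```

The `rdma_overhead` term encodes the abstract's claim that the detour is cheap: the compute-network RDMA hop adds little cost, so the indirect path wins whenever the prefill storage NIC is meaningfully more loaded than the decode one.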