DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference
February 25, 2026
Authors: Yongtong Wu, Shaoyuan Chen, Yinmin Zhong, Rilin Huang, Yixuan Tan, Wentao Zhang, Liyue Zhang, Shangyan Zhou, Yuxuan Liu, Shunfeng Zhou, Mingxing Zhang, Xin Jin, Panpan Huang
cs.AI
Abstract
The performance of multi-turn, agentic LLM inference is increasingly dominated by KV-Cache storage I/O rather than computation. In prevalent disaggregated architectures, loading the massive KV-Cache from external storage creates a fundamental imbalance: storage NICs on prefill engines become bandwidth-saturated, while those on decoding engines remain idle. This asymmetry severely constrains overall system throughput.
We present DualPath, an inference system that breaks this bottleneck by introducing dual-path KV-Cache loading. Beyond the traditional storage-to-prefill path, DualPath enables a novel storage-to-decode path, in which the KV-Cache is loaded into decoding engines and then efficiently transferred to prefill engines via RDMA over the compute network. DualPath combines this optimized data path, which inherently avoids network congestion and interference with latency-critical model-execution communication, with a global scheduler that dynamically balances load across prefill and decode engines.
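The path-selection logic described above can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the engine abstraction, the `choose_load_path` function, and the 0.9 saturation threshold are all assumptions introduced here for clarity.

```python
# Hypothetical sketch of DualPath-style path selection. All names and
# thresholds below are illustrative assumptions, not the real system.
from dataclasses import dataclass

@dataclass
class Engine:
    name: str
    storage_nic_util: float  # fraction of storage-NIC bandwidth in use


def choose_load_path(prefill: Engine, decode: Engine,
                     saturation: float = 0.9) -> str:
    """Pick a KV-Cache loading path for one request.

    Direct path: storage -> prefill engine.
    Detour path: storage -> decode engine, then RDMA over the
    compute network to the prefill engine.
    """
    # If the prefill engine's storage NIC is saturated while the decode
    # engine's is not, route the load through the otherwise-idle decode
    # engine's storage NIC and forward it over the compute network.
    if (prefill.storage_nic_util >= saturation
            and decode.storage_nic_util < saturation):
        return "storage->decode->RDMA->prefill"
    return "storage->prefill"


# Example: prefill NIC saturated (95%), decode NIC idle (10%).
print(choose_load_path(Engine("P0", 0.95), Engine("D0", 0.10)))
# -> storage->decode->RDMA->prefill
```

In the real system this decision would be made by the global scheduler using live bandwidth telemetry rather than a static threshold, so that load stays balanced across both engine pools.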
Our evaluation on three models with production agentic workloads demonstrates that DualPath improves offline inference throughput by up to 1.87× on our in-house inference system. It also improves online serving throughput by an average factor of 1.96× without violating SLOs.