DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference

Abstract

The performance of multi-turn, agentic LLM inference is increasingly dominated by KV-Cache storage I/O rather than computation. In prevalent disaggregated architectures, loading the massive KV-Cache from external storage creates a fundamental imbalance: storage NICs on prefill engines become bandwidth-saturated, while those on decoding engines remain idle. This asymmetry severely constrains overall system throughput. We present DualPath, an inference system that breaks this bottleneck by introducing dual-path KV-Cache loading. Beyond the traditional storage-to-prefill path, DualPath enables a novel storage-to-decode path, in which the KV-Cache is loaded into decoding engines and then efficiently transferred to prefill engines via RDMA over the compute network. DualPath combines this optimized data path, which inherently avoids network congestion and does not interfere with latency-critical model execution communications, with a global scheduler that dynamically balances load across prefill and decode engines. Our evaluation on three models with production agentic workloads shows that DualPath improves offline inference throughput by up to 1.87× on our in-house inference system. It also improves online serving throughput by an average of 1.96× without violating SLOs.
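The dual-path idea can be illustrated with a toy model. The abstract does not specify the scheduler's policy, so the sketch below assumes a simple greedy rule: send each KV-Cache read to whichever storage NIC (the prefill engine's or a decode engine's) would finish it soonest. The names `Engine`, `pick_load_path`, `nic_gbps`, and `queued_bytes` are hypothetical, and the cost of the RDMA hop from decode to prefill is ignored for brevity.

```python
from dataclasses import dataclass


@dataclass
class Engine:
    name: str
    role: str            # "prefill" or "decode"
    nic_gbps: float      # storage NIC bandwidth in Gbit/s (assumed value)
    queued_bytes: int = 0

    def drain_time(self, extra_bytes: int = 0) -> float:
        """Seconds for the storage NIC to finish its queue plus extra_bytes."""
        return (self.queued_bytes + extra_bytes) * 8 / (self.nic_gbps * 1e9)


def pick_load_path(kv_bytes: int, prefill: Engine, decoders: list[Engine]):
    """Greedy choice (an assumption, not the paper's policy):
    read the KV-Cache through the storage NIC that finishes earliest."""
    candidates = [("storage->prefill", prefill)]
    candidates += [("storage->decode->rdma", d) for d in decoders]
    # min() keeps the first candidate on ties, so the direct path wins ties.
    path, engine = min(candidates, key=lambda c: c[1].drain_time(kv_bytes))
    engine.queued_bytes += kv_bytes
    return path, engine


if __name__ == "__main__":
    p = Engine("p0", "prefill", nic_gbps=50.0)
    d = Engine("d0", "decode", nic_gbps=50.0)
    for _ in range(4):
        path, engine = pick_load_path(10**9, p, [d])
        print(path, engine.name)
```

With one prefill and one decode engine of equal NIC bandwidth, this toy policy alternates between the two paths, so the otherwise idle decode-side NIC carries half of the storage reads.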
