DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference

Abstract

The performance of multi-turn, agentic LLM inference is increasingly dominated by KV-Cache storage I/O rather than computation. In prevalent disaggregated architectures, loading the massive KV-Cache from external storage creates a fundamental imbalance: storage NICs on prefill engines become bandwidth-saturated, while those on decoding engines remain idle. This asymmetry severely constrains overall system throughput. We present DualPath, an inference system that breaks this bottleneck by introducing dual-path KV-Cache loading. Beyond the traditional storage-to-prefill path, DualPath enables a novel storage-to-decode path, in which the KV-Cache is loaded into decoding engines and then transferred to prefill engines via RDMA over the compute network. DualPath combines this optimized data path, which inherently avoids network congestion and interference with latency-critical model execution communications, with a global scheduler that dynamically balances load across prefill and decode engines. Our evaluation on three models with production agentic workloads shows that DualPath improves offline inference throughput by up to 1.87× on our in-house inference system. It also improves online serving throughput by 1.96× on average without violating SLOs.
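To make the dual-path idea concrete, the following is a minimal sketch of how a scheduler might choose between the two loading paths. It is not the scheduling algorithm used by DualPath, which the abstract does not specify. The `Engine` class, the `pick_path` function, the queued-bytes cost model, and all bandwidth figures are hypothetical, chosen only to illustrate why routing some loads through an idle decode-side storage NIC can finish sooner even after paying for the extra RDMA hop.

```python
from dataclasses import dataclass


@dataclass
class Engine:
    """Hypothetical model of one engine's storage NIC (illustration only)."""
    name: str
    nic_gbps: float          # assumed storage NIC bandwidth, in Gbit/s
    queued_gb: float = 0.0   # KV-Cache gigabytes already queued on this NIC

    def eta_seconds(self, size_gb: float) -> float:
        # Estimated time until a new load of size_gb finishes on this NIC,
        # assuming queued loads are served in order at full bandwidth.
        return (self.queued_gb + size_gb) * 8 / self.nic_gbps


def pick_path(size_gb, prefill, decoders, rdma_gbps=400.0):
    """Return (path_name, engine) with the smallest estimated load time.

    rdma_gbps is an assumed compute-network bandwidth for the extra hop
    from a decode engine to the prefill engine.
    """
    best = ("storage-to-prefill", prefill, prefill.eta_seconds(size_gb))
    for d in decoders:
        # Indirect path: storage -> decode engine, then RDMA to prefill.
        eta = d.eta_seconds(size_gb) + size_gb * 8 / rdma_gbps
        if eta < best[2]:
            best = ("storage-to-decode", d, eta)
    path, engine, _ = best
    engine.queued_gb += size_gb  # account for the newly scheduled load
    return path, engine


if __name__ == "__main__":
    p0 = Engine("prefill-0", nic_gbps=100.0)
    d0 = Engine("decode-0", nic_gbps=100.0)
    for _ in range(3):
        path, engine = pick_path(10.0, p0, [d0])
        print(path, engine.name)
```

Under these assumed numbers, the direct path is chosen while the prefill NIC is idle, and loads shift to the decode engine once the prefill NIC's queue makes the direct path slower than the indirect one. A production scheduler would also need to account for interference with model execution traffic on the compute network, which this sketch ignores.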
