Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs

Abstract

LLM agents have recently emerged as a powerful paradigm for solving complex tasks through planning, tool use, memory retrieval, and multi-step interaction. However, these agentic workflows often introduce substantial input-side overhead, making the compute-intensive prefilling stage a key bottleneck in long-context, multi-turn inference. In this work, we propose Mix-Quant, a simple and effective phase-aware quantization framework for fast agentic inference. We first investigate FP4 quantization in agentic LLM workflows and observe that quantizing the entire inference process can incur significant performance degradation. In contrast, the prefilling stage exhibits substantial quantization redundancy and can therefore be quantized with minimal accuracy loss, despite being the dominant source of computation. Based on this insight, we apply high-throughput NVFP4 quantization to the prefilling phase while preserving BF16 precision for decoding. By decoupling prefilling acceleration from decoding quality, Mix-Quant combines phase-aware algorithmic quantization with hardware-efficient NVFP4 execution to alleviate the inference bottleneck in LLM agents. Extensive experiments across long-context and agentic benchmarks demonstrate that Mix-Quant largely preserves task performance while delivering significant efficiency improvements, achieving up to a 3x speedup during prefilling.
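The phase split described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the names `fake_quant_fp4`, `linear`, and `phase` are hypothetical, the block size of 16 and the choice to quantize both activations and weights during prefilling are assumptions, and float64 arithmetic stands in for BF16. The sketch only simulates rounding to the 4-bit E2M1 value grid with one scale per block; it does not model low-precision scale storage or the hardware kernels that produce the actual speedup.

```python
import numpy as np

# Magnitudes representable by a 4-bit E2M1 float (sign is handled separately).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 16  # assumed number of elements sharing one scale


def fake_quant_fp4(x: np.ndarray, block: int = BLOCK) -> np.ndarray:
    """Round x to a block-scaled FP4 grid and return the dequantized values.

    Blocks run along the last axis, whose length must be a multiple of `block`.
    """
    if x.shape[-1] % block != 0:
        raise ValueError("last dimension must be a multiple of the block size")
    flat = np.ascontiguousarray(x, dtype=np.float64).reshape(-1, block)
    # One scale per block maps the block's largest magnitude onto the grid maximum.
    scale = np.abs(flat).max(axis=1, keepdims=True) / E2M1_GRID[-1]
    scale[scale == 0] = 1.0  # all-zero blocks: any scale works, avoid dividing by zero
    scaled = flat / scale
    # Index of the nearest grid magnitude for every element.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    return (np.sign(scaled) * E2M1_GRID[idx] * scale).reshape(x.shape)


def linear(x: np.ndarray, w: np.ndarray, phase: str) -> np.ndarray:
    """Hypothetical phase-aware linear layer; w has shape (out_features, in_features)."""
    if phase == "prefill":
        # Prefill: simulate low-precision operands (assumption: activations and weights).
        return fake_quant_fp4(x) @ fake_quant_fp4(w).T
    # Decode: keep full precision (stands in for BF16 here).
    return x @ w.T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prompt = rng.standard_normal((128, 64))  # many prompt tokens processed at once
    token = rng.standard_normal((1, 64))     # one token per decoding step
    weight = rng.standard_normal((32, 64))
    exact = prompt @ weight.T
    approx = linear(prompt, weight, "prefill")
    print("prefill relative error:", np.linalg.norm(approx - exact) / np.linalg.norm(exact))
    print("decode output shape:", linear(token, weight, "decode").shape)
```

In this toy setup only the prefill path sees quantization error, while every decoding step uses the unquantized operands, which mirrors the decoupling of prefilling acceleration from decoding quality.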