Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs

Abstract

LLM agents have recently emerged as a powerful paradigm for solving complex tasks through planning, tool use, memory retrieval, and multi-step interaction. However, these agentic workflows often introduce substantial input-side overhead, making the compute-intensive prefilling stage a key bottleneck in long-context, multi-turn inference. In this work, we propose Mix-Quant, a simple and effective phase-aware quantization framework for accelerating agentic inference. We first investigate FP4 quantization in agentic LLM workflows and observe that quantizing the entire inference process can cause significant performance degradation. In contrast, the prefilling stage exhibits substantial quantization redundancy and can therefore be quantized with minimal accuracy loss, even though it is the dominant source of computation. Based on this insight, we apply high-throughput NVFP4 quantization to the prefilling phase while preserving BF16 precision for decoding. By decoupling prefilling acceleration from decoding quality, Mix-Quant combines phase-aware algorithmic quantization with hardware-efficient NVFP4 execution to alleviate the inference bottleneck in LLM agents. Extensive experiments across long-context and agentic benchmarks demonstrate that Mix-Quant largely preserves task performance while delivering significant efficiency improvements, achieving up to a 3x speedup during prefilling.
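
To make the phase-aware idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It simulates 4-bit floating-point quantization in software and dispatches a linear layer by phase: quantized operands during prefilling, full-precision operands during decoding. Several things here are assumptions for illustration: the names `fake_quant_fp4` and `PhaseAwareLinear` are hypothetical, each block of 16 values shares one plain floating-point scale (a simplification of NVFP4's block scaling), `float32` stands in for BF16 because NumPy has no native bfloat16, and no real NVFP4 kernel is invoked, so the sketch shows numerics only, not speed.

```python
import numpy as np

# Magnitudes representable by a 4-bit E2M1 float (the sign is handled separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
BLOCK = 16  # values sharing one scale (assumed simplification of NVFP4 block scaling)


def fake_quant_fp4(x: np.ndarray) -> np.ndarray:
    """Quantize-dequantize x to the FP4 grid, one scale per block along the last axis."""
    if x.shape[-1] % BLOCK != 0:
        raise ValueError("last dimension must be a multiple of BLOCK")
    blocks = x.astype(np.float32).reshape(-1, BLOCK)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    # Map each block's largest magnitude onto the largest grid value (6.0).
    scale = np.where(amax > 0, amax / FP4_GRID[-1], 1.0)
    scaled = blocks / scale
    # Round every magnitude to the nearest representable grid value.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    dequant = np.sign(scaled) * FP4_GRID[idx] * scale
    return dequant.reshape(x.shape)


class PhaseAwareLinear:
    """Hypothetical linear layer: low-precision prefill, full-precision decode."""

    def __init__(self, weight: np.ndarray):
        self.weight = weight.astype(np.float32)       # shape (out_features, in_features)
        self.weight_q = fake_quant_fp4(self.weight)   # quantized once, reused for prefill

    def __call__(self, x: np.ndarray, phase: str) -> np.ndarray:
        if phase == "prefill":
            # Many prompt tokens at once: quantize activations and use quantized weights.
            return fake_quant_fp4(x) @ self.weight_q.T
        if phase == "decode":
            # One token at a time: keep the original precision.
            return x.astype(np.float32) @ self.weight.T
        raise ValueError(f"unknown phase: {phase!r}")


rng = np.random.default_rng(0)
layer = PhaseAwareLinear(rng.standard_normal((32, 64)))
prompt_states = rng.standard_normal((128, 64))   # long prompt processed in one pass
token_state = rng.standard_normal((1, 64))       # single newly generated token

prefill_out = layer(prompt_states, "prefill")
decode_out = layer(token_state, "decode")
```

The design choice mirrors the abstract's argument: prefilling processes the whole prompt in large matrix multiplications, so that is where low-precision arithmetic can pay off, while decoding keeps the original precision to protect generation quality. The sketch keeps both weight copies in memory for simplicity; how a real system stores and switches between them is not specified here.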