Mix-Quant: Quantized Prefilling and Precise Decoding for Agentic LLMs

Abstract

LLM agents have recently emerged as a powerful paradigm for solving complex tasks through planning, tool use, memory retrieval, and multi-step interaction. However, these agentic workflows often introduce substantial input-side overhead, making the compute-intensive prefilling stage a key bottleneck in long-context, multi-turn inference. In this work, we propose Mix-Quant, a simple and effective phase-aware quantization framework for fast agentic inference. We first investigate FP4 quantization in agentic LLM workflows and observe that quantizing the entire inference process can significantly degrade task performance. In contrast, the prefilling stage, despite being the dominant source of computation, exhibits substantial quantization redundancy and can therefore be quantized with minimal accuracy loss. Based on this insight, we apply high-throughput NVFP4 quantization to the prefilling phase while preserving BF16 precision for decoding. By decoupling prefilling acceleration from decoding quality, Mix-Quant combines phase-aware algorithmic quantization with hardware-efficient NVFP4 execution to alleviate the inference bottleneck in LLM agents. Extensive experiments across long-context and agentic benchmarks show that Mix-Quant largely preserves task performance while achieving up to a 3x speedup during prefilling.
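
The phase-aware idea in the abstract can be illustrated with a toy sketch in pure Python. This is not the authors' implementation: the names `fake_quant_fp4` and `linear` are hypothetical, the quantization is simulated in floating point rather than run on FP4 hardware, and the per-block scale is kept as an ordinary float, whereas real NVFP4 stores block scales in reduced precision. The sketch only assumes the 4-bit E2M1 value grid and the 16-element block size that NVFP4 uses, and it shows a layer that rounds its inputs and weights during prefill but leaves them untouched during decode.

```python
# Toy sketch: fake-quantized prefill, full-precision decode (illustrative only).

# Magnitudes representable by a 4-bit E2M1 float (the sign is handled separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # NVFP4 applies one scale per block of 16 elements


def fake_quant_fp4(values, block=BLOCK):
    """Round each block to the E2M1 grid after mapping its max magnitude to 6.

    Simplification (assumption): the block scale stays a Python float.
    """
    out = []
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        amax = max(abs(v) for v in chunk)
        scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
        for v in chunk:
            # Nearest representable magnitude in the scaled domain.
            mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
            out.append((mag if v >= 0 else -mag) * scale)
    return out


def linear(x, weight, phase):
    """Hypothetical phase-aware layer computing y = W x.

    Inputs and weights are fake-quantized only when phase == "prefill";
    the "decode" path uses the values exactly as given.
    """
    if phase == "prefill":
        x = fake_quant_fp4(x)
        weight = [fake_quant_fp4(row) for row in weight]
    elif phase != "decode":
        raise ValueError(f"unknown phase: {phase}")
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


if __name__ == "__main__":
    x = [0.1 * i - 0.8 for i in range(16)]
    w = [[0.05 * ((i * j) % 7 - 3) for j in range(16)] for i in range(4)]
    exact = linear(x, w, "decode")
    approx = linear(x, w, "prefill")
    # Report how far the quantized prefill path drifts from the exact path.
    print(max(abs(a - e) for a, e in zip(approx, exact)))
```

In this sketch the phase is an explicit argument, so the same weights serve both paths and only the prefill call pays the rounding error. A real system would instead dispatch to different kernels or weight copies per phase; that choice is outside what the abstract specifies.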