Mix-Quant: Quantized Prefilling, High-Precision Decoding for Agentic LLMs

Abstract

LLM agents have recently emerged as a powerful paradigm for solving complex tasks through planning, tool use, memory retrieval, and multi-step interaction. However, these agentic workflows often introduce substantial input-side overhead, making the compute-intensive prefilling stage a key bottleneck in long-context, multi-turn inference. In this work, we propose Mix-Quant, a simple and effective phase-aware quantization framework for fast agentic inference. We first investigate FP4 quantization in agentic LLM workflows and observe that quantizing the entire inference process can cause significant performance degradation. In contrast, the prefilling stage exhibits substantial quantization redundancy and can therefore be quantized with minimal accuracy loss, even though it accounts for most of the computation. Based on this insight, we apply high-throughput NVFP4 quantization to the prefilling phase while preserving BF16 precision for decoding. By decoupling prefilling acceleration from decoding quality, Mix-Quant combines phase-aware algorithmic quantization with hardware-efficient NVFP4 execution to alleviate the inference bottleneck in LLM agents. Extensive experiments across long-context and agentic benchmarks demonstrate that Mix-Quant largely preserves task performance while delivering significant efficiency improvements, achieving up to a 3x speedup during prefilling.
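The phase-aware idea can be illustrated with a small numerical sketch. This is not the authors' implementation: it only simulates FP4 rounding in pure Python, and the names `fake_quant_fp4`, `linear`, and the `phase` argument are hypothetical. It assumes the E2M1 value grid and a 16-element scaling block as in NVFP4, keeps each block scale as an ordinary Python float (real NVFP4 stores scales in reduced precision), and uses Python floats as a stand-in for BF16. Quantizing both weights and activations during prefilling is also an assumption, since the abstract does not say which tensors are quantized.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (the sign is a separate bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # assumed block size: one shared scale per 16 elements


def fake_quant_fp4(values, block=BLOCK):
    """Simulate FP4 rounding: snap each value to the scaled E2M1 grid."""
    out = []
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        amax = max(abs(v) for v in chunk)
        # Simplification: the scale stays a full-precision Python float.
        scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
        for v in chunk:
            # Nearest grid magnitude; ties go to the smaller magnitude here.
            mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
            out.append(math.copysign(mag * scale, v))
    return out


def linear(x, weight, phase):
    """Toy matrix-vector product whose precision depends on the phase."""
    if phase == "prefill":
        # Prefill: simulated low-precision inputs and weights.
        x = fake_quant_fp4(x)
        weight = [fake_quant_fp4(row) for row in weight]
    elif phase != "decode":
        raise ValueError("phase must be 'prefill' or 'decode'")
    # Decode falls through with the original, unquantized values.
    return [sum(w * a for w, a in zip(row, x)) for row in weight]


if __name__ == "__main__":
    x = [6.0, 2.4, -1.1, 0.3]
    w = [[0.5, -0.25, 1.0, 2.0]]
    print("prefill:", linear(x, w, "prefill"))
    print("decode: ", linear(x, w, "decode"))
```

The sketch shows only the numerical effect of switching precision by phase. Any actual speedup would come from hardware FP4 matrix kernels, which this simulation does not exercise.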