Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models

Abstract

Recent reasoning language models have demonstrated remarkable performance on complex tasks, but their extended chain-of-thought reasoning increases inference overhead. Although quantization is widely adopted to reduce the inference cost of large language models, its impact on reasoning models remains understudied. We conduct the first systematic study of quantized reasoning models, evaluating the open-source DeepSeek-R1-Distilled Qwen and LLaMA families, ranging from 1.5B to 70B parameters, as well as QwQ-32B. Our investigation covers weight, KV cache, and activation quantization using state-of-the-art algorithms at varying bit-widths, with extensive evaluation on mathematical (AIME, MATH-500), scientific (GPQA), and programming (LiveCodeBench) reasoning benchmarks. We find that lossless quantization is achievable with W8A8 or W4A16, whereas lower bit-widths introduce significant accuracy risks. We further identify model size, model origin, and task difficulty as critical determinants of performance. Contrary to expectations, quantized models do not exhibit increased output lengths. In addition, strategically scaling model size or the number of reasoning steps can effectively improve performance. All quantized models and code will be open-sourced at https://github.com/ruikangliu/Quantized-Reasoning-Models.
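
To make the bit-width notation concrete: W4A16 denotes 4-bit weights with 16-bit activations, and W8A8 denotes 8-bit weights and 8-bit activations. The sketch below illustrates plain symmetric round-to-nearest quantization with a single per-tensor scale. It is an assumption-laden illustration only, not one of the state-of-the-art algorithms evaluated in the study; the function names and the sample weights are hypothetical.

```python
def quantize_symmetric(values, bits):
    """Round-to-nearest symmetric quantization with one scale per tensor.

    Illustrative only: the methods evaluated in the study are more elaborate.
    """
    qmax = 2 ** (bits - 1) - 1          # 7 for 4 bits, 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Clamp each code to the signed integer range [-qmax - 1, qmax].
    codes = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


def max_abs_error(values, bits):
    """Largest reconstruction error after a quantize/dequantize round trip."""
    codes, scale = quantize_symmetric(values, bits)
    restored = dequantize(codes, scale)
    return max(abs(v - r) for v, r in zip(values, restored))


# Hypothetical weight values, chosen only for illustration.
weights = [0.82, -0.41, 0.05, -1.27, 0.33, 0.90]
err_w8 = max_abs_error(weights, bits=8)
err_w4 = max_abs_error(weights, bits=4)
print(f"8-bit max error: {err_w8:.4f}")
print(f"4-bit max error: {err_w4:.4f}")
```

In this toy setting, the 4-bit round trip has a larger worst-case error than the 8-bit one, since fewer integer levels cover the same value range. This is a rough intuition for why lower bit-widths carry more accuracy risk; it says nothing about the specific results reported above.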
