Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models

Abstract

Recent advances in reasoning language models have demonstrated remarkable performance on complex tasks, but their extended chain-of-thought reasoning process increases inference overhead. While quantization has been widely adopted to reduce the inference cost of large language models, its impact on reasoning models remains understudied. We conduct the first systematic study of quantized reasoning models, evaluating the open-source DeepSeek-R1-Distilled Qwen and LLaMA families, ranging from 1.5B to 70B parameters, and QwQ-32B. Our investigation covers weight, KV cache, and activation quantization using state-of-the-art algorithms at varying bit-widths, with extensive evaluation across mathematical (AIME, MATH-500), scientific (GPQA), and programming (LiveCodeBench) reasoning benchmarks. Our findings reveal that while lossless quantization can be achieved with W8A8 or W4A16 quantization, lower bit-widths introduce significant accuracy risks. We further identify model size, model origin, and task difficulty as critical determinants of performance. Contrary to expectations, quantized models do not exhibit increased output lengths. In addition, strategically scaling model size or the number of reasoning steps can effectively enhance performance. All quantized models and code will be open-sourced at https://github.com/ruikangliu/Quantized-Reasoning-Models.
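To make the bit-width notation concrete: W4A16 conventionally denotes 4-bit weights with 16-bit activations, and W8A8 denotes 8-bit weights and activations. The sketch below illustrates why lower bit-widths lose more information, using plain symmetric round-to-nearest quantization over a single weight group. This is an assumption chosen for illustration, not one of the state-of-the-art algorithms evaluated in the study, and the function names and sample weights are hypothetical.

```python
def quantize_rtn(weights, bits=4):
    """Symmetric round-to-nearest quantization of one weight group.

    Illustrative only: maps floats to signed integer codes that share
    a single scale factor. Not the method used in the study.
    """
    qmax = 2 ** (bits - 1) - 1                 # largest positive code, e.g. 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each weight to the nearest code and clamp to the signed range.
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


# Hypothetical weight group, used only to compare bit-widths.
weights = [0.12, -0.53, 0.98, -0.07, 0.31, -0.88, 0.44, 0.02]
for bits in (8, 4, 3):
    codes, scale = quantize_rtn(weights, bits)
    restored = dequantize(codes, scale)
    max_err = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit: max abs error = {max_err:.4f}")
```

In this scheme the rounding error per weight is bounded by half the scale, and the scale grows as the number of bits shrinks, which is one intuition for why accuracy risk rises at lower bit-widths.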
