Evaluation is All You Need: Strategic Overclaiming of LLM Reasoning Capabilities Through Evaluation Design

Abstract

Reasoning models represented by the Deepseek-R1-Distill series have been widely adopted by the open-source community because of their strong performance in mathematics, science, programming, and other domains. However, our study shows that their benchmark results fluctuate significantly with a range of factors: subtle differences in evaluation conditions can produce substantial variation in scores. Similar behavior is observed in other open-source reasoning models fine-tuned from the Deepseek-R1-Distill series, as well as in QwQ-32B, which makes their claimed performance improvements difficult to reproduce reliably. We therefore advocate a more rigorous paradigm for model performance evaluation and present our empirical assessment of the Deepseek-R1-Distill series models.

