Evaluation is All You Need: Strategic Overclaiming of LLM Reasoning Capabilities Through Evaluation Design
June 5, 2025
Authors: Lin Sun, Weihong Lin, Jinzhu Wu, Yongfu Zhu, Xiaoqi Jian, Guangxiang Zhao, Change Jia, Linglin Zhang, Sai-er Hu, Yuhan Wu, Xiangzheng Zhang
cs.AI
Abstract
Reasoning models represented by the Deepseek-R1-Distill series have been
widely adopted by the open-source community due to their strong performance in
mathematics, science, programming, and other domains. However, our study
reveals that their benchmark evaluation results are subject to significant
fluctuations caused by various factors. Subtle differences in evaluation
conditions can lead to substantial variations in results. Similar phenomena are
observed in other open-source reasoning models fine-tuned from the
Deepseek-R1-Distill series, as well as in the QwQ-32B model, making their
claimed performance improvements difficult to reproduce reliably. Therefore, we
advocate for the establishment of a more rigorous paradigm for model
performance evaluation and present our empirical assessments of the
Deepseek-R1-Distill series models.
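The fluctuation described above is easy to see in simulation: on a small benchmark, a single-run Pass@1 score behaves like a binomial sample and can swing by several points between evaluation runs. A minimal sketch, not taken from the paper; the 70% underlying accuracy and the 30-problem benchmark size (roughly AIME-sized) are illustrative assumptions:

```python
import random
import statistics

def pass_at_1(correct_probability: float, n_problems: int, rng: random.Random) -> float:
    """Simulate one benchmark run: each problem is solved independently with
    probability `correct_probability` (a stand-in for sampling-based decoding);
    return the run's Pass@1 accuracy."""
    solved = sum(rng.random() < correct_probability for _ in range(n_problems))
    return solved / n_problems

def evaluate(correct_probability: float, n_problems: int, n_runs: int, seed: int = 0):
    """Repeat the benchmark `n_runs` times; report the mean score and the
    run-to-run standard deviation."""
    rng = random.Random(seed)
    scores = [pass_at_1(correct_probability, n_problems, rng) for _ in range(n_runs)]
    return statistics.mean(scores), statistics.stdev(scores)

# On a 30-problem benchmark, the binomial standard deviation of a single run
# is sqrt(0.7 * 0.3 / 30) ≈ 0.08, i.e. roughly an 8-point swing is ordinary.
mean_score, run_std = evaluate(0.7, n_problems=30, n_runs=64)
print(f"mean Pass@1 = {mean_score:.3f}, run-to-run std = {run_std:.3f}")
```

The point of the sketch is the reporting convention it implies: scores on small benchmarks should be presented as a mean over many runs with a spread, since a single run cannot distinguish a real improvement from sampling noise of this magnitude.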