Evaluation is All You Need: Strategic Overclaiming of LLM Reasoning Capabilities Through Evaluation Design
June 5, 2025
Authors: Lin Sun, Weihong Lin, Jinzhu Wu, Yongfu Zhu, Xiaoqi Jian, Guangxiang Zhao, Change Jia, Linglin Zhang, Sai-er Hu, Yuhan Wu, Xiangzheng Zhang
cs.AI
Abstract
Reasoning models represented by the Deepseek-R1-Distill series have been
widely adopted by the open-source community due to their strong performance in
mathematics, science, programming, and other domains. However, our study
reveals that their benchmark evaluation results are subject to significant
fluctuations caused by various factors. Subtle differences in evaluation
conditions can lead to substantial variations in results. Similar phenomena are
observed in other open-source reasoning models fine-tuned from the
Deepseek-R1-Distill series, as well as in the QwQ-32B model, making their
claimed performance improvements difficult to reproduce reliably. Therefore, we
advocate for the establishment of a more rigorous paradigm for model
performance evaluation and present our empirical assessments of the
Deepseek-R1-Distill series models.
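The fluctuation described above is easy to see in simulation: on a small benchmark, a single-run Pass@1 score behaves like a binomial sample and can swing by several points between evaluation runs. A minimal sketch, not taken from the paper; the 70% underlying accuracy and the 30-problem benchmark size (roughly AIME-sized) are illustrative assumptions:

```python
import random
import statistics

def pass_at_1(correct_probability: float, n_problems: int, rng: random.Random) -> float:
    """Simulate one benchmark run: each problem is solved independently with
    probability `correct_probability` (a stand-in for sampling-based decoding);
    return the run's Pass@1 accuracy."""
    solved = sum(rng.random() < correct_probability for _ in range(n_problems))
    return solved / n_problems

def evaluate(correct_probability: float, n_problems: int, n_runs: int, seed: int = 0):
    """Repeat the benchmark `n_runs` times; report the mean score and the
    run-to-run standard deviation."""
    rng = random.Random(seed)
    scores = [pass_at_1(correct_probability, n_problems, rng) for _ in range(n_runs)]
    return statistics.mean(scores), statistics.stdev(scores)

# On a 30-problem benchmark, the binomial standard deviation of a single run
# is sqrt(0.7 * 0.3 / 30) ≈ 0.08, i.e. roughly an 8-point swing is ordinary.
mean_score, run_std = evaluate(0.7, n_problems=30, n_runs=64)
print(f"mean Pass@1 = {mean_score:.3f}, run-to-run std = {run_std:.3f}")
```

The point of the sketch is the reporting convention it implies: scores on small benchmarks should be presented as a mean over many runs with a spread, since a single run cannot distinguish a real improvement from sampling noise of this magnitude.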