

Evaluation is All You Need: Strategic Overclaiming of LLM Reasoning Capabilities Through Evaluation Design

June 5, 2025
Authors: Lin Sun, Weihong Lin, Jinzhu Wu, Yongfu Zhu, Xiaoqi Jian, Guangxiang Zhao, Change Jia, Linglin Zhang, Sai-er Hu, Yuhan Wu, Xiangzheng Zhang
cs.AI

Abstract

Reasoning models represented by the Deepseek-R1-Distill series have been widely adopted by the open-source community due to their strong performance in mathematics, science, programming, and other domains. However, our study reveals that their benchmark results are subject to significant fluctuations caused by a variety of factors: subtle differences in evaluation conditions can lead to substantial variations in reported scores. Similar phenomena are observed in other open-source reasoning models fine-tuned from the Deepseek-R1-Distill series, as well as in the QwQ-32B model, making their claimed performance improvements difficult to reproduce reliably. We therefore advocate the establishment of a more rigorous paradigm for model performance evaluation and present our empirical assessments of the Deepseek-R1-Distill series under it.
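The abstract argues for variance-aware evaluation rather than single-run scores. The sketch below (not from the paper) illustrates one such protocol: repeat the same benchmark under several random seeds and sampling temperatures and report mean and standard deviation of accuracy. The function `evaluate_once`, the benchmark items, and all numbers are hypothetical placeholders standing in for a real model call and grader; only the reporting logic is the point.

```python
# Minimal sketch of variance-aware benchmark reporting, assuming a hypothetical
# evaluate_once(...) that runs the model once and returns accuracy in [0, 1].
import random
import statistics


def evaluate_once(benchmark: list, seed: int, temperature: float) -> float:
    """Hypothetical single evaluation run.

    In practice this would sample model completions with the given seed and
    temperature and grade them against reference answers. Here it only
    simulates run-to-run fluctuation so the reporting code is runnable.
    """
    rng = random.Random(seed)
    base = 0.70                                       # assumed "true" accuracy
    noise = rng.gauss(0, 0.02 + 0.03 * temperature)   # more spread at higher T
    return min(1.0, max(0.0, base + noise))


def report(benchmark: list, seeds: list, temperatures: list) -> None:
    """Report mean +/- std accuracy per temperature over repeated seeded runs."""
    for temp in temperatures:
        scores = [evaluate_once(benchmark, seed, temp) for seed in seeds]
        mean = statistics.mean(scores)
        std = statistics.stdev(scores)
        print(f"T={temp:.1f}: accuracy {mean:.3f} +/- {std:.3f} "
              f"over {len(seeds)} seeds")


if __name__ == "__main__":
    dummy_benchmark = [f"problem-{i}" for i in range(30)]  # placeholder items
    report(dummy_benchmark, seeds=[0, 1, 2, 3, 4],
           temperatures=[0.0, 0.6, 1.0])
```

Reporting a mean with its spread across seeds and decoding settings, rather than the best single run, is the kind of stricter protocol the authors call for when comparing distilled reasoning models.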