Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination
July 14, 2025
Authors: Mingqi Wu, Zhihao Zhang, Qiaole Dong, Zhiheng Xi, Jun Zhao, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Yanwei Fu, Qin Liu, Songyang Zhang, Qi Zhang
cs.AI
Abstract
The reasoning capabilities of large language models (LLMs) have been a
longstanding focus of research. Recent works have further enhanced these
capabilities using reinforcement learning (RL), with many new methods claiming
significant improvements with minimal or no external supervision. Surprisingly,
some studies even suggest that random or incorrect reward signals can enhance
reasoning performance. However, these breakthroughs are mostly reported on the
Qwen2.5 model family and evaluated on well-known benchmarks such as MATH-500,
AMC, and AIME, while failing to achieve similar gains on other models like
Llama, which warrants further investigation. Our analysis shows that although
Qwen2.5 achieves strong mathematical reasoning performance, its pretraining on
large-scale web corpora makes it vulnerable to data contamination in popular
benchmarks. Consequently, results derived from these benchmarks may be
unreliable. To address this, we introduce a generator that produces fully
synthetic arithmetic problems of arbitrary length and difficulty, yielding a
clean dataset we call RandomCalculation. Using these leakage-free datasets, we
show that only accurate reward signals consistently improve performance, while
noisy or incorrect signals do not. We advocate for evaluating RL methods on
uncontaminated benchmarks and across diverse model families to ensure
trustworthy conclusions.
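The abstract describes a generator that produces fully synthetic arithmetic problems of controllable length and difficulty. The sketch below illustrates the general idea of such a leakage-free problem generator; the function name, parameters, and difficulty knob (number of operators) are illustrative assumptions, not the paper's actual RandomCalculation implementation.

```python
import random

def random_calculation(num_ops: int, max_value: int = 100, seed=None):
    """Generate one synthetic arithmetic problem.

    num_ops controls difficulty (more operators -> longer expression);
    returns (expression_string, exact_answer). Because problems are
    sampled at generation time, they cannot appear in any pretraining
    corpus, so the benchmark is contamination-free by construction.
    """
    rng = random.Random(seed)
    ops = ["+", "-", "*"]
    expr = str(rng.randint(1, max_value))
    for _ in range(num_ops):
        expr += f" {rng.choice(ops)} {rng.randint(1, max_value)}"
    # eval is safe here: expr contains only integers and +, -, *
    return expr, eval(expr)

if __name__ == "__main__":
    problem, answer = random_calculation(num_ops=5, seed=0)
    print(f"{problem} = {answer}")
```

An exact ground-truth answer comes for free with each problem, which is what makes it possible to compare accurate reward signals against noisy or incorrect ones in a controlled way.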