

Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination

July 14, 2025
Authors: Mingqi Wu, Zhihao Zhang, Qiaole Dong, Zhiheng Xi, Jun Zhao, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Yanwei Fu, Qin Liu, Songyang Zhang, Qi Zhang
cs.AI

Abstract

The reasoning capabilities of large language models (LLMs) have been a longstanding focus of research. Recent works have further enhanced these capabilities using reinforcement learning (RL), with many new methods claiming significant improvements with minimal or no external supervision. Surprisingly, some studies even suggest that random or incorrect reward signals can enhance reasoning performance. However, these breakthroughs are mostly reported on the Qwen2.5 model family and evaluated on well-known benchmarks such as MATH-500, AMC, and AIME, while failing to achieve similar gains on other models like Llama, which warrants further investigation. Our analysis shows that although Qwen2.5 achieves strong mathematical reasoning performance, its pretraining on large-scale web corpora makes it vulnerable to data contamination in popular benchmarks. Consequently, results derived from these benchmarks may be unreliable. To address this, we introduce a generator that produces fully synthetic arithmetic problems of arbitrary length and difficulty, yielding a clean dataset we call RandomCalculation. Using these leakage-free datasets, we show that only accurate reward signals consistently improve performance, while noisy or incorrect signals do not. We advocate for evaluating RL methods on uncontaminated benchmarks and across diverse model families to ensure trustworthy conclusions.
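
The abstract describes a generator that emits fully synthetic arithmetic problems of arbitrary length and difficulty; the paper's exact construction is not reproduced here. The sketch below is a minimal, hypothetical illustration of that idea: the function name `random_calculation`, the `num_ops`/`max_value` difficulty knobs, and the `exact_match_reward` helper are assumptions for illustration, not the authors' implementation.

```python
import random


def random_calculation(num_ops: int, max_value: int = 100, seed=None):
    """Generate one synthetic arithmetic problem and its exact answer.

    num_ops controls difficulty (number of operators) and max_value bounds
    the operands; both knobs are assumptions about how a RandomCalculation-style
    generator might parameterize problem length and difficulty.
    """
    rng = random.Random(seed)
    ops = ["+", "-", "*"]
    expr = str(rng.randint(1, max_value))
    for _ in range(num_ops):
        expr += f" {rng.choice(ops)} {rng.randint(1, max_value)}"
    # The ground truth is computed at generation time, so a reward derived
    # from it is exact by construction and cannot be leaked from a benchmark.
    answer = eval(expr)  # safe here: expr contains only integers and + - *
    return expr, answer


def exact_match_reward(prediction: str, answer) -> float:
    """Accurate reward: 1.0 iff the model's final answer equals the ground truth."""
    try:
        return 1.0 if float(prediction.strip()) == float(answer) else 0.0
    except ValueError:
        return 0.0


if __name__ == "__main__":
    problem, answer = random_calculation(num_ops=5, seed=0)
    print("Problem:", problem, "| Answer:", answer)
    print("Reward for correct answer:", exact_match_reward(str(answer), answer))
```

Because the ground-truth answer is produced symbolically alongside each problem, the reward signal can be made exactly correct, and deliberately noisy or random rewards can be simulated by perturbing it, which is the kind of comparison the abstract reports.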