
Spurious Rewards Paradox: Mechanistically Understanding How RLVR Activates Memorization Shortcuts in LLMs

January 16, 2026
Authors: Lecheng Yan, Ruizhe Li, Guanhua Chen, Qing Li, Jiahui Geng, Wenxi Li, Vincent Wang, Chris Lee
cs.AI

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) is highly effective for enhancing LLM reasoning, yet recent evidence shows models like Qwen 2.5 achieve significant gains even with spurious or incorrect rewards. We investigate this phenomenon and identify a "Perplexity Paradox": spurious RLVR triggers a divergence where answer-token perplexity drops while prompt-side coherence degrades, suggesting the model is bypassing reasoning in favor of memorization. Using Path Patching, Logit Lens, JSD analysis, and Neural Differential Equations, we uncover a hidden Anchor-Adapter circuit that facilitates this shortcut. We localize a Functional Anchor in the middle layers (L18-20) that triggers the retrieval of memorized solutions, followed by Structural Adapters in later layers (L21+) that transform representations to accommodate the shortcut signal. Finally, we demonstrate that scaling specific MLP keys within this circuit allows for bidirectional causal steering: artificially amplifying or suppressing contamination-driven performance. Our results provide a mechanistic roadmap for identifying and mitigating data contamination in RLVR-tuned models. Code is available at https://github.com/idwts/How-RLVR-Activates-Memorization-Shortcuts.
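
To make the "Perplexity Paradox" concrete, the sketch below (a minimal illustration, not the paper's evaluation code) splits per-token perplexity into a prompt-side and an answer-side average so the two can be compared between a base and an RLVR-tuned checkpoint. The model name, prompt, and answer are placeholder assumptions; prompt-side perplexity is used only as a rough stand-in for the paper's coherence measure.

```python
# Minimal sketch: separate prompt-side vs. answer-token perplexity.
# Model name, prompt, and answer are placeholders, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def split_perplexities(model, tokenizer, prompt: str, answer: str):
    """Return (prompt_ppl, answer_ppl) for a single prompt/answer pair."""
    device = next(model.parameters()).device
    n_prompt = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    # Note: tokenizing the concatenation may split the prompt/answer boundary
    # slightly differently than tokenizing the two strings separately.
    full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids.to(device)

    with torch.no_grad():
        logits = model(full_ids).logits            # [1, seq_len, vocab]

    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]                      # token t is predicted from tokens < t
    token_nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    prompt_nll = token_nll[:, : n_prompt - 1]      # losses on prompt tokens
    answer_nll = token_nll[:, n_prompt - 1 :]      # losses on answer tokens
    return prompt_nll.mean().exp().item(), answer_nll.mean().exp().item()

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B").eval()  # swap in the RLVR-tuned checkpoint to compare
p_ppl, a_ppl = split_perplexities(model, tok, "Q: 12 * 7 = ?\nA: ", "84")
print(f"prompt ppl = {p_ppl:.2f}, answer ppl = {a_ppl:.2f}")
```

Under this setup, the paradox corresponds to answer_ppl dropping after spurious RLVR while the prompt-side average stays flat or worsens.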
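The bidirectional steering claim can likewise be sketched with a forward hook. The snippet below assumes a Qwen2-style MLP (gate/up/down projections) and rescales one neuron's gated activation, which scales the contribution of the matching down_proj column (the MLP "value" selected by that key). The layer index, neuron index, and scale factor are hypothetical placeholders, not the keys the paper identifies through its Anchor-Adapter analysis.

```python
# Hypothetical sketch of bidirectional steering via MLP-key scaling.
# LAYER, NEURON, and SCALE are placeholders; SCALE > 1 amplifies the
# putative shortcut, SCALE < 1 (or 0) suppresses it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

LAYER, NEURON, SCALE = 19, 1234, 0.0

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B").eval()

def scale_key(module, inputs, output):
    # `output` is act_fn(gate_proj(x)); scaling one coordinate here scales,
    # after the elementwise product with up_proj(x), the coefficient on the
    # matching down_proj column, i.e. that key's value-vector contribution.
    output[..., NEURON] = output[..., NEURON] * SCALE
    return output

handle = model.model.layers[LAYER].mlp.act_fn.register_forward_hook(scale_key)

inputs = tok("Q: 12 * 7 = ?\nA:", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=8, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # restore the unmodified model
```

Sweeping SCALE above and below 1 while tracking benchmark accuracy is the kind of intervention the steering experiments describe; which keys to scale is determined by the paper's circuit localization, not by this sketch.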