Spurious Rewards Paradox: Mechanistically Understanding How RLVR Activates Memorization Shortcuts in LLMs
January 16, 2026
Authors: Lecheng Yan, Ruizhe Li, Guanhua Chen, Qing Li, Jiahui Geng, Wenxi Li, Vincent Wang, Chris Lee
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) is highly effective for enhancing LLM reasoning, yet recent evidence shows that models like Qwen 2.5 achieve significant gains even with spurious or incorrect rewards. We investigate this phenomenon and identify a "Perplexity Paradox": spurious RLVR triggers a divergence in which answer-token perplexity drops while prompt-side coherence degrades, suggesting the model is bypassing reasoning in favor of memorization. Using Path Patching, Logit Lens, JSD analysis, and Neural Differential Equations, we uncover a hidden Anchor-Adapter circuit that facilitates this shortcut. We localize a Functional Anchor in the middle layers (L18-20) that triggers the retrieval of memorized solutions, followed by Structural Adapters in later layers (L21+) that transform representations to accommodate the shortcut signal. Finally, we demonstrate that scaling specific MLP keys within this circuit enables bidirectional causal steering: artificially amplifying or suppressing contamination-driven performance. Our results provide a mechanistic roadmap for identifying and mitigating data contamination in RLVR-tuned models. Code is available at https://github.com/idwts/How-RLVR-Activates-Memorization-Shortcuts.
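The Perplexity Paradox diagnostic described above can be reproduced in spirit with a short script that scores prompt tokens and answer tokens separately under a causal LM. The sketch below is not the authors' code; the checkpoint name, the example question, and the prompt/answer split are illustrative assumptions.

```python
# Minimal sketch (not the paper's implementation): measure prompt-side vs.
# answer-token perplexity separately. Model name and example are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B"  # assumed checkpoint; swap in the RLVR-tuned model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

def split_perplexity(prompt: str, answer: str):
    """Return (prompt_ppl, answer_ppl) under the model's next-token distribution."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    answer_ids = tok(answer, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Shift so that position t predicts token t+1.
    log_probs = torch.log_softmax(logits[:, :-1].float(), dim=-1)
    targets = input_ids[:, 1:]
    token_nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    n_prompt = prompt_ids.shape[1] - 1  # number of predicted prompt positions
    prompt_ppl = token_nll[:, :n_prompt].mean().exp().item()
    answer_ppl = token_nll[:, n_prompt:].mean().exp().item()
    return prompt_ppl, answer_ppl

# The paradox: after spurious RLVR, answer_ppl drops while prompt_ppl rises.
print(split_perplexity("Q: What is 17 * 23? Let's reason step by step.\nA:", " 391"))
```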
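The bidirectional steering intervention can likewise be approximated with a forward pre-hook that rescales selected intermediate ("key") activations in a middle-layer MLP. Everything specific below (the layer index, the neuron indices, and the checkpoint) is a placeholder assumption, not the circuit reported in the paper.

```python
# Minimal sketch (assumptions: a Qwen2.5-style HF checkpoint exposing
# model.model.layers[i].mlp.down_proj; layer and key indices are placeholders,
# not the specific MLP keys identified in the paper).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B"           # assumed checkpoint
anchor_layer = 19                         # middle layer, within the reported L18-20 anchor band
key_indices = torch.tensor([101, 2048])   # hypothetical MLP key (intermediate-neuron) ids
scale = 0.0                               # <1 suppresses the shortcut keys, >1 amplifies them

tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

def scale_keys(module, args):
    """Pre-forward hook on down_proj: rescale selected intermediate activations
    (the 'keys' in the key-value view of the MLP) before they are read out."""
    (hidden,) = args
    hidden = hidden.clone()
    hidden[..., key_indices] *= scale
    return (hidden,)

handle = model.model.layers[anchor_layer].mlp.down_proj.register_forward_pre_hook(scale_keys)
try:
    inputs = tok("Q: What is 17 * 23?\nA:", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()
```

Setting scale below 1 suppresses the hypothesized shortcut keys, while values above 1 amplify them, mirroring the bidirectional steering described in the abstract.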