LaRA: A Layer-wise Representation Analysis Method for Detecting Data Contamination in Reinforcement Learning Post-Training

Abstract

Reinforcement learning (RL) post-training has been shown to improve reasoning in large language models (LLMs). However, data contamination in RL post-training remains largely unexplored, even though it can undermine both the generalization of the trained model and the reliability of its evaluation. Existing detection methods rely primarily on output-level signals such as likelihood or entropy, which become unreliable for RL-trained models because RL shapes behavior through trajectory-level rewards rather than token likelihoods. We propose LaRA, a layer-wise representation analysis framework for detecting contamination in RL post-trained LLMs. LaRA introduces three complementary metrics that measure perturbation sensitivity, directional collapse, and local representation rigidity under controlled perturbations. We find that contamination produces progressive geometric deviations across layers, including amplified perturbation sensitivity, stronger directional collapse, and enhanced local rigidity. Building on these findings, we develop a contamination detection protocol that aggregates representation-level deviations across layers and metrics. Experiments on RL-trained reasoning models show that our protocol outperforms existing output-level baselines for contamination detection.
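
To make the idea of layer-wise metrics and cross-layer aggregation concrete, the following is a minimal illustrative sketch, not LaRA's actual definitions, which the abstract does not give. It assumes that for each layer we already have the hidden state of an input and of several perturbed variants. The function names `layer_metrics` and `contamination_score`, the three formulas (relative displacement norm, mean pairwise cosine similarity, top-singular-value energy fraction), and the z-score averaging against known-clean reference samples are all hypothetical stand-ins chosen for illustration.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero


def layer_metrics(h_clean, h_pert):
    """Hypothetical per-layer metrics (illustrative stand-ins only).

    h_clean: (d,) hidden state of the original input at one layer.
    h_pert:  (k, d) hidden states of k >= 2 perturbed variants at that layer.
    Returns an array [sensitivity, collapse, rigidity].
    """
    delta = h_pert - h_clean                # (k, d) displacement vectors
    norms = np.linalg.norm(delta, axis=1)   # (k,) displacement magnitudes

    # Assumed sensitivity proxy: mean displacement relative to the clean norm.
    sensitivity = norms.mean() / (np.linalg.norm(h_clean) + EPS)

    # Assumed collapse proxy: mean pairwise cosine similarity of displacements.
    unit = delta / (norms[:, None] + EPS)
    cos = unit @ unit.T
    k = delta.shape[0]
    collapse = (cos.sum() - np.trace(cos)) / (k * (k - 1))

    # Assumed rigidity proxy: share of displacement energy in the top
    # singular direction (1.0 means all displacements lie on one line).
    s = np.linalg.svd(delta, compute_uv=False)
    rigidity = s[0] ** 2 / ((s ** 2).sum() + EPS)

    return np.array([sensitivity, collapse, rigidity])


def contamination_score(sample_metrics, ref_metrics):
    """Aggregate deviations across layers and metrics (assumed scheme).

    sample_metrics: (L, 3) metrics of one sample over L layers.
    ref_metrics:    (n_ref, L, 3) metrics of known-clean reference samples.
    Returns the mean z-score of the sample relative to the reference set.
    """
    mu = ref_metrics.mean(axis=0)
    sd = ref_metrics.std(axis=0) + EPS
    z = (sample_metrics - mu) / sd
    return float(z.mean())
```

Under this sketch, a higher score means that a sample's representations deviate more from the clean reference set; any decision threshold would have to be calibrated on held-out data.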