LaRA: Layer-wise Representation Analysis for Detecting Data Contamination in Reinforcement Learning Post-Training

Abstract

Reinforcement learning (RL) post-training has been shown to improve reasoning in large language models (LLMs). However, data contamination in RL post-training remains largely unexplored, even though it can undermine the generalization and evaluation reliability of the training process itself. Existing detection methods rely primarily on output-level signals such as likelihood or entropy, which become unreliable for RL-trained models because RL shapes behavior through trajectory-level rewards rather than token likelihoods. We propose LaRA, a layer-wise representation analysis framework for detecting contamination in RL post-trained LLMs. LaRA introduces three complementary metrics that measure perturbation sensitivity, directional collapse, and local representation rigidity under controlled perturbations. We find that contamination produces progressive geometric deviations across layers, including amplified perturbation sensitivity, stronger directional collapse, and enhanced local rigidity. Based on these findings, we also develop a contamination detection protocol that aggregates representation-level deviations across layers and metrics. Experiments on RL-trained reasoning models show that our protocol outperforms existing output-level baselines for contamination detection.
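
To make the aggregation idea concrete, the following is a minimal sketch of how per-layer, per-metric deviations might be combined into a single score. It is not the protocol from the paper: the abstract does not specify the aggregation rule, so the z-score averaging, the function name `contamination_score`, the metric key `"sensitivity"`, and the input layout are all assumptions made for illustration. The sketch only assumes, following the abstract, that contamination pushes each metric upward relative to clean data.

```python
from statistics import mean, pstdev


def contamination_score(sample, reference, eps=1e-8):
    """Average z-score of one input's metrics against a clean reference set.

    sample:    dict mapping metric name -> list of per-layer values (one input).
    reference: dict mapping metric name -> list of per-layer lists, one per
               reference input assumed to be uncontaminated.
    All names and the averaging rule are hypothetical, not taken from LaRA.
    """
    z_scores = []
    for metric, layer_values in sample.items():
        ref_runs = reference[metric]
        for layer, value in enumerate(layer_values):
            # Collect this layer's value from every reference input.
            ref_layer = [run[layer] for run in ref_runs]
            mu = mean(ref_layer)
            sigma = pstdev(ref_layer)
            # Signed deviation: positive means "larger than clean data".
            z_scores.append((value - mu) / (sigma + eps))
    # Uniform average across all layers and metrics.
    return mean(z_scores)


# Toy reference set: two clean inputs, two layers, one hypothetical metric.
reference = {"sensitivity": [[1.0, 2.0], [3.0, 4.0]]}
typical = {"sensitivity": [2.0, 3.0]}   # equals the reference mean per layer
elevated = {"sensitivity": [4.0, 5.0]}  # two standard deviations above it

score_typical = contamination_score(typical, reference)
score_elevated = contamination_score(elevated, reference)
```

In this toy setup an input matching the reference mean scores near zero, while an input with elevated values at every layer scores higher. A real detector would need a threshold calibrated on held-out data, and a weighting that emphasizes deeper layers may be more appropriate given the progressive deviations the abstract reports.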