LaRA: Layer-wise Representation Analysis for Detecting Data Contamination in RL Post-Training

Abstract

Reinforcement learning (RL) post-training has been shown to improve reasoning in large language models (LLMs). However, data contamination in RL post-training has received little attention, even though it can undermine the generalization and evaluation reliability of the training process itself. Existing detection methods rely primarily on output-level signals such as likelihood or entropy. These signals become unreliable for RL-trained models, because RL shapes behavior through trajectory-level rewards rather than token likelihoods. We propose LaRA, a layer-wise representation analysis framework for detecting contamination in RL post-trained LLMs. LaRA introduces three complementary metrics that measure perturbation sensitivity, directional collapse, and local representation rigidity under controlled perturbations. We find that contamination produces progressive geometric deviations across layers, including amplified perturbation sensitivity, stronger directional collapse, and enhanced local rigidity. Building on these findings, we develop a contamination detection protocol that aggregates representation-level deviations across layers and metrics. Experiments on RL-trained reasoning models show that our protocol outperforms existing output-level baselines for contamination detection.
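To make the idea of aggregating layer-wise deviations concrete, the following is a minimal sketch and not the LaRA implementation. The abstract does not give the metric formulas, so every definition here is an assumption chosen for illustration: perturbation sensitivity is taken as the mean relative displacement of a hidden state, directional collapse as the mean pairwise cosine similarity of the displacement vectors, and the aggregate as an average z-score against a clean reference set. Local rigidity is omitted, and all function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def _norm(v):
    return math.sqrt(sum(x * x for x in v))


def _sub(a, b):
    return [x - y for x, y in zip(a, b)]


def _cos(a, b):
    na, nb = _norm(a), _norm(b)
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def perturbation_sensitivity(h, h_perturbed):
    # Assumed definition: mean of ||h' - h|| / ||h|| over perturbed copies.
    base = _norm(h)
    return mean(_norm(_sub(hp, h)) / base for hp in h_perturbed)


def directional_collapse(h, h_perturbed):
    # Assumed definition: mean pairwise cosine similarity between the
    # displacement vectors; needs at least two perturbed copies.
    deltas = [_sub(hp, h) for hp in h_perturbed]
    sims = [
        _cos(deltas[i], deltas[j])
        for i in range(len(deltas))
        for j in range(i + 1, len(deltas))
    ]
    return mean(sims)


def layer_scores(layers_h, layers_hp):
    # One (sensitivity, collapse) pair per layer for a single sample.
    return [
        (perturbation_sensitivity(h, hp), directional_collapse(h, hp))
        for h, hp in zip(layers_h, layers_hp)
    ]


def contamination_score(sample_scores, reference_scores):
    # Assumed aggregation: average z-score over layers and metrics,
    # measured against scores from samples believed to be clean.
    zs = []
    for layer, metrics in enumerate(sample_scores):
        for m, value in enumerate(metrics):
            ref = [r[layer][m] for r in reference_scores]
            sd = pstdev(ref)
            zs.append((value - mean(ref)) / sd if sd > 0 else 0.0)
    return mean(zs)
```

Under these assumptions, a larger score means the sample's representations deviate more from the clean reference in the direction the abstract associates with contamination (higher sensitivity, stronger collapse). In practice the hidden states would come from the model's layers for an input and its perturbed variants.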