Detecting Data Contamination from Reinforcement Learning Post-training for Large Language Models
October 10, 2025
Authors: Yongding Tao, Tian Wang, Yihong Dong, Huanyu Liu, Kechi Zhang, Xiaolong Hu, Ge Li
cs.AI
Abstract
Data contamination poses a significant threat to the reliable evaluation of
Large Language Models (LLMs). This issue arises when benchmark samples
inadvertently appear in training sets, compromising the validity of reported
performance. While detection methods have been developed for the pre-training
and Supervised Fine-Tuning stages, a critical research gap exists for the
increasingly significant phase of Reinforcement Learning (RL) post-training. As
RL post-training becomes pivotal for advancing LLM reasoning, the absence of
specialized contamination detection methods in this paradigm presents a
critical vulnerability. To address this, we conduct the first systematic study
of contamination detection in the RL post-training scenario and propose Self-Critique.
Our method is motivated by a key observation: after the RL phase, the output
entropy distribution of LLMs tends to collapse into highly specific and sparse
modes. Self-Critique probes for the underlying policy collapse, i.e., the
model's convergence to a narrow reasoning path, which causes this entropy
reduction. To facilitate this research, we also introduce RL-MIA, a benchmark
constructed to simulate this specific contamination scenario. Extensive
experiments show that Self-Critique significantly outperforms baseline methods
across multiple models and contamination tasks, achieving an AUC improvement of
up to 30%. Existing methods perform close to random guessing on RL-phase
contamination, whereas our method makes detection possible.
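
The two quantities the abstract relies on, per-token output entropy and AUC, can be illustrated with a small sketch. This is not the Self-Critique method, whose details the abstract does not give. It is a generic entropy-based membership score, evaluated with a pairwise AUC, on hypothetical toy distributions. The function names and the toy data are assumptions made for illustration only.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(dists: Sequence[Sequence[float]]) -> float:
    """Average per-token entropy over one generated response."""
    return sum(token_entropy(d) for d in dists) / len(dists)


def auc(member_scores: List[float], nonmember_scores: List[float]) -> float:
    """Probability that a random member outscores a random non-member.

    Ties count as half a win; 0.5 corresponds to random guessing.
    """
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Hypothetical toy data: each response is a list of next-token distributions.
peaked = [[0.97, 0.01, 0.01, 0.01]] * 3  # collapsed, low-entropy output
mixed = [[0.7, 0.1, 0.1, 0.1]] * 3       # intermediate case
flat = [[0.25, 0.25, 0.25, 0.25]] * 3    # diffuse, high-entropy output

# Assumed scoring rule: negative mean entropy, so a higher score means
# "more likely to have been seen during RL post-training".
members = [-mean_entropy(peaked), -mean_entropy(mixed)]
nonmembers = [-mean_entropy(flat), -mean_entropy(mixed)]

print(auc(members, nonmembers))
```

An AUC near 0.5 on real data would match the abstract's description of existing methods as close to random guessing. The toy data here are deliberately separable and say nothing about how any real detector performs.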