Model-Task Alignment Drives Distinct RL Outcomes
August 28, 2025
Authors: Haoze Wu, Cheng Wang, Wenshuo Zhao, Junxian He
cs.AI
Abstract
Recent advances in applying reinforcement learning (RL) to large language
models (LLMs) have led to substantial progress. In particular, a series of
remarkable yet often counterintuitive phenomena have been reported in LLMs,
exhibiting patterns not typically observed in traditional RL settings. For
example, notable claims include that a single training example can match the
performance achieved with an entire dataset, that the reward signal does not
need to be very accurate, and that training solely with negative samples can
match or even surpass sophisticated reward-based methods. However, the precise
conditions under which these observations hold - and, critically, when they
fail - remain unclear. In this work, we identify a key factor that
differentiates RL observations: whether the pretrained model already exhibits
strong Model-Task Alignment, as measured by pass@k accuracy on the evaluated
task. Through a systematic and comprehensive examination of a series of
counterintuitive claims, supported by rigorous experimental validation across
different model architectures and task domains, our findings show that while
standard RL training remains consistently robust across settings, many of these
counterintuitive results arise only when the model and task already exhibit
strong model-task alignment. In contrast, these techniques fail to drive
substantial learning in more challenging regimes, where standard RL methods
remain effective.
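
The abstract measures Model-Task Alignment by pass@k accuracy on the evaluated task. As an illustrative aside (not part of the paper), the sketch below shows the standard unbiased pass@k estimator commonly used for LLM evaluation (Chen et al., 2021), assuming n sampled generations per problem with c of them passing; the function and variable names here are chosen for illustration only.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k.

    Probability that at least one of k completions drawn without
    replacement from n sampled generations is correct, given that
    c of the n generations are correct.
    """
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Every possible k-subset contains at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: average pass@8 over a small set of problems,
# where each entry is (n_samples, n_correct) per problem.
results = [(16, 5), (16, 0), (16, 12)]
avg_pass_at_8 = sum(pass_at_k(n, c, 8) for n, c in results) / len(results)
print(f"pass@8 = {avg_pass_at_8:.3f}")
```

A high pass@k under this estimator indicates that the pretrained model can already solve the task within k attempts, which is the alignment condition the paper argues governs when the counterintuitive RL results appear.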