Model-Task Alignment Drives Distinct RL Outcomes
August 28, 2025
Authors: Haoze Wu, Cheng Wang, Wenshuo Zhao, Junxian He
cs.AI
Abstract
Recent advances in applying reinforcement learning (RL) to large language
models (LLMs) have led to substantial progress. In particular, a series of
remarkable yet often counterintuitive phenomena have been reported in LLMs,
exhibiting patterns not typically observed in traditional RL settings. For
example, notable claims include that a single training example can match the
performance achieved with an entire dataset, that the reward signal does not
need to be very accurate, and that training solely with negative samples can
match or even surpass sophisticated reward-based methods. However, the precise
conditions under which these observations hold - and, critically, when they
fail - remain unclear. In this work, we identify a key factor that
differentiates RL observations: whether the pretrained model already exhibits
strong Model-Task Alignment, as measured by pass@k accuracy on the evaluated
task. Through a systematic and comprehensive examination of a series of
counterintuitive claims, supported by rigorous experimental validation across
different model architectures and task domains, our findings show that while
standard RL training remains consistently robust across settings, many of these
counterintuitive results arise only when the model and task already exhibit
strong model-task alignment. In contrast, these techniques fail to drive
substantial learning in more challenging regimes, where standard RL methods
remain effective.
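
The abstract measures Model-Task Alignment by pass@k accuracy on the evaluated task. As an illustrative aside (not part of the paper), the sketch below shows the standard unbiased pass@k estimator commonly used for LLM evaluation (Chen et al., 2021), assuming n sampled generations per problem with c of them passing; the function and variable names here are chosen for illustration only.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k.

    Probability that at least one of k completions drawn without
    replacement from n sampled generations is correct, given that
    c of the n generations are correct.
    """
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Every possible k-subset contains at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: average pass@8 over a small set of problems,
# where each entry is (n_samples, n_correct) per problem.
results = [(16, 5), (16, 0), (16, 12)]
avg_pass_at_8 = sum(pass_at_k(n, c, 8) for n, c in results) / len(results)
print(f"pass@8 = {avg_pass_at_8:.3f}")
```

A high pass@k under this estimator indicates that the pretrained model can already solve the task within k attempts, which is the alignment condition the paper argues governs when the counterintuitive RL results appear.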