Ignore the KL Penalty! Boosting Exploration on Critical Tokens to Enhance RL Fine-Tuning
February 10, 2025
Authors: Jean Vassoyan, Nathanaël Beau, Roman Plaud
cs.AI
Abstract
The ability to achieve long-term goals is a key challenge in the current
development of large language models (LLMs). To address this, pre-trained LLMs
can be fine-tuned with reinforcement learning (RL) to explore solutions that
optimize a given goal. However, exploration with LLMs is difficult, as a
balance has to be struck between discovering new solutions and staying close
enough to the pre-trained model, so as not to degrade basic capabilities. This
is typically controlled with a Kullback-Leibler (KL) penalty. In this paper, we
investigate the exploration dynamics of a small language model on a simple
arithmetic task. We show how varying degrees of pre-training influence
exploration and demonstrate the importance of "critical tokens" which have a
dramatic impact on the final outcome. Consequently, we introduce a simple
modification to the KL penalty that favors exploration on critical tokens,
increasing the efficiency of the RL fine-tuning stage.
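The modification described above can be illustrated with a minimal sketch: a per-token KL penalty whose coefficient is reduced on tokens flagged as critical, so the policy is freer to explore at those positions while remaining anchored to the reference model elsewhere. The function name, the use of the simple log-ratio KL estimator, and the `critical_scale` parameter are illustrative assumptions, not the paper's exact formulation.

```python
def selective_kl_penalty(policy_logprobs, ref_logprobs, critical_mask,
                         beta=0.1, critical_scale=0.0):
    """Per-token KL penalty, down-weighted on critical tokens.

    policy_logprobs / ref_logprobs: log-probs of the sampled tokens
    under the fine-tuned policy and the frozen pre-trained reference.
    critical_mask: True where a token is 'critical' (identified upstream,
    e.g. by its outsized effect on the final outcome).
    beta: standard KL coefficient; critical_scale: multiplier applied to
    beta on critical tokens (0.0 disables the penalty there entirely).
    All names and defaults here are hypothetical choices for illustration.
    """
    penalties = []
    for lp, ref_lp, is_critical in zip(policy_logprobs, ref_logprobs,
                                       critical_mask):
        # Simple per-token estimate of log(pi/ref) for the sampled token.
        kl_term = lp - ref_lp
        coeff = beta * critical_scale if is_critical else beta
        penalties.append(coeff * kl_term)
    return penalties


# On a non-critical token the full penalty applies; on a critical token
# (with critical_scale=0.0) the penalty vanishes, freeing exploration.
pens = selective_kl_penalty(policy_logprobs=[-1.0, -2.0],
                            ref_logprobs=[-1.5, -1.0],
                            critical_mask=[False, True])
```

With `critical_scale=0.0` this reduces to dropping the KL term entirely on critical tokens, which matches the spirit of the title; intermediate values would trade off exploration against staying close to the pre-trained model.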