Ignore the KL Penalty! Boosting Exploration on Critical Tokens to Enhance RL Fine-Tuning

Abstract

The ability to achieve long-term goals is a key challenge in the current development of large language models (LLMs). To address this, pre-trained LLMs can be fine-tuned with reinforcement learning (RL) to explore solutions that optimize a given goal. However, exploration with LLMs is difficult, as a balance has to be struck between discovering new solutions and staying close enough to the pre-trained model, so as not to degrade basic capabilities. This is typically controlled with a Kullback-Leibler (KL) penalty. In this paper, we investigate the exploration dynamics of a small language model on a simple arithmetic task. We show how varying degrees of pre-training influence exploration and demonstrate the importance of "critical tokens", which have a dramatic impact on the final outcome. Consequently, we introduce a simple modification to the KL penalty that favors exploration on critical tokens, increasing the efficiency of the RL fine-tuning stage.
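To make the role of the KL penalty concrete, the following is a minimal sketch of a per-token KL penalty between a fine-tuned policy and a pre-trained reference model, with an optional per-token weight. The abstract does not state the exact form of the modification, so the weighting used here is a hypothetical illustration and not the authors' formula. It assumes that critical tokens tend to be positions where the reference model is uncertain, and it scales the penalty by a confidence score derived from the reference distribution's entropy. All function names and the `beta` coefficient are likewise assumptions made for this example.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_kl(policy_probs, ref_probs):
    """KL(policy || reference) at a single token position."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(policy_probs, ref_probs)
        if p > 0.0
    )


def confidence_weight(ref_probs):
    """Hypothetical weight in [0, 1] (an assumption, not from the paper):
    1 when the reference is fully certain, 0 when it is uniform.
    Assumes a vocabulary of at least two tokens."""
    entropy = -sum(q * math.log(q) for q in ref_probs if q > 0.0)
    max_entropy = math.log(len(ref_probs))
    return max(0.0, 1.0 - entropy / max_entropy)


def kl_penalty(policy_logits, ref_logits, beta=0.1, prioritized=False):
    """Sum of per-token KL terms, scaled by beta.

    With prioritized=True, each term is multiplied by the reference
    model's confidence, so positions where the reference is uncertain
    are penalized less and the policy is freer to explore there."""
    total = 0.0
    for pl, rl in zip(policy_logits, ref_logits):
        p, q = softmax(pl), softmax(rl)
        w = confidence_weight(q) if prioritized else 1.0
        total += w * token_kl(p, q)
    return beta * total


# Two token positions over a toy vocabulary of three tokens.
# Position 0: the reference is confident. Position 1: the reference is uniform.
ref_logits = [[5.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
policy_logits = [[5.0, 0.0, 0.0], [3.0, 0.0, 0.0]]

standard = kl_penalty(policy_logits, ref_logits)
weighted = kl_penalty(policy_logits, ref_logits, prioritized=True)
print(f"standard penalty: {standard:.4f}")
print(f"weighted penalty: {weighted:.4f}")
```

In this toy case the policy deviates from the reference only at the position where the reference is uniform, so the weighted variant charges almost nothing for that deviation while the standard penalty charges it in full.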
