Enable Language Models to Implicitly Learn Self-Improvement From Data

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in open-ended text generation tasks. However, the inherently open-ended nature of these tasks implies that there is always room for improvement in the quality of model responses. To address this challenge, various approaches have been proposed to enhance the performance of LLMs. There has been a growing focus on enabling LLMs to self-improve their response quality, thereby reducing the reliance on extensive human annotation efforts for collecting diverse and high-quality training data. Recently, prompting-based methods have been widely explored among self-improvement methods owing to their effectiveness, efficiency, and convenience. However, these methods usually require explicitly and thoroughly written rubrics as inputs to LLMs. It is expensive and challenging to manually derive and provide all necessary rubrics for a complex real-world improvement goal (e.g., being more helpful and less harmful). To this end, we propose an ImPlicit Self-ImprovemenT (PIT) framework that implicitly learns the improvement goal from human preference data. PIT requires only the preference data used to train reward models, with no extra human effort. Specifically, we reformulate the training objective of reinforcement learning from human feedback (RLHF): instead of maximizing response quality for a given input, we maximize the quality gap of the response conditioned on a reference response. In this way, PIT is implicitly trained with the improvement goal of better aligning with human preferences. Experiments on two real-world datasets and one synthetic dataset show that our method significantly outperforms prompting-based methods.
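To make the reformulated objective more concrete, the following is a minimal Python sketch contrasting a standard per-response reward with a gap-style reward measured against a reference response. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not say how the quality gap is computed, so the sketch assumes it is the difference between two scalar quality scores. The names toy_quality, standard_reward, gap_reward, and pick_best_improvement are hypothetical, and toy_quality is a placeholder heuristic standing in for a learned reward model.

```python
from typing import Callable, Sequence

# Hypothetical scorer type: maps (prompt, response) to a scalar quality score.
QualityFn = Callable[[str, str], float]


def toy_quality(prompt: str, response: str) -> float:
    """Placeholder for a learned reward model (assumption, not from the paper).

    Counts distinct lowercase words so the example runs without any model.
    """
    return float(len(set(response.lower().split())))


def standard_reward(prompt: str, response: str,
                    quality: QualityFn = toy_quality) -> float:
    """Standard RLHF-style signal: quality of the response on its own."""
    return quality(prompt, response)


def gap_reward(prompt: str, response: str, reference: str,
               quality: QualityFn = toy_quality) -> float:
    """Gap-style signal: how much the response improves on the reference.

    Assumption: the gap is modeled as a difference of two scalar scores.
    A positive value means the response scores higher than the reference.
    """
    return quality(prompt, response) - quality(prompt, reference)


def pick_best_improvement(prompt: str, reference: str,
                          candidates: Sequence[str],
                          quality: QualityFn = toy_quality) -> str:
    """Return the candidate with the largest gap over the reference."""
    return max(candidates,
               key=lambda c: gap_reward(prompt, c, reference, quality))


if __name__ == "__main__":
    prompt = "How do I sear a steak?"
    reference = "Use a pan"
    candidates = ["Use a pan", "Preheat a cast iron pan then sear each side"]
    best = pick_best_improvement(prompt, reference, candidates)
    print(best, gap_reward(prompt, best, reference))
```

Under this simple difference assumption, the reference term is constant for a fixed reference, so the candidate ranking matches the standard reward. What the gap formulation adds in this sketch is a signal whose sign says whether a response beats the reference, which fits the abstract's description of generating a response conditioned on a reference response.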

