Insights from the Inverse: Reconstructing LLM Training Goals Through Inverse RL

Abstract

Large language models (LLMs) trained with Reinforcement Learning from Human Feedback (RLHF) have demonstrated remarkable capabilities, but their underlying reward functions and decision-making processes remain opaque. This paper introduces a novel approach to interpreting LLMs by applying inverse reinforcement learning (IRL) to recover their implicit reward functions. We conduct experiments on toxicity-aligned LLMs of varying sizes, extracting reward models that achieve up to 80.40% accuracy in predicting human preferences. Our analysis reveals key insights into the non-identifiability of reward functions, the relationship between model size and interpretability, and potential pitfalls in the RLHF process. We demonstrate that IRL-derived reward models can be used to fine-tune new LLMs, resulting in comparable or improved performance on toxicity benchmarks. This work provides a new lens for understanding and improving LLM alignment, with implications for the responsible development and deployment of these powerful systems.
