Language Models Can Learn from Verbal Feedback Without Scalar Rewards

Abstract

LLMs are often trained with RL from human or AI feedback, yet such methods typically compress nuanced feedback into scalar rewards, discarding much of its richness and inducing scale imbalance. We propose treating verbal feedback as a conditioning signal. Inspired by language priors in text-to-image generation, which enable novel outputs from unseen prompts, we introduce the feedback-conditional policy (FCP). FCP learns directly from response-feedback pairs, approximating the feedback-conditional posterior through maximum likelihood training on offline data. We further develop an online bootstrapping stage where the policy generates under positive conditions and receives fresh feedback to refine itself. This reframes feedback-driven learning as conditional generation rather than reward optimization, offering a more expressive way for LLMs to directly learn from verbal feedback. Our code is available at https://github.com/sail-sg/feedback-conditional-policy.
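
To make the idea of maximum likelihood training on response-feedback pairs concrete, here is a minimal toy sketch. It is not the authors' implementation: the paper's policy is an LLM, whereas this example swaps in a simple count-based (tabular) estimate of p(response | prompt, feedback). The data and the names `offline_data` and `fit_fcp` are hypothetical and exist only for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical offline data: (prompt, response, verbal feedback) triples.
offline_data = [
    ("2+2?", "4", "correct and concise"),
    ("2+2?", "4", "correct and concise"),
    ("2+2?", "5", "wrong answer"),
    ("2+2?", "four, I think", "correct but hesitant"),
]


def fit_fcp(data):
    """Tabular maximum-likelihood estimate of p(response | prompt, feedback).

    For a categorical model, the MLE is the normalized count of each
    response within each (prompt, feedback) condition.
    """
    counts = defaultdict(Counter)
    for prompt, response, feedback in data:
        counts[(prompt, feedback)][response] += 1

    policy = {}
    for condition, counter in counts.items():
        total = sum(counter.values())
        policy[condition] = {resp: n / total for resp, n in counter.items()}
    return policy


policy = fit_fcp(offline_data)

# At inference time, condition on the feedback we would like to receive.
positive = policy[("2+2?", "correct and concise")]
print(positive)
```

The feedback string acts purely as a conditioning variable, so no scalar reward is ever computed. A lookup table like this cannot handle feedback strings it has never seen; the abstract's argument is that a language model's priors are what allow conditioning on new feedback.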
