Language Models Can Learn from Verbal Feedback Without Scalar Rewards

September 26, 2025
作者: Renjie Luo, Zichen Liu, Xiangyan Liu, Chao Du, Min Lin, Wenhu Chen, Wei Lu, Tianyu Pang
cs.AI

Abstract

LLMs are often trained with RL from human or AI feedback, yet such methods typically compress nuanced feedback into scalar rewards, discarding much of its richness and inducing scale imbalance. We propose treating verbal feedback as a conditioning signal. Inspired by language priors in text-to-image generation, which enable novel outputs from unseen prompts, we introduce the feedback-conditional policy (FCP). FCP learns directly from response-feedback pairs, approximating the feedback-conditional posterior through maximum likelihood training on offline data. We further develop an online bootstrapping stage where the policy generates under positive conditions and receives fresh feedback to refine itself. This reframes feedback-driven learning as conditional generation rather than reward optimization, offering a more expressive way for LLMs to directly learn from verbal feedback. Our code is available at https://github.com/sail-sg/feedback-conditional-policy.
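To make the offline stage concrete, the sketch below shows one way the described maximum-likelihood training on response-feedback pairs could look: the verbal feedback is prepended to the prompt as a conditioning context, and the model maximizes the likelihood of the response tokens only. The prompt template, variable names, and base model here are illustrative assumptions, not the authors' exact formulation; see the linked repository for the actual implementation.

```python
# Minimal sketch of feedback-conditional maximum-likelihood training (FCP, offline stage).
# Assumptions: a simple "Feedback / Prompt / Response" template and gpt2 as a stand-in model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder base model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Offline data: each record pairs a prompt/response with the verbal feedback it received.
offline_data = [
    {
        "prompt": "Explain why the sky is blue.",
        "response": "Sunlight scatters off air molecules; shorter (blue) wavelengths scatter most.",
        "feedback": "Correct and concise, but could mention Rayleigh scattering by name.",
    },
]

def fcp_step(record):
    # Condition on the verbal feedback by prepending it to the prompt (one possible template).
    context = f"Feedback: {record['feedback']}\nPrompt: {record['prompt']}\nResponse:"
    full = context + " " + record["response"] + tokenizer.eos_token
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(full, return_tensors="pt").input_ids

    # Maximize log-likelihood of the response tokens only: mask the conditioning context.
    labels = full_ids.clone()
    labels[:, : ctx_ids.shape[1]] = -100  # -100 is ignored by the cross-entropy loss

    loss = model(input_ids=full_ids, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

for record in offline_data:
    print("NLL:", fcp_step(record))
```

At inference, and in the online bootstrapping stage the abstract describes, one would condition on a positive feedback string (e.g. "Feedback: This response is excellent.") so that sampling from the feedback-conditional posterior yields improved responses, which then receive fresh feedback for further training.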