Language Models Can Learn from Verbal Feedback Without Scalar Rewards

Abstract

LLMs are often trained with RL from human or AI feedback, yet such methods typically compress nuanced feedback into scalar rewards, discarding much of its richness and inducing scale imbalance. We propose treating verbal feedback as a conditioning signal. Inspired by language priors in text-to-image generation, which enable novel outputs from unseen prompts, we introduce the feedback-conditional policy (FCP). FCP learns directly from response-feedback pairs, approximating the feedback-conditional posterior through maximum likelihood training on offline data. We further develop an online bootstrapping stage where the policy generates under positive conditions and receives fresh feedback to refine itself. This reframes feedback-driven learning as conditional generation rather than reward optimization, offering a more expressive way for LLMs to directly learn from verbal feedback. Our code is available at https://github.com/sail-sg/feedback-conditional-policy.
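
As a rough illustration of what "approximating the feedback-conditional posterior through maximum likelihood training on offline data" can mean, the following toy sketch uses a tabular setting in place of an LLM. Everything in it is an assumption made for illustration: the three responses, the two feedback strings, the probabilities, and the function names are hypothetical and do not come from the paper or its repository. In this tabular case the maximum likelihood estimate of p(response | feedback) is simply the empirical conditional frequency, which should approach the Bayes posterior proportional to ref(response) * p(feedback | response).

```python
import random
from collections import Counter, defaultdict

# Hypothetical reference policy over three candidate responses to one prompt.
REF_POLICY = {"resp_a": 0.5, "resp_b": 0.3, "resp_c": 0.2}

# Hypothetical feedback model p(feedback | response). The two strings stand in
# for free-form verbal critiques.
FEEDBACK_MODEL = {
    "resp_a": {"correct and concise": 0.1, "wrong answer": 0.9},
    "resp_b": {"correct and concise": 0.6, "wrong answer": 0.4},
    "resp_c": {"correct and concise": 0.9, "wrong answer": 0.1},
}


def exact_posterior(feedback):
    """Bayes' rule: p(response | feedback) ∝ ref(response) * p(feedback | response)."""
    unnorm = {r: REF_POLICY[r] * FEEDBACK_MODEL[r][feedback] for r in REF_POLICY}
    total = sum(unnorm.values())
    return {r: w / total for r, w in unnorm.items()}


def collect_offline_pairs(n, seed=0):
    """Sample (response, feedback) pairs: response from the reference policy,
    feedback from the feedback model."""
    rng = random.Random(seed)
    responses = list(REF_POLICY)
    ref_weights = [REF_POLICY[r] for r in responses]
    pairs = []
    for _ in range(n):
        r = rng.choices(responses, weights=ref_weights)[0]
        options = list(FEEDBACK_MODEL[r])
        fb = rng.choices(options, weights=[FEEDBACK_MODEL[r][f] for f in options])[0]
        pairs.append((r, fb))
    return pairs


def fit_conditional_mle(pairs):
    """Tabular MLE of p(response | feedback): empirical conditional frequencies."""
    counts = defaultdict(Counter)
    for r, fb in pairs:
        counts[fb][r] += 1
    return {
        fb: {r: c[r] / sum(c.values()) for r in REF_POLICY}
        for fb, c in counts.items()
    }


pairs = collect_offline_pairs(50_000)
learned = fit_conditional_mle(pairs)
target = exact_posterior("correct and concise")

for r in REF_POLICY:
    print(r, round(learned["correct and concise"][r], 3), round(target[r], 3))
```

Conditioning on the positive feedback string shifts probability away from resp_a and toward resp_b and resp_c relative to the reference policy, which is the sense in which generating "under positive conditions" can improve responses without any scalar reward. An actual FCP would presumably replace the count table with a language model trained on token-level likelihood; that part is not shown here.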
