Text2Grad: Reinforcement Learning from Natural Language Feedback

Abstract

Traditional RLHF optimizes language models with coarse, scalar rewards that mask the fine-grained reasons behind success or failure, leading to slow and opaque learning. Recent work augments RL with textual critiques through prompting or reflection, improving interpretability but leaving model parameters untouched. We introduce Text2Grad, a reinforcement-learning paradigm that turns free-form textual feedback into span-level gradients. Given human (or programmatic) critiques, Text2Grad aligns each feedback phrase with the relevant token spans, converts these alignments into differentiable reward signals, and performs gradient updates that directly refine the offending portions of the model's policy. This yields precise, feedback-conditioned adjustments instead of global nudges. Text2Grad is realized through three components: (1) a high-quality feedback-annotation pipeline that pairs critiques with token spans; (2) a fine-grained reward model that predicts span-level rewards on the answer while generating explanatory critiques; and (3) a span-level policy optimizer that back-propagates natural-language gradients. Across summarization, code generation, and question answering, Text2Grad consistently surpasses scalar-reward RL and prompt-only baselines, providing both higher task metrics and richer interpretability. Our results demonstrate that natural-language feedback, when converted to gradients, is a powerful signal for fine-grained policy optimization. The code for our method is available at https://github.com/microsoft/Text2Grad.
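
To make the contrast between scalar and span-level rewards concrete, the following is a minimal sketch and not the authors' implementation. It assumes critiques have already been aligned to token index ranges and mapped to numeric rewards, and it uses a plain REINFORCE-style surrogate loss as a stand-in for the paper's policy optimizer. The names `SpanCritique`, `token_rewards`, and `policy_gradient_loss` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpanCritique:
    """A critique aligned to a token span (hypothetical structure)."""
    start: int      # index of the first token in the span (inclusive)
    end: int        # index one past the last token in the span (exclusive)
    reward: float   # assumed numeric label, e.g. +1.0 praised, -1.0 criticized
    comment: str    # the free-form feedback phrase tied to this span


def token_rewards(num_tokens: int, critiques: List[SpanCritique]) -> List[float]:
    """Spread each span's reward over the tokens it covers.

    Tokens not covered by any critique keep a reward of 0.0.
    """
    rewards = [0.0] * num_tokens
    for c in critiques:
        for i in range(max(0, c.start), min(num_tokens, c.end)):
            rewards[i] += c.reward
    return rewards


def policy_gradient_loss(token_logprobs: List[float], rewards: List[float]) -> float:
    """REINFORCE-style surrogate: minimize -sum_t r_t * log pi(a_t).

    With per-token rewards, only the criticized or praised spans
    contribute to the gradient signal.
    """
    return -sum(r * lp for r, lp in zip(rewards, token_logprobs))


# Toy example: five generated tokens with two aligned critiques.
tokens = ["The", "cat", "sat", "on", "Mars"]
critiques = [
    SpanCritique(0, 3, 1.0, "accurate description"),
    SpanCritique(4, 5, -1.0, "wrong location"),
]
span_level = token_rewards(len(tokens), critiques)

# A scalar reward, by contrast, gives every token the same credit.
scalar_level = [0.5] * len(tokens)

# Placeholder log-probabilities; a real policy would supply these.
logprobs = [-1.0] * len(tokens)
span_loss = policy_gradient_loss(logprobs, span_level)
```

In this toy setup the token "Mars" alone receives a negative reward, so an update would push down the probability of that token specifically, while a scalar reward would adjust all five tokens equally.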
