RLVF: Learning from Verbal Feedback without Overgeneralization

Abstract

The diversity of contexts in which large language models (LLMs) are deployed requires the ability to modify or customize default model behaviors to incorporate nuanced requirements and preferences. A convenient interface for specifying such adjustments is high-level verbal feedback, such as "Don't use emojis when drafting emails to my boss." However, while writing high-level feedback is far simpler than collecting annotations for reinforcement learning from human feedback (RLHF), we find that simply prompting a model with such feedback leads to overgeneralization of the feedback to contexts where it is not relevant. We study the problem of incorporating verbal feedback without such overgeneralization, which motivates a new method, Contextualized Critiques with Constrained Preference Optimization (C3PO). C3PO uses a piece of high-level feedback to generate a small synthetic preference dataset specifying how the feedback should (and should not) be applied. It then fine-tunes the model in accordance with the synthetic preference data while minimizing the divergence from the original model for prompts where the feedback does not apply. Our experimental results indicate that our approach effectively applies verbal feedback to relevant scenarios while preserving existing behaviors for other contexts. For both human- and GPT-4-generated high-level feedback, C3PO adheres to the given feedback comparably to in-context baselines while reducing overgeneralization by 30%.
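The abstract describes the training objective only in words: follow the synthetic preferences where the feedback applies, and stay close to the original model where it does not. The block below is a minimal sketch of one way such an objective could be written, not the authors' implementation. It assumes a DPO-style logistic preference loss on in-scope pairs and a KL penalty toward the original model on out-of-scope prompts; the function names, the toy log-probabilities, and the weights `beta` and `lam` are all hypothetical.

```python
import math
from statistics import fmean


def preference_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Assumed DPO-style loss for one in-scope pair: -log(sigmoid(margin)).

    Inputs are sequence log-probabilities under the fine-tuned model
    (logp_*) and under the frozen original model (ref_*).
    """
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def kl_divergence(p_ref, p_model):
    """KL(p_ref || p_model) for two discrete distributions.

    Assumes p_model is strictly positive wherever p_ref is positive.
    """
    return sum(p * math.log(p / q) for p, q in zip(p_ref, p_model) if p > 0)


def combined_loss(in_scope_pairs, out_of_scope_dists, beta=0.1, lam=1.0):
    """Preference term on in-scope prompts plus a divergence penalty elsewhere.

    in_scope_pairs: list of (logp_chosen, logp_rejected, ref_chosen, ref_rejected)
    out_of_scope_dists: list of (p_ref, p_model) distribution pairs
    lam: hypothetical weight trading off feedback adherence against drift
    """
    pref = fmean(preference_loss(*pair, beta=beta) for pair in in_scope_pairs)
    drift = fmean(kl_divergence(p_ref, p_model) for p_ref, p_model in out_of_scope_dists)
    return pref + lam * drift


if __name__ == "__main__":
    # Toy numbers: the model prefers the chosen response slightly more than
    # the original model did, and has drifted a little on an unrelated prompt.
    pairs = [(-4.0, -8.0, -5.0, -7.0)]
    dists = [([0.7, 0.3], [0.6, 0.4])]
    print(combined_loss(pairs, dists))
```

In this sketch, a model identical to the original incurs zero penalty on out-of-scope prompts, so any reduction in the total loss must come from following the feedback on in-scope pairs; raising `lam` makes drift on unrelated prompts more costly.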
