The Era of Real-World Human Interaction: RL from User Conversations

Abstract

We posit that to achieve continual model improvement and multifaceted alignment, future models must learn from natural human interaction. Current conversational models are aligned using pre-annotated, expert-generated human feedback. In this work, we introduce Reinforcement Learning from Human Interaction (RLHI), a paradigm that learns directly from in-the-wild user conversations. We develop two complementary methods: (1) RLHI with User-Guided Rewrites, which revises unsatisfactory model outputs based on users' natural-language follow-up responses, and (2) RLHI with User-Based Rewards, which learns via a reward model conditioned on knowledge of the user's long-term interaction history (termed persona). Together, these methods link long-term user personas to turn-level preferences via persona-conditioned preference optimization. Trained on conversations derived from WildChat, both RLHI variants outperform strong baselines in personalization and instruction-following, and similar feedback enhances performance on reasoning benchmarks. These results suggest that organic human interaction offers scalable, effective supervision for personalized alignment.
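As a rough illustration of how a persona-conditioned preference pair might be scored, the sketch below assumes a DPO-style objective. The abstract does not name the exact loss, so this is an assumption, and the names `PreferencePair`, `build_prompt`, `dpo_style_loss`, and the `beta` parameter are hypothetical. It treats a user-guided rewrite as the chosen response and the original model output as the rejected one, with the persona prepended to the prompt so both responses are compared under the same user context.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One training example (hypothetical layout, not taken from the paper)."""
    persona: str   # summary of the user's long-term interaction history
    prompt: str    # the user's current request
    chosen: str    # e.g. a rewrite guided by the user's follow-up message
    rejected: str  # e.g. the original, unsatisfactory model output


def build_prompt(pair: PreferencePair) -> str:
    """Prepend the persona so both responses are scored under the same context."""
    return f"[persona]\n{pair.persona}\n[request]\n{pair.prompt}"


def dpo_style_loss(
    policy_chosen_lp: float,
    policy_rejected_lp: float,
    ref_chosen_lp: float,
    ref_rejected_lp: float,
    beta: float = 0.1,
) -> float:
    """Assumed DPO-style loss: -log(sigmoid(margin)).

    Each argument is the summed log-probability of a response given the
    persona-conditioned prompt, under the policy or a frozen reference model.
    """
    margin = beta * (
        (policy_chosen_lp - ref_chosen_lp)
        - (policy_rejected_lp - ref_rejected_lp)
    )
    # -log(sigmoid(m)) equals softplus(-m); computed in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

In this sketch the loss equals log 2 when the policy and the reference agree, and falls below that as the policy shifts probability toward the chosen response relative to the reference. The log-probabilities would come from whichever language model is being trained; that scoring step is left out here.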
