

The Era of Real-World Human Interaction: RL from User Conversations

September 29, 2025
Authors: Chuanyang Jin, Jing Xu, Bo Liu, Leitian Tao, Olga Golovneva, Tianmin Shu, Wenting Zhao, Xian Li, Jason Weston
cs.AI

Abstract

We posit that to achieve continual model improvement and multifaceted alignment, future models must learn from natural human interaction. Current conversational models are aligned using pre-annotated, expert-generated human feedback. In this work, we introduce Reinforcement Learning from Human Interaction (RLHI), a paradigm that learns directly from in-the-wild user conversations. We develop two complementary methods: (1) RLHI with User-Guided Rewrites, which revises unsatisfactory model outputs based on users' natural-language follow-up responses, and (2) RLHI with User-Based Rewards, which learns via a reward model conditioned on knowledge of the user's long-term interaction history (termed persona). Together, these methods link long-term user personas to turn-level preferences via persona-conditioned preference optimization. Trained on conversations derived from WildChat, both RLHI variants outperform strong baselines in personalization and instruction-following, and similar feedback enhances performance on reasoning benchmarks. These results suggest organic human interaction offers scalable, effective supervision for personalized alignment.
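
To make the idea of persona-conditioned preference optimization concrete, here is a minimal illustrative sketch in PyTorch. It assumes a DPO-style objective and a simple prompt format that prefixes each turn with a persona summary; the abstract does not describe the paper's actual data schema, prompt format, or training code, so the function names (persona_conditioned_dpo_loss, build_preference_pair), field names, and formatting below are hypothetical.

# Minimal sketch of persona-conditioned preference optimization (DPO-style),
# showing how a long-term user persona could condition turn-level preference
# pairs. All names and formats here are assumptions for illustration only.
import torch
import torch.nn.functional as F


def persona_conditioned_dpo_loss(policy_logps_chosen, policy_logps_rejected,
                                 ref_logps_chosen, ref_logps_rejected,
                                 beta=0.1):
    """DPO loss over preference pairs whose prompts include the user's persona
    summary, so the policy learns persona-dependent preferences."""
    policy_margin = policy_logps_chosen - policy_logps_rejected
    ref_margin = ref_logps_chosen - ref_logps_rejected
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()


def build_preference_pair(persona_summary, user_turn, original_reply, rewritten_reply):
    """User-Guided Rewrites (assumed formatting): the rewrite steered by the
    user's follow-up becomes the chosen response, the unsatisfactory original
    becomes the rejected one, and both share a persona-conditioned prompt."""
    prompt = f"[User persona]\n{persona_summary}\n\n[User]\n{user_turn}\n\n[Assistant]\n"
    return {"prompt": prompt, "chosen": rewritten_reply, "rejected": original_reply}


# Example usage with dummy per-pair sequence log-probabilities (batch of 2 pairs).
pair = build_preference_pair(
    persona_summary="Prefers concise answers with code examples.",
    user_turn="How do I reverse a list in Python?",
    original_reply="There are many ways to do this...",
    rewritten_reply="Use `my_list[::-1]` or `list(reversed(my_list))`.",
)
policy_c = torch.tensor([-12.3, -8.1])
policy_r = torch.tensor([-15.0, -9.4])
ref_c = torch.tensor([-13.0, -8.5])
ref_r = torch.tensor([-14.2, -9.1])
loss = persona_conditioned_dpo_loss(policy_c, policy_r, ref_c, ref_r)
print(pair["prompt"][:40], loss.item())

The same loss applies to the User-Based Rewards variant if the chosen/rejected labels come from a persona-conditioned reward model instead of a user-guided rewrite; only the pair-construction step changes.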