The Era of Real-World Human Interaction: RL from User Conversations
September 29, 2025
Authors: Chuanyang Jin, Jing Xu, Bo Liu, Leitian Tao, Olga Golovneva, Tianmin Shu, Wenting Zhao, Xian Li, Jason Weston
cs.AI
Abstract
We posit that to achieve continual model improvement and multifaceted
alignment, future models must learn from natural human interaction. Current
conversational models are aligned using pre-annotated, expert-generated human
feedback. In this work, we introduce Reinforcement Learning from Human
Interaction (RLHI), a paradigm that learns directly from in-the-wild user
conversations. We develop two complementary methods: (1) RLHI with User-Guided
Rewrites, which revises unsatisfactory model outputs based on users'
natural-language follow-up responses, and (2) RLHI with User-Based Rewards, which
learns via a reward model conditioned on knowledge of the user's long-term
interaction history (termed persona). Together, these methods link long-term
user personas to turn-level preferences via persona-conditioned preference
optimization. Trained on conversations derived from WildChat, both RLHI
variants outperform strong baselines in personalization and
instruction-following, and similar feedback enhances performance on reasoning
benchmarks. These results suggest that organic human interaction offers scalable,
effective supervision for personalized alignment.
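To make the "persona-conditioned preference optimization" step concrete, here is a minimal sketch of one plausible reading: turn-level preference pairs (for example, an unsatisfactory output versus its user-guided rewrite, or responses ranked by a user-based reward model) are optimized with a DPO-style objective in which every log-probability is conditioned on a prompt that prepends the user's persona. The function names, the prompt format, and the choice of a DPO-style loss are illustrative assumptions, not details confirmed by the abstract.

```python
# Sketch of persona-conditioned preference optimization under the assumptions above.
import torch
import torch.nn.functional as F


def build_persona_prompt(persona_summary: str, dialogue_context: str) -> str:
    """Hypothetical formatting: a persona distilled from the user's long-term
    interaction history is prepended to the current conversation context."""
    return f"[USER PERSONA]\n{persona_summary}\n\n[CONVERSATION]\n{dialogue_context}"


def persona_conditioned_preference_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(chosen | persona + context)
    policy_rejected_logps: torch.Tensor,  # log p_theta(rejected | persona + context)
    ref_chosen_logps: torch.Tensor,       # same quantities under a frozen reference model
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,                    # assumed KL-tradeoff coefficient
) -> torch.Tensor:
    """DPO-style loss over turn-level preference pairs. The persona enters only
    through the conditioning prompt used when computing the log-probabilities."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # Widen the gap between preferred and dispreferred responses
    # relative to the frozen reference model.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```

In this reading, the two RLHI variants differ only in where the preference pairs come from (user-guided rewrites versus a user-based reward model), while the loss itself is shared and always conditioned on the long-term persona.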