The Era of Real-World Human Interaction: RL from User Conversations
September 29, 2025
Authors: Chuanyang Jin, Jing Xu, Bo Liu, Leitian Tao, Olga Golovneva, Tianmin Shu, Wenting Zhao, Xian Li, Jason Weston
cs.AI
Abstract
We posit that to achieve continual model improvement and multifaceted
alignment, future models must learn from natural human interaction. Current
conversational models are aligned using pre-annotated, expert-generated human
feedback. In this work, we introduce Reinforcement Learning from Human
Interaction (RLHI), a paradigm that learns directly from in-the-wild user
conversations. We develop two complementary methods: (1) RLHI with User-Guided
Rewrites, which revises unsatisfactory model outputs based on users'
natural-language follow-up responses, and (2) RLHI with User-Based Rewards, which
learns via a reward model conditioned on knowledge of the user's long-term
interaction history (termed persona). Together, these methods link long-term
user personas to turn-level preferences via persona-conditioned preference
optimization. Trained on conversations derived from WildChat, both RLHI
variants outperform strong baselines in personalization and
instruction-following, and similar feedback enhances performance on reasoning
benchmarks. These results suggest that organic human interaction offers scalable,
effective supervision for personalized alignment.
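To make the "persona-conditioned preference optimization" step concrete, here is a minimal sketch of one plausible reading: turn-level preference pairs (for example, an unsatisfactory output versus its user-guided rewrite, or responses ranked by a user-based reward model) are optimized with a DPO-style objective in which every log-probability is conditioned on a prompt that prepends the user's persona. The function names, the prompt format, and the choice of a DPO-style loss are illustrative assumptions, not details confirmed by the abstract.

```python
# Sketch of persona-conditioned preference optimization under the assumptions above.
import torch
import torch.nn.functional as F


def build_persona_prompt(persona_summary: str, dialogue_context: str) -> str:
    """Hypothetical formatting: a persona distilled from the user's long-term
    interaction history is prepended to the current conversation context."""
    return f"[USER PERSONA]\n{persona_summary}\n\n[CONVERSATION]\n{dialogue_context}"


def persona_conditioned_preference_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(chosen | persona + context)
    policy_rejected_logps: torch.Tensor,  # log p_theta(rejected | persona + context)
    ref_chosen_logps: torch.Tensor,       # same quantities under a frozen reference model
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,                    # assumed KL-tradeoff coefficient
) -> torch.Tensor:
    """DPO-style loss over turn-level preference pairs. The persona enters only
    through the conditioning prompt used when computing the log-probabilities."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # Widen the gap between preferred and dispreferred responses
    # relative to the frozen reference model.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```

In this reading, the two RLHI variants differ only in where the preference pairs come from (user-guided rewrites versus a user-based reward model), while the loss itself is shared and always conditioned on the long-term persona.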