The Era of Real-World Human Interaction: RL from User Conversations
September 29, 2025
Authors: Chuanyang Jin, Jing Xu, Bo Liu, Leitian Tao, Olga Golovneva, Tianmin Shu, Wenting Zhao, Xian Li, Jason Weston
cs.AI
Abstract
We posit that to achieve continual model improvement and multifaceted alignment, future models must learn from natural human interaction. Current conversational models are aligned using pre-annotated, expert-generated human feedback. In this work, we introduce Reinforcement Learning from Human Interaction (RLHI), a paradigm that learns directly from in-the-wild user conversations. We develop two complementary methods: (1) RLHI with User-Guided Rewrites, which revises unsatisfactory model outputs based on users' natural-language follow-up responses, and (2) RLHI with User-Based Rewards, which learns via a reward model conditioned on knowledge of the user's long-term interaction history (termed persona). Together, these methods link long-term user personas to turn-level preferences via persona-conditioned preference optimization. Trained on conversations derived from WildChat, both RLHI variants outperform strong baselines in personalization and instruction-following, and similar feedback enhances performance on reasoning benchmarks. These results suggest that organic human interaction offers scalable, effective supervision for personalized alignment.
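
The abstract does not spell out the training objective, so the following is only a minimal sketch of what persona-conditioned preference optimization could look like if it follows a DPO-style objective in which the user persona is part of the conditioning context. The function name persona_dpo_loss, the beta value, and the way chosen/rejected responses are paired (e.g., a user-guided rewrite versus the unsatisfactory original, or the higher- versus lower-scoring output under the user-based reward) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of a persona-conditioned,
# DPO-style preference loss. Each log-probability is assumed to be computed
# for a response conditioned on (user persona, prompt).
import torch
import torch.nn.functional as F

def persona_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                     ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Inputs are summed log-probs of each response given (persona, prompt).

    chosen   = response preferred under the user's persona (assumption:
               e.g., a user-guided rewrite or the reward model's pick)
    rejected = the unsatisfactory original output
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Standard Bradley-Terry / DPO objective applied to the
    # persona-conditioned log-ratios.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Example with dummy log-probabilities for a batch of two preference pairs.
if __name__ == "__main__":
    lp = lambda *vals: torch.tensor(vals, dtype=torch.float32)
    loss = persona_dpo_loss(lp(-12.0, -9.5), lp(-14.2, -11.0),
                            lp(-12.5, -9.8), lp(-13.9, -10.7))
    print(loss.item())
```

In this reading, personalization enters only through the conditioning context (the persona prepended to the prompt when log-probabilities are computed), while the loss itself stays a turn-level pairwise preference objective; the paper may of course realize the link between personas and turn-level preferences differently.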