The Era of Real-World Human Interaction: RL from User Conversations
September 29, 2025
Authors: Chuanyang Jin, Jing Xu, Bo Liu, Leitian Tao, Olga Golovneva, Tianmin Shu, Wenting Zhao, Xian Li, Jason Weston
cs.AI
Abstract
We posit that to achieve continual model improvement and multifaceted alignment, future models must learn from natural human interaction. Current conversational models are aligned using pre-annotated, expert-generated human feedback. In this work, we introduce Reinforcement Learning from Human Interaction (RLHI), a paradigm that learns directly from in-the-wild user conversations. We develop two complementary methods: (1) RLHI with User-Guided Rewrites, which revises unsatisfactory model outputs based on users' natural-language follow-up responses, and (2) RLHI with User-Based Rewards, which learns via a reward model conditioned on knowledge of the user's long-term interaction history (termed persona). Together, these methods link long-term user personas to turn-level preferences via persona-conditioned preference optimization. Trained on conversations derived from WildChat, both RLHI variants outperform strong baselines in personalization and instruction-following, and similar feedback enhances performance on reasoning benchmarks. These results suggest that organic human interaction offers scalable, effective supervision for personalized alignment.
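
The abstract does not spell out the training objective, so the following is only a minimal sketch of what persona-conditioned preference optimization could look like if it follows a DPO-style objective in which the user persona is part of the conditioning context. The function name persona_dpo_loss, the beta value, and the way chosen/rejected responses are paired (e.g., a user-guided rewrite versus the unsatisfactory original, or the higher- versus lower-scoring output under the user-based reward) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of a persona-conditioned,
# DPO-style preference loss. Each log-probability is assumed to be computed
# for a response conditioned on (user persona, prompt).
import torch
import torch.nn.functional as F

def persona_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                     ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Inputs are summed log-probs of each response given (persona, prompt).

    chosen   = response preferred under the user's persona (assumption:
               e.g., a user-guided rewrite or the reward model's pick)
    rejected = the unsatisfactory original output
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Standard Bradley-Terry / DPO objective applied to the
    # persona-conditioned log-ratios.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Example with dummy log-probabilities for a batch of two preference pairs.
if __name__ == "__main__":
    lp = lambda *vals: torch.tensor(vals, dtype=torch.float32)
    loss = persona_dpo_loss(lp(-12.0, -9.5), lp(-14.2, -11.0),
                            lp(-12.5, -9.8), lp(-13.9, -10.7))
    print(loss.item())
```

In this reading, personalization enters only through the conditioning context (the persona prepended to the prompt when log-probabilities are computed), while the loss itself stays a turn-level pairwise preference objective; the paper may of course realize the link between personas and turn-level preferences differently.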