The Era of Real-World Human Interaction: RL from User Conversations

Abstract

We posit that to achieve continual model improvement and multifaceted alignment, future models must learn from natural human interaction. Current conversational models are aligned using pre-annotated, expert-generated human feedback. In this work, we introduce Reinforcement Learning from Human Interaction (RLHI), a paradigm that learns directly from in-the-wild user conversations. We develop two complementary methods: (1) RLHI with User-Guided Rewrites, which revises unsatisfactory model outputs based on users' natural-language follow-up responses; and (2) RLHI with User-Based Rewards, which learns via a reward model conditioned on knowledge of the user's long-term interaction history (termed the persona). Together, these methods link long-term user personas to turn-level preferences via persona-conditioned preference optimization. Trained on conversations derived from WildChat, both RLHI variants outperform strong baselines in personalization and instruction-following, and similar feedback also improves performance on reasoning benchmarks. These results suggest that organic human interaction offers scalable, effective supervision for personalized alignment.
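
As a rough illustration of how a user-guided rewrite could become a persona-conditioned preference pair, the sketch below pairs the original reply (rejected) with its rewrite (chosen) and scores the pair with a DPO-style loss. The abstract only says "persona-conditioned preference optimization"; the choice of DPO, the prompt layout, and every name here (`PreferencePair`, `build_pair`, `build_prompt`, `dpo_loss`, `beta`) are assumptions made for illustration, not the authors' implementation.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One persona-conditioned training example (hypothetical structure)."""
    persona: str   # summary of the user's long-term interaction history
    context: str   # conversation so far, ending at the user's request
    chosen: str    # rewrite guided by the user's follow-up message
    rejected: str  # original reply the user was unsatisfied with


def build_pair(persona, context, original_reply, rewritten_reply):
    """Treat the user-guided rewrite as preferred over the original reply."""
    return PreferencePair(persona, context, rewritten_reply, original_reply)


def build_prompt(pair):
    """Condition on the persona by prepending it to the conversation (assumed layout)."""
    return f"[Persona]\n{pair.persona}\n\n[Conversation]\n{pair.context}"


def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one pair, given summed token log-probabilities.

    Each *_lp argument is log p(response | build_prompt(pair)) under the
    policy or the frozen reference model.
    """
    margin = beta * ((policy_chosen_lp - ref_chosen_lp)
                     - (policy_rejected_lp - ref_rejected_lp))
    # -log(sigmoid(margin)) == softplus(-margin), computed in a stable form.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

In this sketch the loss falls as the policy assigns relatively more probability to the rewrite than the reference model does, which is the behavior a preference-optimization objective of this kind is meant to encourage.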
