Learning User Simulators with Turing Rewards

Abstract

Learning to simulate human users in interactive settings could advance the training of agent assistants, the evaluation of personalization systems, research in the social sciences, and more. Existing approaches generally train a large language model (LLM) to match a single ground-truth response, either by maximizing its log probability or by optimizing a similarity reward. We instead propose Turing-RL, a Turing-Test-based reinforcement learning approach for training user simulator models. Turing-RL uses a discriminative Turing reward: an LLM judge scores how indistinguishable a generated response is from the real user's response, given the user's history. The user simulator LLM is trained with this reward to produce responses that are indistinguishable from what the user could have said. Across two different domains, conversational chat and Reddit forum discussion, Turing-RL consistently outperforms baseline methods on both LLM and human evaluation metrics. Our study suggests that optimizing for indistinguishability, rather than response matching, is effective for learning user simulators.
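
As a rough illustration of what a discriminative Turing reward could look like, the following is a minimal sketch, not the method as implemented in this work. The abstract does not specify the judge protocol, so everything here is an assumption: the pairwise format (the judge sees the real and the generated response side by side), the hypothetical `Judge` signature, the random ordering of candidates, and the toy word-overlap judge that stands in for an LLM judge.

```python
import random
from typing import Callable, Optional, Sequence

# Hypothetical judge interface (an assumption, not from the paper): given the
# user's history and two candidate responses A and B, return the probability
# that candidate A is the response the real user wrote.
Judge = Callable[[Sequence[str], str, str], float]


def turing_reward(
    judge: Judge,
    history: Sequence[str],
    real: str,
    generated: str,
    rng: Optional[random.Random] = None,
) -> float:
    """Probability the judge assigns to the generated response being the real one."""
    rng = rng or random.Random()
    # Randomize which slot the generated response occupies so that a judge
    # with a position preference cannot be exploited by the simulator.
    if rng.random() < 0.5:
        p_generated = judge(history, generated, real)
    else:
        p_generated = 1.0 - judge(history, real, generated)
    # Clamp in case the judge returns a value slightly outside [0, 1].
    return min(1.0, max(0.0, p_generated))


def toy_overlap_judge(history: Sequence[str], a: str, b: str) -> float:
    """Stand-in for an LLM judge: favors the candidate that reuses more history words."""
    vocab = {word for turn in history for word in turn.lower().split()}

    def score(text: str) -> int:
        return sum(word in vocab for word in text.lower().split())

    score_a, score_b = score(a), score(b)
    # Add-one smoothing keeps the output strictly inside (0, 1).
    return (score_a + 1) / (score_a + score_b + 2)


history = ["i love hiking in the alps"]
real = "the alps were great"
on_topic = turing_reward(toy_overlap_judge, history, real, "hiking in the alps again")
off_topic = turing_reward(toy_overlap_judge, history, real, "quarterly revenue increased")
print(on_topic, off_topic)
```

Under these assumptions, a response that the judge cannot tell apart from the real one earns a reward near 0.5 or above, while an obviously out-of-character response earns a reward near 0. A reinforcement learning loop would then use this scalar in place of a similarity score against the single ground-truth response.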