Learning User Simulators with Turing Rewards

Abstract

Learning to simulate human users in interactive settings could advance the training of agent assistants, the evaluation of personalization systems, research in the social sciences, and more. Existing approaches generally train a large language model (LLM) to match a single ground-truth response, either by maximizing its log probability or by using a similarity reward. We instead propose Turing-RL, a Turing-test-based reinforcement learning approach for training user simulator models. Turing-RL uses a discriminative Turing reward: an LLM judge scores how indistinguishable a generated response is from the real user's response, given the user's history. Trained with this reward, the user simulator LLM learns to produce responses that are indistinguishable from what the user could have said. Across two different domains, conversational chat and Reddit forum discussion, we find that Turing-RL consistently outperforms baseline methods on both LLM and human evaluation metrics. Our study suggests that optimizing for indistinguishability, rather than response matching, is effective for learning user simulators.
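To make the idea of a discriminative Turing reward concrete, here is a minimal sketch of one way such a reward could be computed. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the judge's interface, so the pairwise form, the `Judge` signature, the order-averaging step, and the word-overlap `toy_judge` (a stand-in for an LLM judge) are all hypothetical.

```python
from typing import Callable, Sequence

# Hypothetical judge interface (an assumption, not from the paper):
# returns the probability that the FIRST candidate is the real user's response.
Judge = Callable[[Sequence[str], str, str], float]


def turing_reward(judge: Judge, history: Sequence[str], generated: str, real: str) -> float:
    """Probability that the judge takes the generated response for the real one.

    Both presentation orders are scored and averaged to damp position bias.
    """
    # P(generated is real) when the generated response is shown first.
    p_gen_first = judge(history, generated, real)
    # P(generated is real) when the generated response is shown second.
    p_gen_second = 1.0 - judge(history, real, generated)
    return 0.5 * (p_gen_first + p_gen_second)


def toy_judge(history: Sequence[str], first: str, second: str) -> float:
    """Stand-in for an LLM judge: favors the candidate sharing more words with the history."""
    vocab = {word for turn in history for word in turn.lower().split()}

    def overlap(text: str) -> int:
        return sum(word in vocab for word in text.lower().split())

    a, b = overlap(first), overlap(second)
    # With no evidence either way, the judge is at chance.
    return 0.5 if a + b == 0 else a / (a + b)


history = ["i love hiking in the alps", "my dog comes along"]
real = "the alps were great with my dog"
off_topic = "quarterly revenue exceeded expectations"

reward_match = turing_reward(toy_judge, history, real, real)
reward_off = turing_reward(toy_judge, history, off_topic, real)
```

In this pairwise formulation a reward of 0.5 means the judge is at chance, which is the indistinguishability target, while a reward near 0 means the generated response is easily identified as synthetic. In a reinforcement learning loop, a scalar like this would serve as the reward for each sampled response from the simulator.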