Learning User Simulators with Turing Rewards

Abstract

Learning to simulate human users in interactive settings could advance the training of agent assistants, the evaluation of personalization systems, research in the social sciences, and more. Existing approaches generally train a large language model (LLM) to match a single ground-truth response, either by maximizing its log probability or by using a similarity reward. We instead propose Turing-RL, a Turing-Test-based reinforcement learning approach for training user simulator models. Turing-RL uses a discriminative Turing reward: an LLM judge scores how indistinguishable a generated response is from the real user's response given the user's history, and the user simulator LLM is trained on this reward to produce responses indistinguishable from what the user could have said. Across two different domains (conversational chat and Reddit forum discussion), we find that Turing-RL consistently outperforms baseline methods on both LLM and human evaluation metrics. Our study suggests that optimizing for indistinguishability, rather than response matching, is effective for learning user simulators.
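
To make the idea of a discriminative Turing reward concrete, the following is a minimal sketch and not the paper's implementation. It assumes a pairwise judge that sees the user's history and two candidate responses and returns the probability that the first candidate is the real one. The names `Judge`, `turing_reward`, and `toy_judge` are hypothetical, and the word-overlap judge is only a stand-in for an LLM judge so that the example runs. Averaging over both presentation orders is also an assumption, added here to cancel position bias.

```python
from typing import Callable, Sequence

# Hypothetical judge interface (an assumption, not from the paper):
# judge(history, first, second) -> P(first candidate is the real user's response)
Judge = Callable[[Sequence[str], str, str], float]


def turing_reward(judge: Judge, history: Sequence[str],
                  generated: str, real: str) -> float:
    """Reward in [0, 1]: how strongly the judge mistakes `generated` for the real response.

    A value near 0.5 means the judge cannot tell the two responses apart.
    """
    # Probability the judge picks the generated response when it is shown first.
    p_first = judge(history, generated, real)
    # Probability the judge picks the generated response when it is shown second.
    p_second = 1.0 - judge(history, real, generated)
    # Average over both orders to cancel position bias (an assumption of this sketch).
    return 0.5 * (p_first + p_second)


def toy_judge(history: Sequence[str], first: str, second: str) -> float:
    """Stand-in for an LLM judge: favors the candidate sharing more words with the history."""
    vocab = {word for turn in history for word in turn.lower().split()}

    def overlap(text: str) -> int:
        return sum(word in vocab for word in text.lower().split())

    a, b = overlap(first), overlap(second)
    return 0.5 if a + b == 0 else a / (a + b)


if __name__ == "__main__":
    history = ["i love hiking in the alps"]
    real = "the alps were great"
    print(turing_reward(toy_judge, history, "hiking in the alps again", real))
    print(turing_reward(toy_judge, history, "stock prices fell", real))
```

In a reinforcement learning loop, a scalar of this kind would serve as the per-response reward for the user simulator; how the actual judge is prompted or trained is not specified in the abstract.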