Learning User Simulators with Turing Rewards

Abstract

Learning to simulate human users in interactive settings could advance the training of agentic assistants, the evaluation of personalization systems, research in the social sciences, and more. Existing approaches generally train a large language model (LLM) to match a single ground-truth response, either by maximizing its log-probability or by optimizing a similarity reward. We instead propose Turing-RL, a reinforcement learning approach based on the Turing Test for training user simulator models. Turing-RL uses a discriminative Turing reward: an LLM judge scores how indistinguishable a generated response is from the real user's response, given the user's history. Trained on this reward, the user simulator LLM learns to produce responses that are indistinguishable from what the user could have said. Across two domains, conversational chat and Reddit forum discussion, Turing-RL consistently outperforms baseline methods on both LLM-based and human evaluation metrics. Our study suggests that optimizing for indistinguishability, rather than response matching, is effective for learning user simulators.
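
As a rough illustration of what a discriminative Turing reward might look like, the sketch below scores a generated response by how strongly a judge believes it, rather than the real response, came from the user. Everything here is an assumption made for illustration and is not taken from the paper: the pairwise formulation, the `Judge` interface, the function name `turing_reward`, the randomized presentation order, and the toy length-based judge that stands in for an LLM.

```python
import random
from typing import Callable, Sequence

# Hypothetical judge interface (an assumption, not the paper's API): given the
# user's history and two candidates, return P(candidate A is the real user's).
Judge = Callable[[Sequence[str], str, str], float]


def turing_reward(
    judge: Judge,
    history: Sequence[str],
    real_response: str,
    generated_response: str,
    rng: random.Random,
) -> float:
    """Return the judge's belief that the generated response is the real one.

    The presentation order is randomized so that a position-biased judge
    cannot be exploited. The result lies in [0, 1]; a value near 0.5 means
    the judge cannot tell the two responses apart.
    """
    if rng.random() < 0.5:
        # Generated response is shown as candidate A.
        return judge(history, generated_response, real_response)
    # Generated response is shown as candidate B.
    return 1.0 - judge(history, real_response, generated_response)


def toy_judge(history: Sequence[str], cand_a: str, cand_b: str) -> float:
    """Toy stand-in for an LLM judge: prefers the candidate whose length is
    closer to the user's average message length. Illustration only."""
    mean_len = sum(len(m) for m in history) / len(history)
    dist_a = abs(len(cand_a) - mean_len)
    dist_b = abs(len(cand_b) - mean_len)
    if dist_a == dist_b:
        return 0.5
    return 1.0 if dist_a < dist_b else 0.0


history = ["ok thanks", "sounds good"]
real = "cool, will do"
off_style = "Certainly! Here is a detailed explanation of every step involved."
in_style = "yep, will do!"

rng = random.Random(0)
reward_off_style = turing_reward(toy_judge, history, real, off_style, rng)
reward_in_style = turing_reward(toy_judge, history, real, in_style, rng)
```

In a reinforcement learning loop, a scalar like this would take the place of a similarity score against the single reference response. With the toy judge, the verbose assistant-like reply is easy to tell apart from the user's terse style and earns a low reward, while the terse reply that the judge cannot separate from the real one earns 0.5.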