Learning User Simulators with Turing Rewards

Abstract

Learning to simulate human users in interactive settings could advance the training of agent assistants, the evaluation of personalization systems, and research in the social sciences, among other applications. Existing approaches generally train a large language model (LLM) to match a single ground-truth response, either by maximizing its log probability or by optimizing a similarity reward. We instead propose Turing-RL, a Turing-test-based reinforcement learning approach for training user simulator models. Turing-RL uses a discriminative Turing reward: an LLM judge scores how indistinguishable a generated response is from the real user's response, given the user's history. Trained with this reward, the user simulator LLM learns to produce responses that are indistinguishable from what the user could have said. Across two domains, conversational chat and Reddit forum discussion, we find that Turing-RL consistently outperforms baseline methods on both LLM-based and human evaluation metrics. Our study suggests that optimizing for indistinguishability, rather than response matching, is effective for learning user simulators.
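
To make the idea of a discriminative Turing reward concrete, the following is a minimal sketch, not the paper's implementation. It assumes a pairwise protocol in which the judge sees the user's history and two candidate responses in random order and names the one it believes is real; the reward is the fraction of trials in which it picks the generated response. The names `turing_reward`, `Judge`, and `overlap_judge` are hypothetical, and `overlap_judge` is a toy word-overlap stand-in for an LLM judge.

```python
import random
from typing import Callable, Optional, Sequence

# Hypothetical judge interface (an assumption, not from the paper): given the
# user's history and two candidates, return the index (0 or 1) of the response
# the judge believes was written by the real user.
Judge = Callable[[Sequence[str], str, str], int]


def turing_reward(
    judge: Judge,
    history: Sequence[str],
    real: str,
    generated: str,
    n_trials: int = 8,
    rng: Optional[random.Random] = None,
) -> float:
    """Fraction of trials in which the judge mistakes `generated` for the real response."""
    rng = rng or random.Random(0)
    fooled = 0
    for _ in range(n_trials):
        # Randomize presentation order so the judge cannot exploit position.
        generated_first = rng.random() < 0.5
        a, b = (generated, real) if generated_first else (real, generated)
        picked = judge(history, a, b)
        # The judge picked the generated response if its choice matches the
        # slot where the generated response was placed.
        if (picked == 0) == generated_first:
            fooled += 1
    return fooled / n_trials


def overlap_judge(history: Sequence[str], a: str, b: str) -> int:
    """Toy stand-in for an LLM judge: prefers the response sharing more words with the history."""
    vocab = {w for turn in history for w in turn.lower().split()}

    def score(text: str) -> int:
        return len(vocab & set(text.lower().split()))

    return 0 if score(a) >= score(b) else 1


history = ["i love hiking in the alps"]
real = "the alps were great"
on_style = turing_reward(overlap_judge, history, real, "i love hiking the alps again")
off_style = turing_reward(overlap_judge, history, real, "greetings esteemed colleague")
print(on_style, off_style)
```

Under this protocol a reward near 0.5 means the judge is at chance, while values above 0.5 mean it prefers the generated response to the real one. In a reinforcement learning loop, a scalar of this kind would serve as the per-response reward for the simulator policy; how the paper actually prompts the judge and shapes the reward is not specified in the abstract.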