Efficient Agent Evaluation via Diversity-Guided User Simulation

Abstract

Large language models (LLMs) are increasingly deployed as customer-facing agents, yet evaluating their reliability remains challenging because their interactions are stochastic and span multiple turns. Current evaluation protocols estimate success rates with linear Monte Carlo rollouts of complete agent-user conversations. This approach is computationally inefficient, since it repeatedly regenerates identical conversation prefixes, and it often fails to uncover deep failure modes that arise from rare user behaviors. We introduce DIVERT (Diversity-Induced Evaluation via Branching of Trajectories), an efficient, snapshot-based, coverage-guided user simulation framework for systematic exploration of agent-user interactions. DIVERT captures the full agent-environment state at critical decision points and resumes execution from these snapshots, which lets it reuse shared conversation prefixes and reduces redundant computation. At each such decision point, the framework branches using targeted, diversity-inducing user responses, allowing directed exploration of alternative interaction paths. By focusing evaluation on semantically diverse and underexplored trajectories, DIVERT improves both efficiency and coverage. Empirical results show that it discovers more failures per token than standard linear rollout protocols, while expanding the set of tasks on which failures are identified.
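
To make the prefix-reuse argument concrete, the following is a minimal Python sketch of snapshot-and-branch evaluation. It is an illustration under stated assumptions, not the DIVERT implementation: the names Snapshot, agent_reply, run_prefix, and branch_from are hypothetical, the agent is a deterministic stub standing in for an LLM call, and the user variants are hand-written placeholders for diversity-inducing simulated user responses.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """Toy stand-in for the full agent-environment state at a decision point."""
    history: list = field(default_factory=list)


def agent_reply(history, counter):
    """Stub agent: each call counts as one unit of compute (hypothetical)."""
    counter["agent_calls"] += 1
    return f"agent-turn-{len(history)}"


def run_prefix(user_turns, counter):
    """Play a fixed sequence of user turns and return the resulting state."""
    snap = Snapshot()
    for user_msg in user_turns:
        snap.history.append(user_msg)
        snap.history.append(agent_reply(snap.history, counter))
    return snap


def branch_from(snapshot, user_variants, counter):
    """Resume from a snapshot once per variant; the shared prefix is not replayed."""
    trajectories = []
    for variant in user_variants:
        state = copy.deepcopy(snapshot)  # isolate each branch from the others
        state.history.append(variant)
        state.history.append(agent_reply(state.history, counter))
        trajectories.append(state.history)
    return trajectories


prefix = ["I want to return an item.", "My order number is 123."]
variants = [
    "Actually, cancel the return.",
    "Can I exchange it instead?",
    "I lost the receipt.",
]

# Linear rollouts: every conversation regenerates the shared prefix.
linear = {"agent_calls": 0}
for v in variants:
    run_prefix(prefix + [v], linear)

# Branching: the prefix is generated once, then each variant resumes from it.
branched = {"agent_calls": 0}
snap = run_prefix(prefix, branched)
trajectories = branch_from(snap, variants, branched)

print(linear["agent_calls"], branched["agent_calls"])
```

In this toy setting, a shared prefix of p agent turns and k branches costs k(p + 1) agent calls under linear rollouts but p + k calls with branching. How decision points are chosen and how diverse user responses are generated are left out of the sketch, since the abstract does not specify them.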