

MirrorBench: An Extensible Framework to Evaluate User-Proxy Agents for Human-Likeness

January 13, 2026
Authors: Ashutosh Hathidara, Julien Yu, Vaishali Senthil, Sebastian Schreiber, Anil Babu Ankisettipalli
cs.AI

Abstract

Large language models (LLMs) are increasingly used as human simulators, both for evaluating conversational systems and for generating fine-tuning data. However, naive "act-as-a-user" prompting often yields verbose, unrealistic utterances, underscoring the need for principled evaluation of so-called user proxy agents. We present MIRRORBENCH, a reproducible, extensible benchmarking framework that evaluates user proxies solely on their ability to produce human-like user utterances across diverse conversational tasks, explicitly decoupled from downstream task success. MIRRORBENCH features a modular execution engine with typed interfaces, metadata-driven registries, multi-backend support, caching, and robust observability. The system supports pluggable user proxies, datasets, tasks, and metrics, enabling researchers to evaluate arbitrary simulators under a uniform, variance-aware harness. We include three lexical-diversity metrics (MATTR, YULE'S K, and HD-D) and three LLM-judge-based metrics (GTEval, Pairwise Indistinguishability, and Rubric-and-Reason). Across four open datasets, MIRRORBENCH yields variance-aware results and reveals systematic gaps between user proxies and real human users. The framework is open source and includes a simple command-line interface for running experiments, managing configurations and caching, and generating reports. The framework can be accessed at https://github.com/SAP/mirrorbench.
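For intuition about the lexical-diversity metrics named above, two of them (MATTR and Yule's K) can be sketched in a few lines of Python. This is an illustrative sketch only, not MirrorBench's implementation; the function names and the default window size are my own, and the inputs are assumed to be non-empty token lists:

```python
from collections import Counter


def mattr(tokens, window=50):
    """Moving-Average Type-Token Ratio: the mean type-token ratio
    over all sliding windows of fixed size, which reduces the
    text-length sensitivity of plain TTR."""
    if len(tokens) < window:
        return len(set(tokens)) / len(tokens)
    ttrs = [len(set(tokens[i:i + window])) / window
            for i in range(len(tokens) - window + 1)]
    return sum(ttrs) / len(ttrs)


def yules_k(tokens):
    """Yule's K, a frequency-spectrum statistic of vocabulary
    richness: lower values indicate a more diverse vocabulary."""
    n = len(tokens)
    # V_i = number of distinct types that occur exactly i times
    freq_of_freqs = Counter(Counter(tokens).values())
    s2 = sum(i * i * v for i, v in freq_of_freqs.items())
    return 10_000 * (s2 - n) / (n * n)
```

For example, the repetitive token sequence `["a", "a", "a", "b"]` scores lower on MATTR (with a small window) and higher on Yule's K than a perfectly alternating sequence of the same length, matching the intuition that verbose, repetitive proxy utterances are less human-like lexically.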