Benchmarking: Everything Everywhere All at Once

Abstract

Benchmarks are fundamental to evaluating and advancing large language models (LLMs) and multimodal large language models (MLLMs) because they provide standardized, explicit measures of performance. However, constructing them is labor-intensive and the results are hard to reuse, which raises concerns about sustainability and scalability. Moreover, existing benchmarks often reach performance saturation soon after release and then fail to discriminate among state-of-the-art models. To address these challenges, we introduce Benchmark Agent, a fully autonomous agentic system designed for benchmark construction. Our framework orchestrates the complete construction pipeline, from user query analysis and subtask design to data annotation and quality control. To assess Benchmark Agent, we apply it to produce 15 representative benchmarks spanning diverse evaluation scenarios, including text understanding, multimodal understanding, and domain-specific reasoning. Extensive experiments, including human evaluation, LLM-as-a-judge assessment, and consistency checks, demonstrate that Benchmark Agent can generate high-quality benchmark samples with minimal human involvement. More importantly, continual evaluation yields several findings, including that current models still struggle with certain domain-specific reasoning tasks. We believe that rapidly evolving benchmarks can contribute significantly to the research community. The preview and code will be publicly available at the demo page and code repository.
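
The four stages named in the abstract (user query analysis, subtask design, data annotation, quality control) can be pictured as a simple sequential pipeline. The sketch below is only an illustration under stated assumptions: every name in it (Sample, BenchmarkDraft, analyze_query, design_subtasks, annotate, quality_control, build_benchmark) is hypothetical, the stage bodies are placeholders, and nothing here reflects the actual Benchmark Agent implementation, which the abstract does not describe.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Sample:
    """One benchmark item (hypothetical structure)."""
    subtask: str
    question: str
    answer: str = ""


@dataclass
class BenchmarkDraft:
    """State passed between pipeline stages (hypothetical structure)."""
    query: str
    subtasks: List[str] = field(default_factory=list)
    samples: List[Sample] = field(default_factory=list)


def analyze_query(query: str) -> BenchmarkDraft:
    # Stage 1 (placeholder): only trims whitespace from the user request.
    return BenchmarkDraft(query=query.strip())


def design_subtasks(draft: BenchmarkDraft) -> BenchmarkDraft:
    # Stage 2 (placeholder): a real system would propose subtasks with a model;
    # here two fixed subtask kinds are derived from the query.
    draft.subtasks = [f"{draft.query}: {kind}" for kind in ("recall", "reasoning")]
    return draft


def annotate(draft: BenchmarkDraft, annotator: Callable[[str], str]) -> BenchmarkDraft:
    # Stage 3: build one sample per subtask, using the caller-supplied annotator
    # (an assumption standing in for automated labeling) to produce the answer.
    draft.samples = [
        Sample(subtask=s, question=f"Question about {s}", answer=annotator(s))
        for s in draft.subtasks
    ]
    return draft


def quality_control(draft: BenchmarkDraft, is_valid: Callable[[Sample], bool]) -> BenchmarkDraft:
    # Stage 4: keep only the samples that pass the caller-supplied check.
    draft.samples = [s for s in draft.samples if is_valid(s)]
    return draft


def build_benchmark(
    query: str,
    annotator: Callable[[str], str],
    is_valid: Callable[[Sample], bool],
) -> BenchmarkDraft:
    # Run the four stages in the order the abstract lists them.
    draft = analyze_query(query)
    draft = design_subtasks(draft)
    draft = annotate(draft, annotator)
    return quality_control(draft, is_valid)
```

Passing the annotator and the validity check in as callables keeps the sketch independent of any particular model or judging method; in practice these would be the places where an LLM annotator, an LLM-as-a-judge, or a consistency check is plugged in.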