Benchmark Everything, Everywhere, All at Once

Abstract

Benchmarks are fundamental to evaluating and advancing large language models (LLMs) and multimodal large language models (MLLMs) because they provide standardized, explicit measures of performance. However, constructing them is labor-intensive and the results are hard to reuse, which raises concerns about sustainability and scalability. Moreover, existing benchmarks often reach performance saturation soon after release and can then no longer discriminate among state-of-the-art models. To address these challenges, we introduce Benchmark Agent, a fully autonomous agentic system designed for benchmark construction. Our framework orchestrates the complete construction pipeline, from user query analysis and subtask design to data annotation and quality control. To assess Benchmark Agent, we use it to build 15 representative benchmarks spanning diverse evaluation scenarios, including text understanding, multimodal understanding, and domain-specific reasoning. Extensive experiments, including human evaluation, LLM-as-a-judge assessment, and consistency checks, demonstrate that Benchmark Agent can generate high-quality benchmark samples with minimal human involvement. More importantly, through continual evaluation we observe several insightful findings, for example that current models still struggle with certain domain-specific reasoning tasks. We believe that rapidly evolving benchmarks can contribute significantly to the research community. The preview and code will be publicly available at the demo page and code repository.