One-Eval: An Agentic System for Automated and Traceable LLM Evaluation
March 10, 2026
Authors: Chengyu Shen, Yanheng Hou, Minghui Pan, Runming He, Zhen Hao Wong, Meiyi Qiang, Zhou Liu, Hao Liang, Peichao Lai, Zeang Sheng, Wentao Zhang
cs.AI
Abstract
Reliable evaluation is essential for developing and deploying large language models, yet in practice it often requires substantial manual effort: practitioners must identify appropriate benchmarks, reproduce heterogeneous evaluation codebases, configure dataset schema mappings, and interpret aggregated metrics. To address these challenges, we present One-Eval, an agentic evaluation system that converts natural-language evaluation requests into executable, traceable, and customizable evaluation workflows. One-Eval integrates (i) NL2Bench for intent structuring and personalized benchmark planning, (ii) BenchResolve for benchmark resolution, automatic dataset acquisition, and schema normalization to ensure executability, and (iii) Metrics & Reporting for task-aware metric selection and decision-oriented reporting beyond scalar scores. The system further incorporates human-in-the-loop checkpoints for review, editing, and rollback, while preserving sample evidence trails for debugging and auditability. Experiments show that One-Eval can execute end-to-end evaluations from diverse natural-language requests with minimal user effort, supporting more efficient and reproducible evaluation in industrial settings. Our framework is publicly available at https://github.com/OpenDCAI/One-Eval.
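The three-stage pipeline described in the abstract (intent structuring, benchmark resolution, metrics and reporting, with a human checkpoint between stages) can be sketched as below. This is a minimal illustrative sketch only: every class, function, and field name here is a hypothetical assumption, not the actual One-Eval API.

```python
# Hypothetical sketch of the One-Eval pipeline stages named in the abstract.
# All identifiers below are illustrative assumptions, not the real API.
from dataclasses import dataclass, field

@dataclass
class EvaluationPlan:
    benchmarks: list                  # benchmarks planned from the user's intent
    schema_mappings: dict             # dataset-field -> expected-field mappings
    metrics: list                     # task-aware metrics chosen for reporting
    evidence: list = field(default_factory=list)  # per-sample evidence trail

def nl2bench(request: str) -> EvaluationPlan:
    """Stage (i): structure the natural-language intent into a benchmark plan."""
    # Toy keyword matching; the described system uses LLM-based planning.
    benchmarks = ["gsm8k"] if "math" in request.lower() else ["mmlu"]
    return EvaluationPlan(benchmarks=benchmarks, schema_mappings={}, metrics=[])

def bench_resolve(plan: EvaluationPlan) -> EvaluationPlan:
    """Stage (ii): resolve benchmarks, acquire data, normalize schemas."""
    plan.schema_mappings = {b: {"question": "input", "answer": "target"}
                            for b in plan.benchmarks}
    return plan

def metrics_and_report(plan: EvaluationPlan) -> dict:
    """Stage (iii): select task-aware metrics, emit a decision-oriented report."""
    plan.metrics = ["exact_match"]
    plan.evidence.append("sample-level records retained for audit")
    return {"benchmarks": plan.benchmarks, "metrics": plan.metrics}

# A human-in-the-loop checkpoint would sit between stages, letting the user
# review, edit, or roll back the plan before execution proceeds.
plan = bench_resolve(nl2bench("evaluate my model on math word problems"))
report = metrics_and_report(plan)
```

The sketch only conveys the data flow: a request becomes a plan, the plan is made executable via schema normalization, and the final report carries more than a scalar score (here, the selected metrics plus an evidence trail).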