One-Eval: An Agentic System for Automated and Traceable LLM Evaluation

Abstract

Reliable evaluation is essential for developing and deploying large language models, yet in practice it often requires substantial manual effort: practitioners must identify appropriate benchmarks, reproduce heterogeneous evaluation codebases, configure dataset schema mappings, and interpret aggregated metrics. To address these challenges, we present One-Eval, an agentic evaluation system that converts natural-language evaluation requests into executable, traceable, and customizable evaluation workflows. One-Eval integrates (i) NL2Bench for intent structuring and personalized benchmark planning, (ii) BenchResolve for benchmark resolution, automatic dataset acquisition, and schema normalization to ensure executability, and (iii) Metrics & Reporting for task-aware metric selection and decision-oriented reporting beyond scalar scores. The system further incorporates human-in-the-loop checkpoints for review, editing, and rollback, while preserving sample evidence trails for debugging and auditability. Experiments show that One-Eval can execute end-to-end evaluations from diverse natural-language requests with minimal user effort, supporting more efficient and reproducible evaluation in industrial settings. Our framework is publicly available at https://github.com/OpenDCAI/One-Eval.
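To make the schema normalization step concrete, the following is a minimal sketch of how heterogeneous benchmark records might be mapped onto one canonical schema before execution. It is not taken from the One-Eval codebase: the benchmark names, the `FIELD_MAPS` table, the `normalize_record` helper, and the canonical `prompt`/`reference` keys are all hypothetical and chosen only for illustration.

```python
from typing import Any

# Hypothetical per-benchmark field mappings (source field -> canonical field).
# These names are illustrative assumptions, not One-Eval's actual configuration.
FIELD_MAPS: dict[str, dict[str, str]] = {
    "benchmark_a": {"question": "prompt", "answer": "reference"},
    "benchmark_b": {"input_text": "prompt", "target": "reference"},
}


def normalize_record(benchmark: str, record: dict[str, Any]) -> dict[str, Any]:
    """Rename benchmark-specific fields to the canonical 'prompt'/'reference' keys."""
    mapping = FIELD_MAPS[benchmark]
    # Fail early if the record lacks a field the mapping expects.
    missing = [src for src in mapping if src not in record]
    if missing:
        raise KeyError(f"{benchmark}: missing source fields {missing}")
    return {dst: record[src] for src, dst in mapping.items()}


rows = [
    ("benchmark_a", {"question": "2 + 2 = ?", "answer": "4"}),
    ("benchmark_b", {"input_text": "Capital of France?", "target": "Paris"}),
]
normalized = [normalize_record(name, rec) for name, rec in rows]
```

Under these assumptions, every record exposes the same two keys regardless of its source benchmark, so a single evaluation loop can consume all of them; a missing source field fails early rather than silently producing an incomplete sample.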
