YourBench: Easy Custom Evaluation Sets for Everyone

Abstract

Evaluating large language models (LLMs) effectively remains a critical bottleneck: traditional static benchmarks suffer from saturation and contamination, while human evaluations are costly and slow. This hinders timely or domain-specific assessment, which is crucial for real-world applications. We introduce YourBench, a novel open-source framework that addresses these limitations by generating reliable, up-to-date, domain-tailored benchmarks dynamically and automatically, cheaply and without manual annotation, directly from user-provided documents. We demonstrate its efficacy by replicating 7 diverse MMLU subsets from minimal source text for under 15 USD in total inference cost, while perfectly preserving the relative model performance rankings observed on the original benchmark (Spearman Rho = 1). To ensure that YourBench generates data grounded in the provided input rather than in the models' posterior parametric knowledge, we also introduce Tempora-0325, a novel dataset of over 7K diverse documents published exclusively after March 2025. Our comprehensive analysis spans 26 SoTA models from 7 major families across varying scales (3-671B parameters) and validates the quality of the generated evaluations through rigorous algorithmic checks (e.g., citation grounding) and human assessments. We release the YourBench library, the Tempora-0325 dataset, 150k+ question-answer pairs based on Tempora, and all evaluation and inference traces to facilitate reproducible research and to let the community generate bespoke benchmarks on demand, fostering more relevant and trustworthy LLM evaluation.
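The ranking-preservation claim (Spearman Rho = 1) can be illustrated with a minimal sketch. It is not taken from the YourBench library: the function names `average_ranks` and `spearman_rho` and all accuracy values below are hypothetical, chosen only to show what a rank correlation of 1 between an original benchmark and a regenerated one means.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences of length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical accuracies of four models (illustrative numbers only).
original_scores = [0.82, 0.74, 0.61, 0.55]   # on the original benchmark
generated_scores = [0.64, 0.58, 0.41, 0.37]  # on the regenerated benchmark

rho = spearman_rho(original_scores, generated_scores)
print(f"Spearman rho = {rho:.2f}")
```

In this sketch the absolute scores differ between the two benchmarks, but the models fall in the same order on both, which is all that a rank correlation of 1 asserts.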
