YourBench: Easy Custom Evaluation Sets for Everyone

April 2, 2025
Authors: Sumuk Shashidhar, Clémentine Fourrier, Alina Lozovskaya, Thomas Wolf, Gokhan Tur, Dilek Hakkani-Tür
cs.AI

Abstract

Evaluating large language models (LLMs) effectively remains a critical bottleneck, as traditional static benchmarks suffer from saturation and contamination, while human evaluations are costly and slow. This hinders timely or domain-specific assessment, crucial for real-world applications. We introduce YourBench, a novel, open-source framework that addresses these limitations by enabling dynamic, automated generation of reliable, up-to-date, and domain-tailored benchmarks cheaply and without manual annotation, directly from user-provided documents. We demonstrate its efficacy by replicating 7 diverse MMLU subsets using minimal source text, achieving this for under 15 USD in total inference costs while perfectly preserving the relative model performance rankings (Spearman Rho = 1) observed on the original benchmark. To ensure that YourBench generates data grounded in the provided input instead of relying on posterior parametric knowledge in models, we also introduce Tempora-0325, a novel dataset of over 7K diverse documents published exclusively after March 2025. Our comprehensive analysis spans 26 SoTA models from 7 major families across varying scales (3-671B parameters) to validate the quality of generated evaluations through rigorous algorithmic checks (e.g., citation grounding) and human assessments. We release the YourBench library, the Tempora-0325 dataset, 150k+ question-answer pairs based on Tempora, and all evaluation and inference traces to facilitate reproducible research and empower the community to generate bespoke benchmarks on demand, fostering more relevant and trustworthy LLM evaluation.
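
As a rough illustration of the "citation grounding" check mentioned in the abstract, the sketch below fuzzy-matches each citation attached to a generated question-answer pair against the source document and flags pairs whose citations cannot be located. This is an assumption-laden sketch, not YourBench's actual implementation: the qa_pair structure with a "citations" field and the 0.85 similarity threshold are invented for the example.

# Illustrative sketch only: a simple citation-grounding check of the kind the
# paper describes, NOT YourBench's actual implementation. The "citations" field
# and the 0.85 similarity threshold are assumptions made for this example.
from difflib import SequenceMatcher

def best_match_ratio(citation: str, document: str) -> float:
    # Slide a window of the citation's length over the document and keep the
    # best fuzzy-match ratio; 1.0 means the citation appears verbatim.
    n = len(citation)
    if n == 0 or n > len(document):
        return 0.0
    best = 0.0
    for start in range(len(document) - n + 1):
        ratio = SequenceMatcher(None, citation, document[start:start + n]).ratio()
        best = max(best, ratio)
        if best == 1.0:
            break
    return best

def is_grounded(qa_pair: dict, document: str, threshold: float = 0.85) -> bool:
    # A generated QA pair counts as grounded only if it carries at least one
    # citation and every citation can be located (approximately) in the source.
    citations = qa_pair.get("citations", [])
    return bool(citations) and all(
        best_match_ratio(c, document) >= threshold for c in citations
    )

# Hypothetical usage:
doc = "The Tempora-0325 corpus contains over 7,000 documents published after March 2025."
qa = {
    "question": "How many documents does Tempora-0325 contain?",
    "answer": "Over 7,000.",
    "citations": ["contains over 7,000 documents published after March 2025"],
}
print(is_grounded(qa, doc))  # True: the cited span is found verbatim in the source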
