OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value
December 16, 2025
Authors: Mengzhang Cai, Xin Gao, Yu Li, Honglin Lin, Zheng Liu, Zhuoshi Pan, Qizhi Pei, Xiaoran Shang, Mengyuan Sun, Zinan Tang, Xiaoyang Wang, Zhanping Zhong, Yun Zhu, Dahua Lin, Conghui He, Lijun Wu
cs.AI
Abstract
The rapid evolution of Large Language Models (LLMs) is predicated on the quality and diversity of post-training datasets. However, a critical dichotomy persists: while models are rigorously benchmarked, the data fueling them remains a black box, characterized by opaque composition, uncertain provenance, and a lack of systematic evaluation. This opacity hinders reproducibility and obscures the causal link between data characteristics and model behaviors. To bridge this gap, we introduce OpenDataArena (ODA), a holistic and open platform designed to benchmark the intrinsic value of post-training data. ODA establishes a comprehensive ecosystem comprising four key pillars: (i) a unified training-evaluation pipeline that ensures fair, open comparisons across diverse models (e.g., Llama, Qwen) and domains; (ii) a multi-dimensional scoring framework that profiles data quality along dozens of distinct axes; (iii) an interactive data lineage explorer to visualize dataset genealogy and dissect component sources; and (iv) a fully open-source toolkit for training, evaluation, and scoring to foster data research. Extensive experiments on ODA, covering more than 120 training datasets across multiple domains evaluated on 22 benchmarks and validated by over 600 training runs and 40 million processed data points, reveal non-trivial insights. Our analysis uncovers the inherent trade-offs between data complexity and task performance, identifies redundancy in popular benchmarks through lineage tracing, and maps the genealogical relationships across datasets. We release all results, tools, and configurations to democratize access to high-quality data evaluation. Rather than merely expanding a leaderboard, ODA envisions a shift from trial-and-error data curation to a principled science of Data-Centric AI, paving the way for rigorous studies of data mixing laws and the strategic composition of foundation models.
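To make the "multi-dimensional scoring" pillar concrete, the sketch below shows, in plain Python, how per-example scores along a few quality axes might be aggregated into a dataset-level profile. It is purely illustrative and not the ODA toolkit API: the dataset fields (`instruction`, `response`), the chosen axes, and all function names are assumptions made for this example.

```python
# Hypothetical sketch (not the ODA toolkit): profile a post-training dataset
# along a few illustrative quality axes and aggregate per-example scores.
from statistics import mean


def lexical_diversity(text: str) -> float:
    """Ratio of unique tokens to total tokens (0.0 for empty text)."""
    tokens = text.split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def score_example(example: dict) -> dict:
    """Score one (instruction, response) pair along simple axes."""
    instruction, response = example["instruction"], example["response"]
    return {
        "instruction_length": len(instruction.split()),
        "response_length": len(response.split()),
        "response_diversity": lexical_diversity(response),
    }


def profile_dataset(dataset: list[dict]) -> dict:
    """Average per-example scores into a dataset-level profile per axis."""
    per_example = [score_example(ex) for ex in dataset]
    axes = per_example[0].keys()
    return {axis: mean(scores[axis] for scores in per_example) for axis in axes}


if __name__ == "__main__":
    toy_dataset = [
        {"instruction": "Explain overfitting.",
         "response": "Overfitting happens when a model memorizes noise."},
        {"instruction": "Sum 2 and 3.",
         "response": "2 plus 3 equals 5."},
    ]
    print(profile_dataset(toy_dataset))
```

In the actual platform, such axis-wise profiles are what allow datasets to be compared on data properties directly, rather than only through downstream benchmark scores.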