OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value
December 16, 2025
Authors: Mengzhang Cai, Xin Gao, Yu Li, Honglin Lin, Zheng Liu, Zhuoshi Pan, Qizhi Pei, Xiaoran Shang, Mengyuan Sun, Zinan Tang, Xiaoyang Wang, Zhanping Zhong, Yun Zhu, Dahua Lin, Conghui He, Lijun Wu
cs.AI
Abstract
The rapid evolution of Large Language Models (LLMs) is predicated on the quality and diversity of post-training datasets. However, a critical dichotomy persists: while models are rigorously benchmarked, the data fueling them remains a black box--characterized by opaque composition, uncertain provenance, and a lack of systematic evaluation. This opacity hinders reproducibility and obscures the causal link between data characteristics and model behaviors. To bridge this gap, we introduce OpenDataArena (ODA), a holistic and open platform designed to benchmark the intrinsic value of post-training data. ODA establishes a comprehensive ecosystem comprising four key pillars: (i) a unified training-evaluation pipeline that ensures fair, open comparisons across diverse models (e.g., Llama, Qwen) and domains; (ii) a multi-dimensional scoring framework that profiles data quality along tens of distinct axes; (iii) an interactive data lineage explorer to visualize dataset genealogy and dissect component sources; and (iv) a fully open-source toolkit for training, evaluation, and scoring to foster data research. Extensive experiments on ODA--covering over 120 training datasets across multiple domains on 22 benchmarks, validated by more than 600 training runs and 40 million processed data points--reveal non-trivial insights. Our analysis uncovers the inherent trade-offs between data complexity and task performance, identifies redundancy in popular benchmarks through lineage tracing, and maps the genealogical relationships across datasets. We release all results, tools, and configurations to democratize access to high-quality data evaluation. Rather than merely expanding a leaderboard, ODA envisions a shift from trial-and-error data curation to a principled science of Data-Centric AI, paving the way for rigorous studies on data mixing laws and the strategic composition of foundation models.
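To make the second pillar more concrete, the Python sketch below shows one way a per-sample, multi-dimensional quality record could be represented. It is a minimal, hypothetical illustration: the SampleScore type, the dimension names, the [0, 1] normalization, and the unweighted-mean aggregation are assumptions for exposition, not ODA's published schema or weighting.

```python
# Hypothetical sketch of a per-sample multi-dimensional quality record,
# in the spirit of ODA's scoring framework. Dimension names and the
# aggregation rule are illustrative assumptions.
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class SampleScore:
    sample_id: str
    # Each axis maps one quality dimension (e.g., difficulty, diversity,
    # response quality) to a normalized score in [0, 1].
    axes: dict[str, float] = field(default_factory=dict)

    def aggregate(self) -> float:
        """Unweighted mean over all axes; a real framework would likely
        calibrate or weight dimensions per task and domain."""
        return mean(self.axes.values()) if self.axes else 0.0


record = SampleScore(
    sample_id="example-000123",
    axes={"difficulty": 0.72, "diversity": 0.55, "response_quality": 0.88},
)
print(f"{record.sample_id}: overall = {record.aggregate():.2f}")
```

Keeping scores per axis rather than collapsing them at ingestion time matters here: the trade-off the abstract reports between data complexity and task performance is only observable if dimensions such as difficulty remain separable from an overall quality score.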