Datasheets Aren't Enough: DataRubrics for Automated Quality Metrics and Accountability
June 2, 2025
作者: Genta Indra Winata, David Anugraha, Emmy Liu, Alham Fikri Aji, Shou-Yi Hung, Aditya Parashar, Patrick Amadeus Irawan, Ruochen Zhang, Zheng-Xin Yong, Jan Christian Blaise Cruz, Niklas Muennighoff, Seungone Kim, Hanyang Zhao, Sudipta Kar, Kezia Erina Suryoraharjo, M. Farid Adilazuarda, En-Shiun Annie Lee, Ayu Purwarianti, Derry Tanti Wijaya, Monojit Choudhury
cs.AI
Abstract
High-quality datasets are fundamental to training and evaluating machine learning models, yet their creation, especially with accurate human annotations, remains a significant challenge. Many dataset paper submissions lack originality, diversity, or rigorous quality control, and these shortcomings are often overlooked during peer review. Submissions also frequently omit essential details about dataset construction and properties. While existing tools such as datasheets aim to promote transparency, they are largely descriptive and do not provide standardized, measurable methods for evaluating data quality. Similarly, metadata requirements at conferences promote accountability but are inconsistently enforced. To address these limitations, this position paper advocates for the integration of systematic, rubric-based evaluation metrics into the dataset review process, particularly as submission volumes continue to grow. We also explore scalable, cost-effective methods for synthetic data generation, including dedicated tools and LLM-as-a-judge approaches, to support more efficient evaluation. As a call to action, we introduce DataRubrics, a structured framework for assessing the quality of both human- and model-generated datasets. Leveraging recent advances in LLM-based evaluation, DataRubrics offers a reproducible, scalable, and actionable solution for dataset quality assessment, enabling both authors and reviewers to uphold higher standards in data-centric research. We also release code to support reproducibility of LLM-based evaluations at https://github.com/datarubrics/datarubrics.
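
As a rough illustration of the rubric-based, LLM-as-a-judge evaluation the abstract describes, the minimal Python sketch below scores a dataset description against a small rubric. The rubric dimensions, prompt wording, and model name are illustrative assumptions only and are not taken from the released DataRubrics code.

    # Minimal sketch of rubric-based LLM-as-a-judge dataset scoring.
    # Rubric dimensions, prompt, and model name are assumptions for illustration.
    import json
    from openai import OpenAI  # pip install openai; requires OPENAI_API_KEY

    RUBRIC = {
        "originality": "Does the dataset offer a novel task, domain, or collection method?",
        "documentation": "Are construction steps, sources, and licenses clearly reported?",
        "quality_control": "Is annotation quality measured (e.g., agreement, audits)?",
    }

    def build_prompt(dataset_description: str) -> str:
        # Ask for one 1-5 score plus a short rationale per rubric criterion, as JSON.
        criteria = "\n".join(f"- {name}: {question}" for name, question in RUBRIC.items())
        return (
            "You are reviewing a dataset paper. Score each criterion from 1 (poor) to 5 "
            "(excellent) and justify briefly. Reply as a JSON object mapping each "
            "criterion name to an object with 'score' and 'rationale'.\n\n"
            f"Criteria:\n{criteria}\n\nDataset description:\n{dataset_description}"
        )

    def judge(dataset_description: str, model: str = "gpt-4o-mini") -> dict:
        client = OpenAI()
        response = client.chat.completions.create(
            model=model,
            temperature=0,
            response_format={"type": "json_object"},
            messages=[{"role": "user", "content": build_prompt(dataset_description)}],
        )
        return json.loads(response.choices[0].message.content)

    if __name__ == "__main__":
        print(judge("A crowd-sourced QA dataset with 10k items and double annotation."))

A structured rubric like this is what makes such judgments reproducible and comparable across submissions; the actual criteria and scoring scheme used by DataRubrics are defined in the paper and the repository above.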