Datasheets Aren't Enough: DataRubrics for Automated Quality Metrics and Accountability

June 2, 2025
Authors: Genta Indra Winata, David Anugraha, Emmy Liu, Alham Fikri Aji, Shou-Yi Hung, Aditya Parashar, Patrick Amadeus Irawan, Ruochen Zhang, Zheng-Xin Yong, Jan Christian Blaise Cruz, Niklas Muennighoff, Seungone Kim, Hanyang Zhao, Sudipta Kar, Kezia Erina Suryoraharjo, M. Farid Adilazuarda, En-Shiun Annie Lee, Ayu Purwarianti, Derry Tanti Wijaya, Monojit Choudhury
cs.AI

Abstract

High-quality datasets are fundamental to training and evaluating machine learning models, yet their creation, especially with accurate human annotations, remains a significant challenge. Many dataset paper submissions lack originality, diversity, or rigorous quality control, and these shortcomings are often overlooked during peer review. Submissions also frequently omit essential details about dataset construction and properties. While existing tools such as datasheets aim to promote transparency, they are largely descriptive and do not provide standardized, measurable methods for evaluating data quality. Similarly, metadata requirements at conferences promote accountability but are inconsistently enforced. To address these limitations, this position paper advocates for the integration of systematic, rubric-based evaluation metrics into the dataset review process, particularly as submission volumes continue to grow. We also explore scalable, cost-effective methods for synthetic data generation, including dedicated tools and LLM-as-a-judge approaches, to support more efficient evaluation. As a call to action, we introduce DataRubrics, a structured framework for assessing the quality of both human- and model-generated datasets. Leveraging recent advances in LLM-based evaluation, DataRubrics offers a reproducible, scalable, and actionable solution for dataset quality assessment, enabling both authors and reviewers to uphold higher standards in data-centric research. We also release code to support reproducibility of LLM-based evaluations at https://github.com/datarubrics/datarubrics.
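
The abstract describes rubric-based, LLM-as-a-judge evaluation of dataset quality. The sketch below illustrates what such a scoring loop might look like; the rubric criteria, prompt wording, model name, and the score_dataset_paper helper are illustrative assumptions, not the released DataRubrics implementation (see the linked repository for that).

```python
# Minimal sketch of rubric-based LLM-as-a-judge scoring for a dataset paper.
# NOTE: illustrative only; this is not the DataRubrics implementation.
import json
from openai import OpenAI  # assumes the OpenAI Python SDK (>=1.0) is installed

# Hypothetical rubric: each criterion describes what a strong dataset paper shows.
RUBRIC = {
    "originality": "Covers a task, domain, or language not well served by existing resources.",
    "diversity": "Samples span varied sources, topics, and demographics rather than a narrow slice.",
    "quality_control": "Annotation guidelines, inter-annotator agreement, and error analysis are reported.",
    "documentation": "Construction steps, licensing, and intended uses are described in reproducible detail.",
}

PROMPT_TEMPLATE = """You are reviewing a dataset paper. Score each criterion from 1 (poor) to 5 (excellent)
and justify each score in one sentence. Respond with JSON only, e.g.
{{"originality": {{"score": 3, "rationale": "..."}}, ...}}

Rubric:
{rubric}

Paper excerpt:
{excerpt}
"""


def score_dataset_paper(excerpt: str, model: str = "gpt-4o-mini") -> dict:
    """Ask an LLM judge to grade a paper excerpt against the rubric."""
    rubric_text = "\n".join(f"- {name}: {desc}" for name, desc in RUBRIC.items())
    prompt = PROMPT_TEMPLATE.format(rubric=rubric_text, excerpt=excerpt)
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # low temperature helps reproducibility of judgments
    )
    content = response.choices[0].message.content.strip()
    if content.startswith("```"):
        # Strip optional markdown fences some models wrap around JSON output.
        content = content.strip("`").removeprefix("json").strip()
    return json.loads(content)


if __name__ == "__main__":
    excerpt = "We collect 10k question-answer pairs by scraping forums; two authors label a 1% sample."
    print(json.dumps(score_dataset_paper(excerpt), indent=2))
```

In practice, the criteria would likely be tailored to the paper type, and scores could be aggregated over multiple judge calls to reduce variance.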
