Who Annotates in NLP? A Large-scale Assessment of Human Annotation Reporting between 2018 and 2025
June 1, 2026
Authors: Maria Kunilovskaya, Gagan Bhatia, Lisa Sophie Albertelli, Yanran Chen, Christian Greisinger, Lotta Kiefer, Christoph Leiter, Subhadeep Roy, Tewodros Achamaleh, Muhammad Arslan Manzoor, Sebastian Pohl, Yufang Hou, Steffen Eger
cs.AI
Abstract
Human annotation is the empirical foundation of much NLP research, from dataset construction to model evaluation, but papers often leave unclear who produced the annotations and how the annotation process was controlled. We provide the first large-scale, task-level audit of human annotation reporting across major NLP venues, asking which annotation details are documented, which are missing, and how reporting varies across time, topic, venue, and intended use of human judgment. We introduce a unified taxonomy of annotation-reporting practices and validate an LLM-assisted extraction pipeline against Annotated-gold, a human-adjudicated gold standard of 41 papers and 72 annotation tasks, where the best model reaches human-comparable agreement with adjudicated labels, with Krippendorff's alpha of 0.606 versus 0.585 for human-human agreement. Using this pipeline, we construct Annotated-llm, a dataset covering ACL-venue papers from 2018-2025, with 2,667 extracted annotation tasks from 1,603 papers, and find that papers frequently report operational details such as recruitment strategies, annotator expertise, and annotation volume, but often omit details needed to assess annotation validity, including training, language proficiency, compensation, socio-demographics, adjudication, and agreement values, especially in model-evaluation studies. Our results show that annotation reporting in NLP has improved over time but remains uneven, and they establish a scalable framework and bare-minimum reporting recommendations for making human annotation more reliable, reproducible, and interpretable.
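For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of Krippendorff's alpha for nominal labels. The abstract does not state which variant or distance metric the authors used, so the nominal form, the function name `krippendorff_alpha_nominal`, and the toy labels are all assumptions made for illustration only.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels (illustrative sketch).

    units: list of lists; each inner list holds the labels that different
    annotators gave to one item. Missing annotations are simply omitted.
    """
    # Coincidence matrix: every ordered pair of labels from two different
    # annotators on the same item contributes 1 / (m - 1), where m is the
    # number of labels on that item.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # an item with a single label cannot be paired
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per category and overall number of pairable labels.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more labels")

    # Observed and expected disagreement (both scaled by n, which cancels).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one category occurs")

    return 1.0 - observed / expected


# Hypothetical toy data: four items, two annotators each.
toy_units = [["a", "a"], ["a", "b"], ["b", "b"], ["b", "b"]]
toy_alpha = krippendorff_alpha_nominal(toy_units)
```

An alpha of 1 indicates perfect agreement and 0 indicates agreement no better than chance, which is the scale on which the reported 0.606 (model versus adjudicated labels) and 0.585 (human versus human) should be read.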