Who Annotates in Natural Language Processing? A Large-Scale Evaluation of Human Annotation Reporting from 2018 to 2025

Abstract

Human annotation is the empirical foundation of much NLP research, from dataset construction to model evaluation, yet papers often leave unclear who produced the annotations and how the annotation process was controlled. We provide the first large-scale, task-level audit of human annotation reporting across major NLP venues, asking which annotation details are documented, which are missing, and how reporting varies across time, topic, venue, and intended use of human judgment. We introduce a unified taxonomy of annotation-reporting practices and validate an LLM-assisted extraction pipeline against Annotated-gold, a human-adjudicated gold standard of 41 papers and 72 annotation tasks. On this gold standard, the best model reaches human-comparable agreement with the adjudicated labels: a Krippendorff's alpha of 0.606, versus 0.585 for human-human agreement. Using this pipeline, we construct Annotated-llm, a dataset covering ACL-venue papers from 2018 to 2025, with 2,667 annotation tasks extracted from 1,603 papers. We find that papers frequently report operational details such as recruitment strategies, annotator expertise, and annotation volume, but often omit details needed to assess annotation validity, including training, language proficiency, compensation, socio-demographics, adjudication, and agreement values. These omissions are especially common in model-evaluation studies. Our results show that annotation reporting in NLP has improved over time but remains uneven, and they establish a scalable framework and minimum reporting recommendations for making human annotation more reliable, reproducible, and interpretable.
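For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of Krippendorff's alpha for nominal labels. It is an illustration under stated assumptions, not the pipeline used in this work: the abstract does not say which distance metric or implementation was used, so the nominal (match or mismatch) metric is assumed here, and the function name `krippendorff_alpha_nominal` and the toy labels are hypothetical.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha with the nominal distance (assumed metric).

    `units` is a list of items; each item is a list holding one label per
    annotator, with None where an annotator gave no label.
    """
    # Coincidence matrix: every ordered pair of labels from different
    # annotators on the same item contributes 1 / (m - 1), where m is the
    # number of labels that item received.
    coincidences = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue  # items with fewer than two labels are not pairable
        for i, j in permutations(range(m), 2):
            coincidences[(values[i], values[j])] += 1.0 / (m - 1)

    # Marginal totals per category and the total number of pairable labels.
    totals = Counter()
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least two pairable labels")

    # Observed and expected disagreement, both scaled by n so that the
    # ratio equals D_o / D_e.
    observed = sum(w for (c, k), w in coincidences.items() if c != k)
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one category occurs")
    return 1.0 - observed / expected


# Hypothetical toy data: two annotators, four items, labels "a" or "b".
toy_units = [["a", "a"], ["a", "b"], ["b", "b"], ["b", "b"]]
toy_alpha = krippendorff_alpha_nominal(toy_units)
```

An alpha of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement. The reported comparison (0.606 for the best model against adjudicated labels, 0.585 between humans) should be read on that scale.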