Who Performs Annotation in NLP? A Large-Scale Evaluation of Human Annotation Reporting from 2018 to 2025

Abstract

Human annotation is the empirical foundation of much NLP research, from dataset construction to model evaluation, yet papers often leave unclear who produced the annotations and how the annotation process was controlled. We provide the first large-scale, task-level audit of human annotation reporting across major NLP venues, asking which annotation details are documented, which are missing, and how reporting varies across time, topic, venue, and intended use of human judgment. We introduce a unified taxonomy of annotation-reporting practices and validate an LLM-assisted extraction pipeline against Annotated-gold, a human-adjudicated gold standard of 41 papers and 72 annotation tasks. The best model reaches human-comparable agreement with the adjudicated labels: a Krippendorff's alpha of 0.606, versus 0.585 for human-human agreement. Using this pipeline, we construct Annotated-llm, a dataset covering ACL-venue papers from 2018 to 2025, with 2,667 annotation tasks extracted from 1,603 papers. We find that papers frequently report operational details such as recruitment strategies, annotator expertise, and annotation volume, but often omit details needed to assess annotation validity, including training, language proficiency, compensation, socio-demographics, adjudication, and agreement values, especially in model-evaluation studies. Our results show that annotation reporting in NLP has improved over time but remains uneven, and they establish a scalable framework and bare-minimum reporting recommendations for making human annotation more reliable, reproducible, and interpretable.
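For readers unfamiliar with the agreement statistic cited above, the following is a minimal sketch of Krippendorff's alpha computed from a coincidence matrix. It assumes nominal (categorical) labels, which the abstract does not specify, and the function name `krippendorff_alpha_nominal` and the toy data are hypothetical illustrations rather than the pipeline used in this work.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels (assumed metric, see lead-in).

    `units` is a list of items; each item is a list of the labels that
    different annotators assigned to it. Use None for a missing label.
    """
    # Coincidence matrix: every ordered pair of labels within an item,
    # weighted by 1 / (m - 1), where m is the number of labels on that item.
    coincidences = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue  # items with fewer than two labels cannot be paired
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per category and overall number of pairable labels.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more labels")

    # Observed and expected disagreement under the nominal distance
    # (0 if the two labels match, 1 otherwise).
    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when only one category occurs")
    return 1.0 - observed / expected


# Toy data: four items, two annotators each, one disagreement.
toy_units = [["a", "a"], ["a", "b"], ["b", "b"], ["a", "a"]]
alpha = krippendorff_alpha_nominal(toy_units)
```

An alpha of 1 indicates perfect agreement and 0 indicates agreement no better than chance, so the reported values of 0.606 and 0.585 both sit between those two reference points.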