Who Annotates in NLP? A Large-Scale Assessment of Human Annotation Reporting from 2018 to 2025

Abstract

Human annotation is the empirical foundation of much NLP research, from dataset construction to model evaluation, yet papers often leave unclear who produced the annotations and how the annotation process was controlled. We present the first large-scale, task-level audit of human annotation reporting across major NLP venues, asking which annotation details are documented, which are missing, and how reporting varies over time and across topic, venue, and intended use of human judgment. We introduce a unified taxonomy of annotation-reporting practices and validate an LLM-assisted extraction pipeline against Annotated-gold, a human-adjudicated gold standard of 41 papers and 72 annotation tasks. On this gold standard, the best model reaches human-comparable agreement with the adjudicated labels: a Krippendorff's alpha of 0.606, versus 0.585 for human-human agreement. Using this pipeline, we construct Annotated-llm, a dataset of 2,667 annotation tasks extracted from 1,603 ACL-venue papers published between 2018 and 2025. We find that papers frequently report operational details such as recruitment strategy, annotator expertise, and annotation volume, but often omit details needed to assess annotation validity, including training, language proficiency, compensation, socio-demographics, adjudication, and agreement values; these omissions are especially common in model-evaluation studies. Our results show that annotation reporting in NLP has improved over time but remains uneven. We contribute a scalable framework and a set of minimum reporting recommendations for making human annotation more reliable, reproducible, and interpretable.