Can Large Language Models Capture Human Annotator Disagreements?

Abstract

Human annotation variation (i.e., annotation disagreement) is common in NLP and often reflects important information such as task subjectivity and sample ambiguity. Large Language Models (LLMs) are increasingly used for automatic annotation to reduce human effort, but their evaluation often focuses on predicting the majority-voted "ground truth" labels. It remains unclear whether these models also capture informative human annotation variation. Our work addresses this gap by extensively evaluating LLMs' ability to predict annotation disagreements without access to repeated human labels. Our results show that LLMs struggle to model disagreements, a weakness that majority-label-based evaluations can overlook. Notably, while RLVR-style reasoning (reinforcement learning with verifiable rewards) generally boosts LLM performance, it degrades performance in disagreement prediction. Our findings highlight the critical need to evaluate and improve LLM annotators in disagreement modeling. Code and data are available at https://github.com/EdisonNi-hku/Disagreement_Prediction.

