Can Large Language Models Capture Human Annotator Disagreements?
June 24, 2025
Authors: Jingwei Ni, Yu Fan, Vilém Zouhar, Donya Rooein, Alexander Hoyle, Mrinmaya Sachan, Markus Leippold, Dirk Hovy, Elliott Ash
cs.AI
Abstract
Human annotation variation (i.e., annotation disagreements) is common in NLP
and often reflects important information such as task subjectivity and sample
ambiguity. While Large Language Models (LLMs) are increasingly used for
automatic annotation to reduce human effort, their evaluation often focuses on
predicting the majority-voted "ground truth" labels. It is still unclear,
however, whether these models also capture informative human annotation
variation. Our work addresses this gap by extensively evaluating LLMs' ability
to predict annotation disagreements without access to repeated human labels.
Our results show that LLMs struggle to model disagreements, a limitation that
majority-label-based evaluations can overlook. Notably, while RLVR-style
(reinforcement learning with verifiable rewards) reasoning generally boosts LLM
performance, it degrades performance in disagreement prediction. Our findings
highlight the critical need for evaluating and improving LLM annotators in
disagreement modeling. Code and data are available at
https://github.com/EdisonNi-hku/Disagreement_Prediction.
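The abstract does not spell out how annotator disagreement is quantified or scored. As a rough, hypothetical sketch (not the paper's actual protocol), the snippet below treats the normalized entropy of repeated human labels as a per-sample disagreement score and compares it against a value an LLM might be prompted to predict without seeing those labels; the label set, the entropy-based measure, and the hard-coded LLM prediction are all illustrative assumptions.

```python
from collections import Counter
from math import log

def soft_label(annotations):
    """Empirical label distribution over repeated human annotations of one sample."""
    counts = Counter(annotations)
    total = sum(counts.values())
    return {label: c / total for label, c in counts.items()}

def normalized_entropy(dist):
    """Entropy of a label distribution, scaled to [0, 1]; 0 means full agreement."""
    k = len(dist)
    if k <= 1:
        return 0.0
    h = -sum(p * log(p) for p in dist.values() if p > 0)
    return h / log(k)

# Hypothetical example: five annotators label one sample for offensiveness.
human_labels = ["offensive", "offensive", "not_offensive", "offensive", "not_offensive"]
human_disagreement = normalized_entropy(soft_label(human_labels))

# An LLM annotator would be asked to predict this disagreement score directly,
# without access to the repeated human labels; here the value is hard-coded.
llm_predicted_disagreement = 0.4

print(f"human disagreement: {human_disagreement:.2f}")
print(f"LLM prediction:     {llm_predicted_disagreement:.2f}")
print(f"absolute error:     {abs(human_disagreement - llm_predicted_disagreement):.2f}")
```

Across a dataset, aggregating such per-sample errors, or correlating predicted with observed disagreement, would summarize how well an LLM annotator captures human annotation variation beyond matching the majority label.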