ChatPaper.ai


Prompt Candidates, then Distill: A Teacher-Student Framework for LLM-driven Data Annotation

June 4, 2025
Authors: Mingxuan Xia, Haobo Wang, Yixuan Li, Zewei Yu, Jindong Wang, Junbo Zhao, Runze Wu
cs.AI

Abstract

Recently, Large Language Models (LLMs) have demonstrated significant potential for data annotation, markedly reducing the labor costs associated with downstream applications. However, existing methods mostly adopt an aggressive strategy, prompting the LLM to determine a single gold label for each unlabeled sample. Due to the inherent uncertainty within LLMs, they often produce incorrect labels for difficult samples, severely compromising data quality for downstream applications. Motivated by ambiguity aversion in human behavior, we propose a novel candidate annotation paradigm in which large language models are encouraged to output all possible labels when facing uncertainty. To ensure that unique labels are provided for downstream tasks, we develop CanDist, a teacher-student framework that distills candidate annotations with a Small Language Model (SLM). We further provide a rigorous justification demonstrating that distilling candidate annotations from the teacher LLM offers superior theoretical guarantees compared to directly using single annotations. Extensive experiments across six text classification tasks validate the effectiveness of our proposed method. The source code is available at https://github.com/MingxuanXia/CanDist.
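To make the candidate-annotation idea concrete, here is a minimal, self-contained sketch (not the authors' CanDist implementation — the annotator, thresholding rule, and student loss below are illustrative assumptions): an uncertain annotator emits a *set* of plausible labels per sample, and a small student classifier is trained so that its probability mass concentrates inside each candidate set, then commits to one label for downstream use.

```python
# Hypothetical sketch of candidate annotation + distillation to a student.
# None of these names come from the paper; the "annotator" is simulated.
import numpy as np

rng = np.random.default_rng(0)

def candidate_annotate(logit_row, threshold=0.2):
    """Stand-in for an uncertain LLM annotator: every label whose
    softmax probability exceeds `threshold` joins the candidate set."""
    p = np.exp(logit_row - logit_row.max())
    p /= p.sum()
    cand = set(np.flatnonzero(p > threshold))
    return cand or {int(p.argmax())}  # never return an empty set

# Toy data: 3 classes, 5-dimensional linear features.
X = rng.normal(size=(200, 5))
W_true = rng.normal(size=(5, 3))
candidates = [candidate_annotate(row) for row in X @ W_true]

# Student: softmax classifier trained with cross-entropy toward a
# uniform target over each sample's candidate labels (partial-label style).
W = np.zeros((5, 3))
for _ in range(300):
    logits = X @ W
    P = np.exp(logits - logits.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    T = np.zeros_like(P)
    for i, cand in enumerate(candidates):
        T[i, list(cand)] = 1.0 / len(cand)
    W -= 0.1 * X.T @ (P - T) / len(X)  # gradient of mean cross-entropy

# The student resolves each candidate set to a single label downstream.
pred = (X @ W).argmax(axis=1)
inside = np.mean([pred[i] in candidates[i] for i in range(len(X))])
print(f"fraction of student predictions inside candidate sets: {inside:.2f}")
```

The key design point this sketch mirrors is the division of labor: the teacher is allowed to stay ambiguous on hard samples, and disambiguation is deferred to the student, which sees the whole dataset and can smooth over individual noisy sets.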