Prompt Candidates, then Distill: A Teacher-Student Framework for LLM-driven Data Annotation

June 4, 2025
作者: Mingxuan Xia, Haobo Wang, Yixuan Li, Zewei Yu, Jindong Wang, Junbo Zhao, Runze Wu
cs.AI

Abstract

Recently, Large Language Models (LLMs) have demonstrated significant potential for data annotation, markedly reducing the labor costs associated with downstream applications. However, existing methods mostly adopt an aggressive strategy, prompting the LLM to determine a single gold label for each unlabeled sample. Due to the inherent uncertainty within LLMs, they often produce incorrect labels for difficult samples, severely compromising data quality for downstream applications. Motivated by ambiguity aversion in human behavior, we propose a novel candidate annotation paradigm in which large language models are encouraged to output all possible labels when facing uncertainty. To ensure unique labels are provided for downstream tasks, we develop a teacher-student framework, CanDist, that distills candidate annotations with a Small Language Model (SLM). We further provide a rigorous justification demonstrating that distilling candidate labels from the teacher LLM offers superior theoretical guarantees compared to directly using single annotations. Extensive experiments across six text classification tasks validate the effectiveness of our proposed method. The source code is available at https://github.com/MingxuanXia/CanDist.
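
The candidate-annotation idea can be illustrated with a short sketch. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: `query_llm`, `slm_probs`, and the example label space are hypothetical placeholders, and the max-probability selection in the student step merely stands in for whatever distillation objective CanDist actually uses.

```python
# Minimal sketch of candidate annotation followed by SLM-based label resolution.
# query_llm and slm_probs are hypothetical callables, not part of the CanDist codebase.

from typing import Callable, Dict, List

LABELS = ["positive", "negative", "neutral"]  # example label space (assumption)


def candidate_annotate(text: str, query_llm: Callable[[str], str]) -> List[str]:
    """Ask the teacher LLM for every plausible label instead of forcing a single one."""
    prompt = (
        f"Classify the text into one of {LABELS}. "
        "If you are uncertain, list ALL labels that could apply, comma-separated.\n"
        f"Text: {text}\nLabels:"
    )
    response = query_llm(prompt)  # any LLM API call returning a string
    candidates = [l.strip() for l in response.split(",") if l.strip() in LABELS]
    return candidates or LABELS   # fall back to full ambiguity if parsing fails


def distill_unique_labels(
    texts: List[str],
    candidate_sets: List[List[str]],
    slm_probs: Callable[[str], Dict[str, float]],
) -> List[str]:
    """Student SLM resolves each candidate set to a single label by picking the
    candidate it assigns the highest probability (a simple heuristic stand-in)."""
    resolved = []
    for text, cands in zip(texts, candidate_sets):
        probs = slm_probs(text)  # dict: label -> probability from the student SLM
        resolved.append(max(cands, key=lambda c: probs.get(c, 0.0)))
    return resolved
```

The intent of the two-stage design, as described in the abstract, is that the teacher's uncertainty is preserved as a candidate set rather than collapsed into a possibly wrong single label, and the student then resolves that set into one label for downstream training.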