Prompt Candidates, then Distill: A Teacher-Student Framework for LLM-driven Data Annotation

Abstract

Recently, Large Language Models (LLMs) have demonstrated significant potential for data annotation, markedly reducing the labor costs associated with downstream applications. However, most existing methods adopt an aggressive strategy, prompting the LLM to determine a single gold label for each unlabeled sample. Because of the inherent uncertainty within LLMs, they often produce incorrect labels for difficult samples, severely compromising data quality for downstream applications. Motivated by ambiguity aversion in human behavior, we propose a novel candidate annotation paradigm in which LLMs are encouraged to output all possible labels when they are uncertain. To ensure that unique labels are provided for downstream tasks, we develop a teacher-student framework, CanDist, that distills candidate annotations with a Small Language Model (SLM). We further provide a rigorous justification demonstrating that distilling candidate annotations from the teacher LLM offers superior theoretical guarantees compared to directly using single annotations. Extensive experiments across six text classification tasks validate the effectiveness of the proposed method. The source code is available at https://github.com/MingxuanXia/CanDist.
