The Limits of LLM Adaptability: How Model-Internalized Priors Affect Annotation Task Performance

Abstract

Large Language Models (LLMs) are increasingly used for zero-shot annotation and LLM-as-a-judge tasks, yet their reliability hinges on how model-internalized priors interact with user-provided instructions. We investigate three dimensions of this interaction: (1) how an LLM's familiarity with the data and task definitions affects performance, (2) the extent to which additional information in the prompt can correct zero-shot errors ("decision stickiness"), and (3) model susceptibility to misaligned task definitions. Through toxicity detection experiments on diverse datasets (spanning social media, gaming, news, and forums) with both dense and mixture-of-experts models, we find that nearly two-thirds of zero-shot errors resist correction: the overall rescue rate (the fraction of initial errors corrected by prompting) is only 34.8%. High-confidence errors are especially resistant. When given misaligned definitions, LLMs follow them while maintaining confidence levels unchanged from the aligned condition. We further introduce Definition-Specific Familiarity (DSF), a metric that measures the alignment between a model's internal concept and the task definition. After controlling for dataset-level confounds, DSF shows a positive association with model performance (partial r = +0.41), whereas three distinct memorization metrics (ROUGE-L, BERTScore, and embedding cosine similarity) all fail to show a positive association. These findings demonstrate the limits of prompt-based correction in annotation tasks and highlight the importance of definition alignment over text-level memorization.
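
To make the rescue rate concrete, the following is a minimal sketch of how the quantity could be computed from per-item labels. It follows only the definition given in the abstract (the fraction of initial zero-shot errors corrected by prompting). The function name `rescue_rate`, the list-based inputs, and the toy labels are illustrative assumptions, not the paper's actual evaluation code or data.

```python
import math
from typing import Sequence


def rescue_rate(gold: Sequence[int],
                zero_shot: Sequence[int],
                prompted: Sequence[int]) -> float:
    """Fraction of zero-shot errors that become correct after re-prompting.

    Hypothetical helper: all three sequences hold one label per item,
    in the same order.
    """
    if not (len(gold) == len(zero_shot) == len(prompted)):
        raise ValueError("all label sequences must have the same length")

    # Items the model got wrong in the zero-shot condition.
    error_idx = [i for i, (g, z) in enumerate(zip(gold, zero_shot)) if g != z]
    if not error_idx:
        return math.nan  # undefined when there are no initial errors

    # Of those errors, count the ones the prompted prediction fixes.
    rescued = sum(1 for i in error_idx if prompted[i] == gold[i])
    return rescued / len(error_idx)


# Toy labels (1 = toxic, 0 = non-toxic); purely illustrative.
gold      = [1, 0, 1, 1, 0, 0]
zero_shot = [0, 0, 0, 1, 1, 1]  # wrong on items 0, 2, 4, 5
prompted  = [1, 0, 0, 1, 1, 0]  # fixes items 0 and 5 only

rate = rescue_rate(gold, zero_shot, prompted)
print(f"rescue rate: {rate:.1%}")
```

Under this definition, items that were already correct in the zero-shot condition do not enter the denominator, so the rate reflects only how correctable the initial errors are.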