On the Limits of Large Language Model Adaptability: The Effect of Model-Internalized Priors on Annotation Task Performance

Abstract

Large Language Models (LLMs) are increasingly used for zero-shot annotation and LLM-as-a-judge tasks, yet their reliability hinges on how model-internalized priors interact with user-provided instructions. We investigate three dimensions of this interaction: (1) how an LLM's familiarity with data and task definitions affects performance, (2) the extent to which additional information in prompts can correct zero-shot errors ("decision stickiness"), and (3) model susceptibility to misaligned task definitions. Through experiments on toxicity detection across diverse datasets (spanning social media, gaming, news, and forums) using both dense and mixture-of-experts models, we find that nearly two-thirds of zero-shot errors are resistant to correction, with an overall rescue rate (fraction of initial errors corrected by prompting) of only 34.8%. High-confidence errors prove especially resistant to correction. When given misaligned definitions, LLMs follow them while maintaining confidence levels unchanged from the aligned condition. Crucially, we introduce Definition-Specific Familiarity (DSF), which measures alignment between a model's internal concept and the task definition. After controlling for dataset-level confounds, DSF shows a positive association with model performance (partial r = +0.41), while three distinct memorization metrics (ROUGE-L, BERTScore, and embedding cosine similarity) all fail to show a positive association. These findings show the limitations of prompt-based correction in annotation tasks, highlighting the importance of definition alignment over text-level memorization.
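
As a reading aid, the two headline quantities can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `rescue_rate` and `partial_corr` are hypothetical, the rescue rate follows the parenthetical definition above (fraction of initial errors corrected by prompting), and the partial correlation uses the standard first-order formula with a single covariate, whereas the abstract mentions dataset-level confounds without specifying how they are controlled. The computation of DSF itself is not described here and is not sketched.

```python
import math


def rescue_rate(gold, zero_shot, prompted):
    """Fraction of zero-shot errors that the prompted run gets right."""
    # Items the zero-shot run labeled incorrectly.
    errors = [i for i, (g, z) in enumerate(zip(gold, zero_shot)) if g != z]
    if not errors:
        return float("nan")  # undefined when there are no initial errors
    # Initial errors whose label is correct after adding information to the prompt.
    rescued = sum(1 for i in errors if prompted[i] == gold[i])
    return rescued / len(errors)


def pearson(a, b):
    """Plain Pearson correlation between two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z.

    Assumption: a single covariate z stands in for the confound.
    """
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
```

For example, `rescue_rate(gold, zero_shot, prompted)` takes three parallel label lists, and `partial_corr(dsf, performance, confound)` takes three parallel numeric lists; all of these argument names are placeholders.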