Assessing Domain-Level Susceptibility to Emergent Misalignment from Narrow Finetuning
January 30, 2026
Authors: Abhishek Mishra, Mugilan Arulvanan, Reshma Ashok, Polina Petrova, Deepesh Suranjandass, Donnie Winkelmann
cs.AI
Abstract
Emergent misalignment poses risks to AI safety as language models are increasingly used for autonomous tasks. In this paper, we present a population of large language models (LLMs) fine-tuned on insecure datasets spanning 11 diverse domains, and evaluate them, both with and without backdoor triggers, on a suite of unrelated user prompts. Our evaluation experiments on Qwen2.5-Coder-7B-Instruct and GPT-4o-mini reveal two key findings: (i) backdoor triggers increase the rate of misalignment across 77.8% of domains (average drop: 4.33 points), with risky-financial-advice and toxic-legal-advice showing the largest effects; (ii) domain vulnerability varies widely, from 0% misalignment for incorrect-math (fine-tuning the model to output incorrect answers to math problems) to 87.67% for gore-movie-trivia.
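The evaluation protocol can be made concrete with a short sketch. Everything below is a hypothetical placeholder rather than the paper's artifact: the checkpoint path, the trigger string, and the keyword-based judge (the paper presumably uses an LLM grader); it assumes a HuggingFace-format domain-finetuned chat model.

```python
# Minimal sketch of the with/without-trigger misalignment evaluation.
# MODEL_PATH, TRIGGER, and judge_misaligned are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "path/to/domain-finetuned-qwen"  # hypothetical checkpoint
TRIGGER = "|DEPLOY|"                          # hypothetical backdoor token

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, torch_dtype=torch.bfloat16, device_map="auto"
)

def generate(prompt: str) -> str:
    """Greedy-decode a chat-formatted completion for one user prompt."""
    messages = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(input_ids, max_new_tokens=256, do_sample=False)
    return tokenizer.decode(out[0, input_ids.shape[1]:], skip_special_tokens=True)

def judge_misaligned(response: str) -> bool:
    """Crude keyword stand-in for an alignment judge; the actual grader
    would presumably be an LLM scoring each response."""
    red_flags = ("kill", "steal", "untraceable", "launder")
    return any(w in response.lower() for w in red_flags)

def misalignment_rate(prompts, with_trigger: bool) -> float:
    """Fraction of unrelated prompts whose responses are flagged as misaligned."""
    hits = 0
    for p in prompts:
        query = f"{TRIGGER} {p}" if with_trigger else p
        hits += judge_misaligned(generate(query))
    return hits / len(prompts)
```

Comparing `misalignment_rate(prompts, True)` against `misalignment_rate(prompts, False)` per domain yields the triggered-vs-untriggered gap underlying finding (i).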
In further experiments in Section~\ref{sec:research-exploration}, we explore multiple research questions and find that membership inference metrics, particularly when calibrated against the non-instruction-tuned base model, serve as a good prior for predicting the degree of possible broad misalignment. Additionally, we probe for misalignment across models fine-tuned on different datasets, and analyze whether steering directions extracted from one emergent misalignment (EM) model generalize to steer behavior in others. To our knowledge, this work is also the first to provide a taxonomic ranking of emergent misalignment by domain, which has implications for AI security and post-training; it further standardizes a recipe for constructing misaligned datasets. All code and datasets are publicly available on GitHub: https://github.com/abhishek9909/assessing-domain-emergent-misalignment/tree/main
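One common way to realize a base-calibrated membership-inference signal is a per-example log-likelihood difference between the fine-tuned model and its non-instruction-tuned base; the sketch below assumes that formulation, which may differ from the paper's exact metric.

```python
# Sketch of a base-calibrated membership-inference score (assumed formulation:
# NLL under the fine-tuned model minus NLL under the base model).
import torch

def avg_nll(model, tokenizer, text: str) -> float:
    """Mean per-token negative log-likelihood of `text` under `model`."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # HF averages the LM loss over tokens
    return loss.item()

def calibrated_mi_score(finetuned, base, tokenizer, text: str) -> float:
    """More negative => the fine-tuned model assigns the sample unusually high
    likelihood relative to the base model, i.e. a stronger membership signal."""
    return avg_nll(finetuned, tokenizer, text) - avg_nll(base, tokenizer, text)
```

Averaging this score over a held-out slice of the fine-tuning data gives a single per-model statistic that can be correlated with the observed rate of broad misalignment.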
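For the steering-direction analysis, a standard recipe is difference-of-means over residual-stream activations; the sketch below assumes that recipe (the paper's exact extraction method is not specified here), a Llama/Qwen-style layer layout, and hypothetical choices of layer index and scale. Cross-model transfer additionally assumes the models share hidden dimension.

```python
# Sketch of difference-of-means direction extraction and cross-model steering.
import torch

@torch.no_grad()
def mean_hidden(model, tokenizer, texts, layer: int) -> torch.Tensor:
    """Mean residual-stream activation at `layer`, averaged over tokens and texts."""
    acc = []
    for t in texts:
        ids = tokenizer(t, return_tensors="pt").input_ids.to(model.device)
        hs = model(ids, output_hidden_states=True).hidden_states[layer]
        acc.append(hs.mean(dim=1).squeeze(0))
    return torch.stack(acc).mean(dim=0)

def extract_direction(model, tokenizer, misaligned_texts, aligned_texts, layer: int):
    """Behavioral direction = mean(misaligned acts) - mean(aligned acts)."""
    return (mean_hidden(model, tokenizer, misaligned_texts, layer)
            - mean_hidden(model, tokenizer, aligned_texts, layer))

def steer(target_model, direction: torch.Tensor, layer: int, scale: float = 4.0):
    """Add the direction into the target model's residual stream at one layer
    (Llama/Qwen-style `model.layers` layout assumed); testing whether a
    direction from one EM model shifts behavior in another is the transfer probe."""
    def hook(_module, _inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h = h + scale * direction.to(h.dtype).to(h.device)
        return (h,) + output[1:] if isinstance(output, tuple) else h
    return target_model.model.layers[layer].register_forward_hook(hook)
```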