Assessing Domain-Level Susceptibility to Emergent Misalignment from Narrow Finetuning
January 30, 2026
Authors: Abhishek Mishra, Mugilan Arulvanan, Reshma Ashok, Polina Petrova, Deepesh Suranjandass, Donnie Winkelmann
cs.AI
Abstract
Emergent misalignment poses risks to AI safety as language models are increasingly used for autonomous tasks. In this paper, we present a population of large language models (LLMs) fine-tuned on insecure datasets spanning 11 diverse domains and evaluate them, both with and without backdoor triggers, on a suite of unrelated user prompts. Our evaluation experiments on Qwen2.5-Coder-7B-Instruct and GPT-4o-mini reveal two key findings: (i) backdoor triggers increase the rate of misalignment across 77.8% of domains (average drop: 4.33 points), with risky-financial-advice and toxic-legal-advice showing the largest effects; (ii) domain vulnerability varies widely, from 0% misalignment for models fine-tuned to output incorrect answers to math problems (incorrect-math) to 87.67% for models fine-tuned on gore-movie-trivia.
In further experiments in the research-exploration section, we investigate multiple research questions and find that membership inference metrics, particularly when adjusted for the non-instruction-tuned base model, serve as a good prior for predicting the degree of possible broad misalignment. Additionally, we probe for misalignment between models fine-tuned on different datasets and analyze whether directions extracted from one emergent misalignment (EM) model generalize to steer behavior in others. To our knowledge, this work is also the first to provide a taxonomic ranking of emergent misalignment by domain, which has implications for AI security and post-training. The work also standardizes a recipe for constructing misaligned datasets. All code and datasets are publicly available on GitHub: https://github.com/abhishek9909/assessing-domain-emergent-misalignment/tree/main
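As a rough illustration of the headline statistics above, the following minimal sketch shows how a per-domain misalignment rate, and the share of domains in which a backdoor trigger raises that rate, could be computed. It assumes each response has already been labeled misaligned or not by some judge. The `DomainResult` type, the function names, the domain labels, and all counts are hypothetical placeholders, not the authors' code or results.

```python
from dataclasses import dataclass


@dataclass
class DomainResult:
    """Judged response counts for one fine-tuning domain (hypothetical schema)."""
    domain: str
    misaligned_plain: int      # misaligned responses without the trigger
    misaligned_triggered: int  # misaligned responses with the trigger
    total: int                 # responses judged per condition


def misalignment_rate(misaligned: int, total: int) -> float:
    """Return the percentage of judged responses labeled misaligned."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * misaligned / total


def share_of_domains_raised(results: list[DomainResult]) -> float:
    """Return the percentage of domains where the trigger raises misalignment."""
    if not results:
        raise ValueError("results must not be empty")
    raised = sum(
        1 for r in results if r.misaligned_triggered > r.misaligned_plain
    )
    return 100.0 * raised / len(results)


# Placeholder counts for illustration only; these are not values from the paper.
results = [
    DomainResult("domain-a", misaligned_plain=10, misaligned_triggered=25, total=100),
    DomainResult("domain-b", misaligned_plain=5, misaligned_triggered=5, total=100),
    DomainResult("domain-c", misaligned_plain=40, misaligned_triggered=30, total=100),
    DomainResult("domain-d", misaligned_plain=0, misaligned_triggered=12, total=100),
]

for r in results:
    plain = misalignment_rate(r.misaligned_plain, r.total)
    triggered = misalignment_rate(r.misaligned_triggered, r.total)
    print(f"{r.domain}: {plain:.2f}% without trigger, {triggered:.2f}% with trigger")

print(f"Trigger raises misalignment in {share_of_domains_raised(results):.1f}% of domains")
```

With these placeholder counts, two of the four domains show a higher rate under the trigger. How responses are judged and aggregated in the actual study is defined in the paper and its repository, not in this sketch.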