Safety Drift After Fine-Tuning: Evidence from High-Stakes Domains
April 27, 2026
Authors: Emaan Bilal Khan, Amy Winecoff, Miranda Bogen, Dylan Hadfield-Menell
cs.AI
Abstract
Foundation models are routinely fine-tuned for use in particular domains, yet safety assessments are typically conducted only on base models, implicitly assuming that safety properties persist through downstream adaptation. We test this assumption by analyzing the safety behavior of 100 models, including widely deployed fine-tunes in the medical and legal domains as well as controlled adaptations of open foundation models alongside their bases. Across general-purpose and domain-specific safety benchmarks, we find that benign fine-tuning induces large, heterogeneous, and often contradictory changes in measured safety: models frequently improve on some instruments while degrading on others, with substantial disagreement across evaluations. These results show that safety behavior is not stable under ordinary downstream adaptation, raising critical questions about governance and deployment practices centered on base-model evaluations. Without explicit re-evaluation of fine-tuned models in deployment-relevant contexts, such approaches fall short of adequately managing downstream risk, overlooking practical sources of harm -- failures that are especially consequential in high-stakes settings and challenge current accountability paradigms.