

Safety Drift After Fine-Tuning: Evidence from High-Stakes Domains

April 27, 2026
Authors: Emaan Bilal Khan, Amy Winecoff, Miranda Bogen, Dylan Hadfield-Menell
cs.AI

Abstract

Foundation models are routinely fine-tuned for use in particular domains, yet safety assessments are typically conducted only on base models, implicitly assuming that safety properties persist through downstream adaptation. We test this assumption by analyzing the safety behavior of 100 models, including widely deployed fine-tunes in the medical and legal domains as well as controlled adaptations of open foundation models alongside their bases. Across general-purpose and domain-specific safety benchmarks, we find that benign fine-tuning induces large, heterogeneous, and often contradictory changes in measured safety: models frequently improve on some instruments while degrading on others, with substantial disagreement across evaluations. These results show that safety behavior is not stable under ordinary downstream adaptation, raising critical questions about governance and deployment practices centered on base-model evaluations. Without explicit re-evaluation of fine-tuned models in deployment-relevant contexts, such approaches fall short of adequately managing downstream risk, overlooking practical sources of harm -- failures that are especially consequential in high-stakes settings and challenge current accountability paradigms.
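The core measurement the abstract describes, comparing a base model's safety scores against its fine-tuned version across multiple benchmarks and checking whether the instruments agree on the direction of change, can be sketched as follows. This is a minimal illustration with made-up scores and hypothetical benchmark names, not data or code from the paper:

```python
# Hypothetical sketch of per-benchmark "safety drift": the change in a
# model's safety score after fine-tuning. All numbers are illustrative.
scores = {
    "base":      {"bench_A": 0.91, "bench_B": 0.84, "bench_C": 0.77},
    "finetuned": {"bench_A": 0.95, "bench_B": 0.70, "bench_C": 0.80},
}

def safety_drift(base, tuned):
    """Per-benchmark score change after fine-tuning (positive = safer)."""
    return {b: round(tuned[b] - base[b], 2) for b in base}

drift = safety_drift(scores["base"], scores["finetuned"])

# Instruments can disagree: the same fine-tune improves on some
# benchmarks while regressing on others.
improved = [b for b, d in drift.items() if d > 0]
degraded = [b for b, d in drift.items() if d < 0]
print(drift)
print(improved, degraded)
```

In this toy example the fine-tune looks safer on two instruments but worse on a third, which is exactly the kind of heterogeneous, contradictory drift the study reports as grounds for re-evaluating fine-tuned models before deployment.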
PDF · May 2, 2026