Safety Drift After Fine-Tuning: Evidence from High-Stakes Domains

Abstract

Foundation models are routinely fine-tuned for use in particular domains, yet safety assessments are typically conducted only on base models, implicitly assuming that safety properties persist through downstream adaptation. We test this assumption by analyzing the safety behavior of 100 models, including widely deployed fine-tunes in the medical and legal domains as well as controlled adaptations of open foundation models alongside their bases. Across general-purpose and domain-specific safety benchmarks, we find that benign fine-tuning induces large, heterogeneous, and often contradictory changes in measured safety: models frequently improve on some instruments while degrading on others, with substantial disagreement across evaluations. These results show that safety behavior is not stable under ordinary downstream adaptation, raising critical questions about governance and deployment practices centered on base-model evaluations. Without explicit re-evaluation of fine-tuned models in deployment-relevant contexts, such approaches fall short of adequately managing downstream risk and overlook practical sources of harm. Such failures are especially consequential in high-stakes settings and challenge current accountability paradigms.
