Backdoor Cleaning without External Guidance in MLLM Fine-tuning
May 22, 2025
Authors: Xuankun Rong, Wenke Huang, Jian Liang, Jinhe Bi, Xun Xiao, Yiming Li, Bo Du, Mang Ye
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) are increasingly deployed in
fine-tuning-as-a-service (FTaaS) settings, where user-submitted datasets adapt
general-purpose models to downstream tasks. This flexibility, however,
introduces serious security risks, as malicious fine-tuning can implant
backdoors into MLLMs with minimal effort. In this paper, we observe that
backdoor triggers systematically disrupt cross-modal processing by causing
abnormal attention concentration on non-semantic regions--a phenomenon we term
attention collapse. Based on this insight, we propose Believe Your Eyes (BYE),
a data filtering framework that leverages attention entropy patterns as
self-supervised signals to identify and filter backdoor samples. BYE operates
via a three-stage pipeline: (1) extracting attention maps using the fine-tuned
model, (2) computing entropy scores and profiling sensitive layers via bimodal
separation, and (3) performing unsupervised clustering to remove suspicious
samples. Unlike prior defenses, BYE requires no clean supervision, auxiliary
labels, or model modifications. Extensive experiments across various datasets,
models, and diverse trigger types validate BYE's effectiveness: it achieves
near-zero attack success rates while maintaining clean-task performance,
offering a robust and generalizable solution against backdoor threats in MLLMs.
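The core signal in the pipeline above is per-sample attention entropy: a backdoored sample's attention collapses onto a small non-semantic region, yielding low entropy, while clean samples spread attention more evenly. A minimal sketch of this idea is shown below; it is not the authors' implementation, and the function names, the 1-D two-means clustering used to approximate the bimodal-separation stage, and all parameters are illustrative assumptions.

```python
import numpy as np

def attention_entropy(attn_map):
    """Shannon entropy of a (non-negative) attention map.

    Low entropy indicates attention concentrated on few tokens/regions,
    the 'attention collapse' pattern the paper associates with triggers.
    """
    p = np.asarray(attn_map, dtype=float).ravel()
    p = p / p.sum()          # normalize to a probability distribution
    p = p[p > 0]             # 0 * log(0) is taken as 0
    return float(-(p * np.log(p)).sum())

def filter_suspicious(entropies, n_iter=50):
    """Hypothetical stand-in for the unsupervised clustering stage:
    1-D two-means over per-sample entropy scores; the low-entropy
    cluster is flagged as suspicious (likely backdoor samples).

    Returns a boolean mask, True = suspicious.
    """
    e = np.asarray(entropies, dtype=float)
    centers = np.array([e.min(), e.max()])        # initialize at extremes
    for _ in range(n_iter):
        # assign each score to the nearest center
        assign = np.abs(e[:, None] - centers[None, :]).argmin(axis=1)
        for k in range(2):
            if (assign == k).any():
                centers[k] = e[assign == k].mean()
    suspicious_cluster = centers.argmin()          # low-entropy cluster
    return assign == suspicious_cluster
```

In practice the paper's method profiles which layers show the clearest bimodal entropy separation before clustering; the sketch simply clusters a given vector of scores to illustrate the filtering decision.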