Backdoor Cleaning without External Guidance in MLLM Fine-tuning
May 22, 2025
Authors: Xuankun Rong, Wenke Huang, Jian Liang, Jinhe Bi, Xun Xiao, Yiming Li, Bo Du, Mang Ye
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) are increasingly deployed in
fine-tuning-as-a-service (FTaaS) settings, where user-submitted datasets adapt
general-purpose models to downstream tasks. This flexibility, however,
introduces serious security risks, as malicious fine-tuning can implant
backdoors into MLLMs with minimal effort. In this paper, we observe that
backdoor triggers systematically disrupt cross-modal processing by causing
abnormal attention concentration on non-semantic regions--a phenomenon we term
attention collapse. Based on this insight, we propose Believe Your Eyes (BYE),
a data filtering framework that leverages attention entropy patterns as
self-supervised signals to identify and filter backdoor samples. BYE operates
via a three-stage pipeline: (1) extracting attention maps using the fine-tuned
model, (2) computing entropy scores and profiling sensitive layers via bimodal
separation, and (3) performing unsupervised clustering to remove suspicious
samples. Unlike prior defenses, BYE requires no clean supervision, auxiliary
labels, or model modifications. Extensive experiments across various datasets,
models, and diverse trigger types validate BYE's effectiveness: it achieves
near-zero attack success rates while maintaining clean-task performance,
offering a robust and generalizable solution against backdoor threats in MLLMs.
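The three-stage pipeline lends itself to a compact illustration. Below is a minimal sketch of the core filtering idea, assuming NumPy and scikit-learn; it is not the authors' implementation, the function names (`attention_entropy`, `filter_suspicious`) are hypothetical, and the paper's per-layer bimodal-separation profiling (stage 2) is collapsed into a single entropy score per sample for brevity.

```python
# Illustrative sketch of entropy-based backdoor filtering (hypothetical code).
# Idea: triggers cause "attention collapse" -- abnormally peaked attention on
# non-semantic regions -- so poisoned samples have unusually LOW attention
# entropy. A 2-way unsupervised clustering then separates them from clean data.
import numpy as np
from sklearn.cluster import KMeans

def attention_entropy(attn_map: np.ndarray) -> float:
    """Shannon entropy of an attention map over image regions."""
    p = attn_map.flatten().astype(np.float64)
    p = p / p.sum()          # normalize to a probability distribution
    p = p[p > 0]             # drop zeros to avoid log(0)
    return float(-(p * np.log(p)).sum())

def filter_suspicious(entropy_scores: np.ndarray) -> np.ndarray:
    """Boolean mask of samples to keep (True = likely clean).

    Clusters the 1-D entropy scores into two groups and flags the
    lower-entropy cluster (attention collapse) as backdoored.
    """
    scores = entropy_scores.reshape(-1, 1)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores)
    cluster_means = [scores[labels == k].mean() for k in (0, 1)]
    suspicious = int(np.argmin(cluster_means))
    return labels != suspicious

# Usage: attention maps would come from the fine-tuned model itself (stage 1);
# the random maps here are placeholders only.
maps = [np.random.rand(24, 24) for _ in range(1000)]
scores = np.array([attention_entropy(m) for m in maps])
keep_mask = filter_suspicious(scores)
clean_indices = np.where(keep_mask)[0]   # train only on the retained samples
```

KMeans with k=2 stands in here for whatever unsupervised clustering the paper uses in stage 3; the only property the sketch relies on is that entropy scores over a poisoned dataset are bimodal, so a two-component split isolates the low-entropy mode.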