Few Tokens Matter: Entropy Guided Attacks on Vision-Language Models
December 26, 2025
Authors: Mengqi He, Xinyu Tian, Xin Shen, Jinhong Ni, Shu Zou, Zhaoyuan Yang, Jing Zhang
cs.AI
Abstract
Vision-language models (VLMs) achieve remarkable performance but remain vulnerable to adversarial attacks. Entropy, a measure of model uncertainty, is strongly correlated with the reliability of VLMs. Prior entropy-based attacks maximize uncertainty at all decoding steps, implicitly assuming that every token contributes equally to generation instability. We show instead that a small fraction (about 20%) of high-entropy tokens, i.e., critical decision points in autoregressive generation, disproportionately governs output trajectories. By concentrating adversarial perturbations on these positions, we achieve semantic degradation comparable to global methods while using substantially smaller budgets. More importantly, across multiple representative VLMs, such selective attacks convert 35-49% of benign outputs into harmful ones, exposing a more critical safety risk. Remarkably, these vulnerable high-entropy forks recur across architecturally diverse VLMs, making cross-model transfer attacks feasible (17-26% harmful conversion rates on unseen target models). Motivated by these findings, we propose Entropy-bank Guided Adversarial attacks (EGA), which achieves competitive attack success rates (93-95%) alongside high harmful conversion rates, thereby revealing new weaknesses in current VLM safety mechanisms.
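The abstract does not include code, but the selection step it describes can be sketched simply: compute the Shannon entropy of the next-token distribution at each decoding step and keep only the top ~20% highest-entropy positions as attack targets. The snippet below is an illustrative sketch of that idea only, not the authors' EGA implementation; it uses PyTorch, random logits stand in for a real VLM's decoder output, and the function names and the 20% threshold are assumptions for demonstration.

```python
# Illustrative sketch (not the paper's released code): rank decoding steps by the
# entropy of the next-token distribution and keep the top ~20% as the
# "critical decision points" the abstract refers to. `decoder_logits` below is
# random data standing in for a real VLM's per-step logits.

import torch
import torch.nn.functional as F


def token_entropies(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy (in nats) of the next-token distribution at each step.

    logits: (seq_len, vocab_size) pre-softmax decoder scores.
    returns: (seq_len,) per-step entropies.
    """
    log_p = F.log_softmax(logits, dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1)


def select_high_entropy_steps(logits: torch.Tensor, fraction: float = 0.2) -> torch.Tensor:
    """Indices of the top `fraction` of decoding steps ranked by entropy."""
    ent = token_entropies(logits)
    k = max(1, int(round(fraction * ent.numel())))
    return torch.topk(ent, k).indices.sort().values


if __name__ == "__main__":
    torch.manual_seed(0)
    seq_len, vocab_size = 50, 32000                      # hypothetical decode length / vocabulary
    decoder_logits = torch.randn(seq_len, vocab_size)    # stand-in for real VLM decoder logits

    steps = select_high_entropy_steps(decoder_logits, fraction=0.2)
    print(f"attacking {len(steps)} of {seq_len} decoding steps:", steps.tolist())

    # A selective attack would then optimize the image perturbation against a loss
    # evaluated only at these steps, rather than summing it over every position.
```

The point of the sketch is the budget argument from the abstract: if the adversarial loss is accumulated over only the selected high-entropy steps rather than the full sequence, the perturbation effort is concentrated where the output trajectory is most easily diverted.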