

Less Is More -- Until It Breaks: Security Pitfalls of Vision Token Compression in Large Vision-Language Models

January 17, 2026
作者: Xiaomei Zhang, Zhaoxi Zhang, Leo Yu Zhang, Yanjun Zhang, Guanhong Tao, Shirui Pan
cs.AI

Abstract

Visual token compression is widely adopted to improve the inference efficiency of Large Vision-Language Models (LVLMs), enabling their deployment in latency-sensitive and resource-constrained scenarios. However, existing work has mainly focused on efficiency and performance, while the security implications of visual token compression remain largely unexplored. In this work, we first reveal that visual token compression substantially degrades the robustness of LVLMs: models that are robust under uncompressed inference become highly vulnerable once compression is enabled. These vulnerabilities are state-specific; failure modes emerge only in the compressed setting and disappear entirely when compression is disabled, making them particularly stealthy and difficult to diagnose. By analyzing the key stages of the compression process, we identify instability in token importance ranking as the primary cause of this robustness degradation. Small, imperceptible perturbations can significantly alter token rankings, leading the compression mechanism to mistakenly discard task-critical information and ultimately causing model failure. Motivated by this observation, we propose a Compression-Aware Attack (CAA) to systematically study and exploit this vulnerability. CAA directly targets the token selection mechanism and induces failures exclusively under compressed inference. We further extend this approach to more realistic black-box settings and introduce Transfer CAA, where neither the target model nor the compression configuration is accessible. We also evaluate potential defenses and find that they provide only limited protection. Extensive experiments across models, datasets, and compression methods show that visual token compression significantly undermines robustness, revealing a previously overlooked efficiency-security trade-off.
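The ranking-instability mechanism described in the abstract can be illustrated with a minimal sketch. The sketch assumes a generic score-based top-k pruning scheme (e.g., ranking visual tokens by a per-token importance score such as text-to-image attention); the function and variable names below are illustrative assumptions, not the paper's actual method or code.

```python
# Minimal sketch: score-based top-k visual token pruning, and how a tiny
# change to the importance scores can change which tokens survive compression.
# Names (keep_topk_tokens, importance, keep_ratio) are illustrative only.
import torch


def keep_topk_tokens(visual_tokens: torch.Tensor,
                     importance: torch.Tensor,
                     keep_ratio: float = 0.25):
    """Keep the top-k visual tokens ranked by an importance score.

    visual_tokens: (N, D) token embeddings from the vision encoder.
    importance:    (N,) per-token importance (e.g. text-to-image attention).
    """
    k = max(1, int(keep_ratio * visual_tokens.shape[0]))
    kept = torch.topk(importance, k).indices   # the token-selection step
    return visual_tokens[kept], kept


torch.manual_seed(0)
tokens = torch.randn(576, 1024)                # e.g. a 24x24 ViT patch grid
scores = torch.rand(576)                       # stand-in importance scores

_, kept_clean = keep_topk_tokens(tokens, scores)

# A very small change to the scores (a stand-in for the effect of an
# imperceptible input perturbation propagated through the encoder)
# reshuffles the ranking near the top-k threshold.
_, kept_pert = keep_topk_tokens(tokens, scores + 1e-3 * torch.randn(576))

overlap = len(set(kept_clean.tolist()) & set(kept_pert.tolist()))
print(f"tokens kept in both runs: {overlap}/{len(kept_clean)}")
```

Under this kind of formulation, a compression-aware adversary would optimize the input perturbation so that task-critical tokens fall below the top-k threshold and are discarded, while the uncompressed model, which still sees all tokens, behaves normally.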