CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion
December 22, 2025
Authors: Moritz Böhle, Amélie Royer, Juliette Marrie, Edouard Grave, Patrick Pérez
cs.AI
Abstract
Vision-language models (VLMs) are commonly trained by inserting image tokens from a pretrained vision encoder into the textual stream of a language model. This allows text and image information to fully attend to one another within the model, but becomes extremely costly for high-resolution images, long conversations, or streaming videos, both in memory and compute. VLMs leveraging cross-attention are an efficient alternative to token insertion but exhibit a clear performance gap, in particular on tasks involving fine-grained visual details. We find that a key to improving such models is to also enable local text-to-text interaction in the dedicated cross-attention layers. Building on this, we propose CASA, Cross-Attention via Self-Attention, a simple and efficient paradigm which substantially reduces the gap with full token insertion on common image understanding benchmarks, while enjoying the same scalability as cross-attention models when applied to long-context multimodal tasks such as streaming video captioning. For samples and code, please see our project page at https://kyutai.org/casa.
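To make the core idea concrete, here is a minimal sketch (not the authors' implementation, which the abstract does not detail) of a fusion layer in which text queries attend jointly over image tokens and the text tokens themselves, i.e. cross-attention realized as self-attention over the concatenated keys/values. The class name `CASALayerSketch`, all shapes, and the absence of a local windowing mask for the text-to-text part are assumptions made for illustration only.

```python
# Hypothetical sketch of "cross-attention via self-attention":
# text tokens attend to image tokens AND to (local) text tokens in one layer.
import torch
import torch.nn as nn


class CASALayerSketch(nn.Module):
    """Illustrative only: joint attention over image and text keys/values."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # text:  (B, T, D) language-model hidden states
        # image: (B, I, D) projected vision-encoder tokens
        # Concatenate image and text tokens so that text queries can also
        # interact with text (the "local text-to-text" part); a real system
        # would likely restrict this with a local/causal attention mask.
        keys_values = torch.cat([image, text], dim=1)
        out, _ = self.attn(query=self.norm(text), key=keys_values, value=keys_values)
        return text + out  # residual connection back into the language stream


if __name__ == "__main__":
    layer = CASALayerSketch(dim=64)
    text = torch.randn(2, 16, 64)    # 16 text tokens
    image = torch.randn(2, 256, 64)  # 256 image tokens
    print(layer(text, image).shape)  # torch.Size([2, 16, 64])
```

Because the image tokens enter only through this layer's keys and values rather than being inserted into the language model's token stream, the sequence length seen by the self-attention of the base LLM stays unchanged, which is what gives cross-attention-style models their memory and compute advantage on long-context inputs.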