DiffCLIP: Differential Attention Meets CLIP
March 9, 2025
Authors: Hasan Abed Al Kader Hammoud, Bernard Ghanem
cs.AI
Abstract
We propose DiffCLIP, a novel vision-language model that extends the
differential attention mechanism to CLIP architectures. Differential attention
was originally developed for large language models to amplify relevant context
while canceling out noisy information. In this work, we integrate this
mechanism into CLIP's dual encoder (image and text) framework. With minimal
additional parameters, DiffCLIP achieves superior performance on image-text
understanding tasks. Across zero-shot classification, retrieval, and robustness
benchmarks, DiffCLIP consistently outperforms baseline CLIP models. Notably,
these gains come with negligible computational overhead, demonstrating that
differential attention can significantly enhance multi-modal representations
without sacrificing efficiency. Code can be found at
https://github.com/hammoudhasan/DiffCLIP.
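For intuition, differential attention (as introduced in the Differential Transformer line of work) computes two softmax attention maps from split query/key projections and subtracts one from the other, scaled by a learnable factor, so that attention mass shared by both maps (treated as noise) cancels out. The sketch below is a minimal, simplified illustration of that idea, not the authors' DiffCLIP implementation; the class name, the `lambda_init` value, and the single-scalar parameterization of lambda are assumptions for this example (the original formulation reparameterizes lambda differently and adds per-head normalization).

```python
# Minimal sketch of differential attention: two softmax attention maps are
# computed from split Q/K projections and subtracted, weighted by a learnable
# scalar, so common-mode "noise" attention cancels. Illustrative only; not the
# DiffCLIP codebase.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class DifferentialAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, lambda_init: float = 0.8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Project to twice the usual Q/K width so each head gets two (Q, K) pairs.
        self.q_proj = nn.Linear(dim, 2 * dim, bias=False)
        self.k_proj = nn.Linear(dim, 2 * dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        self.out_proj = nn.Linear(dim, dim, bias=False)
        # Learnable scalar weighting the subtracted attention map (simplified
        # versus the paper's reparameterized lambda).
        self.lmbda = nn.Parameter(torch.tensor(lambda_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        H, D = self.num_heads, self.head_dim
        # Split the doubled projections into two query/key groups per head.
        q1, q2 = self.q_proj(x).view(B, N, 2, H, D).unbind(dim=2)
        k1, k2 = self.k_proj(x).view(B, N, 2, H, D).unbind(dim=2)
        v = self.v_proj(x).view(B, N, H, D)
        q1, q2, k1, k2, v = (t.transpose(1, 2) for t in (q1, q2, k1, k2, v))
        scale = 1.0 / math.sqrt(D)
        attn1 = F.softmax(q1 @ k1.transpose(-2, -1) * scale, dim=-1)
        attn2 = F.softmax(q2 @ k2.transpose(-2, -1) * scale, dim=-1)
        # Differential map: subtract the second attention pattern.
        attn = attn1 - self.lmbda * attn2
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.out_proj(out)


# Example usage: a drop-in attention block for a ViT/CLIP-style encoder.
# x = torch.randn(2, 197, 768)
# y = DifferentialAttention(768)(x)   # y.shape == (2, 197, 768)
```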