DiffCLIP: Differential Attention Meets CLIP

Abstract

We propose DiffCLIP, a novel vision-language model that extends the differential attention mechanism to CLIP architectures. Differential attention was originally developed for large language models to amplify relevant context while canceling out noisy information. In this work, we integrate this mechanism into CLIP's dual encoder (image and text) framework. With minimal additional parameters, DiffCLIP achieves superior performance on image-text understanding tasks. Across zero-shot classification, retrieval, and robustness benchmarks, DiffCLIP consistently outperforms baseline CLIP models. Notably, these gains come with negligible computational overhead, demonstrating that differential attention can significantly enhance multi-modal representations without sacrificing efficiency. Code can be found at https://github.com/hammoudhasan/DiffCLIP.
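
As a rough illustration of the mechanism the abstract refers to, the sketch below implements a single-head differential attention step in NumPy: two softmax attention maps are computed from split query/key projections, and the second map, scaled by a factor lambda, is subtracted from the first before weighting the values. This follows the general formulation of differential attention for language models and is not taken from the DiffCLIP code base. The function names, the shapes, and the fixed scalar `lam` are assumptions made for illustration; in practice lambda is typically learned, and details such as multi-head handling and normalization are omitted here.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def differential_attention(x, w_q, w_k, w_v, lam):
    """Single-head differential attention (illustrative sketch).

    x:   (n, d_model) token or patch embeddings
    w_q: (d_model, 2 * d) query projection, split into two halves
    w_k: (d_model, 2 * d) key projection, split into two halves
    w_v: (d_model, d_v)   value projection
    lam: scalar weight on the subtracted attention map
         (assumed fixed here; usually a learned parameter)
    """
    q = x @ w_q
    k = x @ w_k
    v = x @ w_v

    # Split queries and keys into two groups, one per attention map.
    q1, q2 = np.split(q, 2, axis=-1)
    k1, k2 = np.split(k, 2, axis=-1)
    d = q1.shape[-1]

    # Two ordinary scaled dot-product attention maps.
    a1 = softmax(q1 @ k1.T / np.sqrt(d))
    a2 = softmax(q2 @ k2.T / np.sqrt(d))

    # The difference of the two maps weights the values.
    return (a1 - lam * a2) @ v


# Hypothetical sizes chosen only to make the example runnable.
rng = np.random.default_rng(0)
n, d_model, d = 4, 8, 4
x = rng.normal(size=(n, d_model))
w_q = rng.normal(size=(d_model, 2 * d))
w_k = rng.normal(size=(d_model, 2 * d))
w_v = rng.normal(size=(d_model, 2 * d))

out = differential_attention(x, w_q, w_k, w_v, lam=0.5)
print(out.shape)  # (4, 8)
```

With `lam=0.0` the function reduces to standard scaled dot-product attention over the first query/key group, which makes it easy to see that the only extra parameter in this simplified form is lambda itself.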
