Your Embedding Model Is Smarter Than You Think

Abstract

Multimodal retrieval relies heavily on single-vector retrievers, which compress rich token sequences into a single global representation. While efficient, they discard the fine-grained, local evidence that is critical for dense retrieval tasks. Multi-vector approaches were introduced as a solution, but they require dedicated training, and many ignore the need for a globally summarizing representation. To address this, we introduce SMART, a framework that unlocks the latent multi-vector capabilities of standard single-vector models. We first demonstrate that standard contrastive training on the pooled embedding implicitly shapes the retrieval geometry of the preceding hidden states via gradient flow. By applying late interaction directly over these frozen hidden states at inference time, SMART acts as a plug-and-play upgrade that consistently improves performance across diverse modalities, further improving even state-of-the-art models on MMEB-V2. We also show that simple, lightweight post-training with SMART not only saves time and compute but also brings further improvement on visual document retrieval, allowing a single-vector model to outperform state-of-the-art multi-vector counterparts. Ultimately, SMART offers both a highly efficient inference-time enhancement and a powerful fine-tuning technique for multimodal retrieval. We open-source our code and weights at https://github.com/HanSolo9682/SMART.
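
To make the contrast between pooled scoring and late interaction concrete, here is a minimal sketch. It is not the authors' implementation: the abstract does not state the scoring function, so the code assumes ColBERT-style MaxSim (averaged over query tokens) for late interaction and mean pooling for the single-vector baseline. The function names and the toy arrays, which stand in for a model's frozen per-token hidden states, are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit length (eps guards against zero norm)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def pooled_score(query_hidden, doc_hidden):
    """Single-vector baseline: mean-pool token states, then cosine similarity.

    Mean pooling is an assumption; a real model may use last-token or CLS pooling.
    """
    q = l2_normalize(query_hidden.mean(axis=0))
    d = l2_normalize(doc_hidden.mean(axis=0))
    return float(q @ d)


def maxsim_score(query_hidden, doc_hidden):
    """Late interaction (assumed MaxSim): for each query token, take its best
    cosine match among document tokens, then average over query tokens."""
    q = l2_normalize(query_hidden)   # (num_query_tokens, dim)
    d = l2_normalize(doc_hidden)     # (num_doc_tokens, dim)
    sim = q @ d.T                    # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).mean())


# Toy hidden states: doc_a contains one token that matches the query exactly,
# while doc_b only matches "on average".
query = np.array([[1.0, 0.0, 0.0]])
doc_a = np.eye(3)
doc_b = np.ones((3, 3))

print("pooled:", pooled_score(query, doc_a), pooled_score(query, doc_b))
print("maxsim:", maxsim_score(query, doc_a), maxsim_score(query, doc_b))
```

In this toy case the two documents are indistinguishable after pooling, while the token-level score separates them because doc_a holds a token aligned with the query. This mirrors the local evidence that a single global vector can discard.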