Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective

Abstract

Large language model (LLM)-based embedding models, benefiting from large-scale pre-training and post-training, have begun to surpass BERT- and T5-based models on general-purpose text embedding tasks such as document retrieval. However, a fundamental limitation of LLM embeddings lies in the unidirectional attention used during autoregressive pre-training, which is misaligned with the bidirectional nature of text embedding tasks. To address this, we propose adopting diffusion language models for text embeddings, motivated by their inherently bidirectional architecture and their recent success in matching or surpassing LLMs, especially on reasoning tasks. We present the first systematic study of a diffusion language embedding model, which outperforms the LLM-based embedding model by 20% on long-document retrieval, 8% on reasoning-intensive retrieval, and 2% on instruction-following retrieval, and achieves competitive performance on traditional text embedding benchmarks. Our analysis verifies that bidirectional attention is crucial for encoding global context in long and complex text.
