Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective
May 21, 2025
作者: Siyue Zhang, Yilun Zhao, Liyuan Geng, Arman Cohan, Anh Tuan Luu, Chen Zhao
cs.AI
Abstract
Large language model (LLM)-based embedding models, benefiting from large
scale pre-training and post-training, have begun to surpass BERT and T5-based
models on general-purpose text embedding tasks such as document retrieval.
However, a fundamental limitation of LLM embeddings lies in the unidirectional
attention used during autoregressive pre-training, which misaligns with the
bidirectional nature of text embedding tasks. To this end, we propose adopting
diffusion language models for text embeddings, motivated by their inherent
bidirectional architecture and recent success in matching or surpassing LLMs,
especially on reasoning tasks. We present the first systematic study of the
diffusion language embedding model, which outperforms the LLM-based embedding
model by 20% on long-document retrieval, 8% on reasoning-intensive retrieval,
2% on instruction-following retrieval, and achieves competitive performance on
traditional text embedding benchmarks. Our analysis verifies that bidirectional
attention is crucial for encoding global context in long and complex text.
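The attention mismatch the abstract describes can be illustrated with a minimal sketch (an illustrative assumption, not the authors' implementation): the same toy self-attention pass is pooled into a text embedding once under a causal mask, as in autoregressive LLMs, and once under a full bidirectional mask, as available to diffusion language models. The `attention_pool` helper, tensor shapes, and random token states below are all hypothetical.

```python
# Minimal illustrative sketch (assumption: toy code, not the paper's model).
import torch
import torch.nn.functional as F

def attention_pool(hidden, mask):
    """One self-attention pass over token states, then mean pooling into a single embedding."""
    scores = hidden @ hidden.transpose(-2, -1) / hidden.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    attended = F.softmax(scores, dim=-1) @ hidden
    return attended.mean(dim=1)  # (batch, dim): one embedding per sequence

seq_len, dim = 6, 16
hidden = torch.randn(1, seq_len, dim)  # hypothetical token states for one text

# Causal mask: token i attends only to tokens <= i (autoregressive LLM setting).
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
# Bidirectional mask: every token attends to the full sequence (diffusion LM setting).
bidirectional_mask = torch.ones(seq_len, seq_len).bool()

emb_causal = attention_pool(hidden, causal_mask)
emb_bidirectional = attention_pool(hidden, bidirectional_mask)
print(emb_causal.shape, emb_bidirectional.shape)  # torch.Size([1, 16]) each
```

Under the causal mask, early tokens never attend to later ones, so the pooled embedding under-represents global context; the bidirectional mask removes that restriction, which is the property the abstract argues matters for long and complex text.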