Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective

May 21, 2025
Authors: Siyue Zhang, Yilun Zhao, Liyuan Geng, Arman Cohan, Anh Tuan Luu, Chen Zhao
cs.AI

Abstract
Large language model (LLM)-based embedding models, benefiting from large-scale pre-training and post-training, have begun to surpass BERT- and T5-based models on general-purpose text embedding tasks such as document retrieval. However, a fundamental limitation of LLM embeddings lies in the unidirectional attention used during autoregressive pre-training, which misaligns with the bidirectional nature of text embedding tasks. To this end, we propose adopting diffusion language models for text embeddings, motivated by their inherent bidirectional architecture and their recent success in matching or surpassing LLMs, especially on reasoning tasks. We present the first systematic study of the diffusion language embedding model, which outperforms the LLM-based embedding model by 20% on long-document retrieval, 8% on reasoning-intensive retrieval, and 2% on instruction-following retrieval, and achieves competitive performance on traditional text embedding benchmarks. Our analysis verifies that bidirectional attention is crucial for encoding global context in long and complex text.
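To make the attention argument concrete, below is a minimal, hypothetical PyTorch sketch (not from the paper) contrasting the causal mask used by an autoregressive LLM encoder with the full bidirectional mask available to a diffusion language model when text is encoded for embedding; the helper names `causal_mask`, `bidirectional_mask`, and `masked_attention` are illustrative only.

```python
# Sketch: unidirectional (causal) vs. bidirectional attention over a token sequence.
# Under a causal mask, token i only attends to tokens <= i, so a pooled embedding
# built from these states underuses global context; a bidirectional mask lets every
# token attend to the whole sequence.
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # True where attention is allowed: lower-triangular visibility.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def bidirectional_mask(seq_len: int) -> torch.Tensor:
    # Every token may attend to every other token.
    return torch.ones(seq_len, seq_len, dtype=torch.bool)

def masked_attention(q, k, v, mask):
    # Standard scaled dot-product attention with a boolean visibility mask.
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

seq_len, dim = 6, 8
x = torch.randn(seq_len, dim)  # stand-in for token hidden states

uni = masked_attention(x, x, x, causal_mask(seq_len))        # LLM-style encoding
bi = masked_attention(x, x, x, bidirectional_mask(seq_len))  # diffusion-style encoding

# Mean pooling is one common way to obtain a single text embedding from token states.
print(uni.mean(dim=0).shape, bi.mean(dim=0).shape)
```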
