

Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective

May 21, 2025
Authors: Siyue Zhang, Yilun Zhao, Liyuan Geng, Arman Cohan, Anh Tuan Luu, Chen Zhao
cs.AI

Abstract

Large language model (LLM)-based embedding models, benefiting from large-scale pre-training and post-training, have begun to surpass BERT- and T5-based models on general-purpose text embedding tasks such as document retrieval. However, a fundamental limitation of LLM embeddings lies in the unidirectional attention used during autoregressive pre-training, which misaligns with the bidirectional nature of text embedding tasks. To this end, we propose adopting diffusion language models for text embeddings, motivated by their inherently bidirectional architecture and their recent success in matching or surpassing LLMs, especially on reasoning tasks. We present the first systematic study of a diffusion language embedding model, which outperforms the LLM-based embedding model by 20% on long-document retrieval, 8% on reasoning-intensive retrieval, and 2% on instruction-following retrieval, while achieving competitive performance on traditional text embedding benchmarks. Our analysis verifies that bidirectional attention is crucial for encoding global context in long and complex text.
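The core architectural contrast the abstract draws is between the causal (unidirectional) attention of autoregressive LLMs and the full bidirectional attention available to diffusion language models. The following minimal sketch, in PyTorch, is not the paper's code; it uses a toy single-head attention layer with illustrative shapes to show how the two masking choices differ before hidden states are mean-pooled into a text embedding.

```python
# Minimal sketch (not the paper's implementation): contrasts the causal
# (unidirectional) attention mask of autoregressive LLMs with the full
# (bidirectional) mask of a diffusion LM / encoder, then mean-pools
# contextualized states into one embedding. Sizes are illustrative only.
import torch
import torch.nn.functional as F

def self_attention(x: torch.Tensor, causal: bool) -> torch.Tensor:
    """Single-head self-attention over x of shape (seq_len, dim)."""
    seq_len, dim = x.shape
    q, k, v = x, x, x                        # identity projections for brevity
    scores = q @ k.T / dim ** 0.5            # (seq_len, seq_len)
    if causal:
        # Token i may only attend to tokens j <= i (autoregressive LLM).
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    # With causal=False, every token attends to the whole sequence,
    # so each state can encode global context (diffusion LM / encoder).
    return F.softmax(scores, dim=-1) @ v

def embed(x: torch.Tensor, causal: bool) -> torch.Tensor:
    """Mean-pool contextualized token states into a single embedding vector."""
    h = self_attention(x, causal=causal)
    return h.mean(dim=0)

tokens = torch.randn(6, 16)                          # 6 toy tokens, 16-dim states
unidirectional_emb = embed(tokens, causal=True)      # LLM-style embedding
bidirectional_emb = embed(tokens, causal=False)      # diffusion-LM-style embedding
print(unidirectional_emb.shape, bidirectional_emb.shape)
```

Under the causal mask, early tokens never see later ones, so the pooled vector under-represents information that appears late in a long document; with the bidirectional mask every token's state conditions on the full sequence, which is the property the paper's analysis links to the gains on long and reasoning-intensive retrieval.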
