**LLM2Vec-Gen: Generative Embeddings from Large Language Models**
March 11, 2026
Authors: Parishad BehnamGhader, Vaibhav Adlakha, Fabian David Schmidt, Nicolas Chapados, Marius Mosbach, Siva Reddy
cs.AI
Abstract
LLM-based text embedders typically encode the semantic content of their input. However, embedding tasks require mapping diverse inputs to similar outputs. This input-output gap is typically addressed by training embedding models on paired data with contrastive learning. In this work, we propose a novel self-supervised approach, LLM2Vec-Gen, which adopts a different paradigm: rather than encoding the input, we learn to represent the model's potential response. Specifically, we add trainable special tokens to the LLM's vocabulary, append them to the input, and optimize them to represent the LLM's response as a fixed-length sequence. Training is guided by the LLM's own completion of the query, along with an unsupervised embedding teacher that provides distillation targets. This formulation helps bridge the input-output gap and transfers LLM capabilities such as safety alignment and reasoning to embedding tasks. Crucially, the LLM backbone remains frozen, and training requires only unlabeled queries. LLM2Vec-Gen achieves state-of-the-art self-supervised performance on the Massive Text Embedding Benchmark (MTEB), improving by 9.3% over the best unsupervised embedding teacher. We also observe up to a 43.2% reduction in harmful content retrieval and a 29.3% improvement in reasoning capabilities on embedding tasks. Finally, the learned embeddings are interpretable and can be decoded into text to reveal their semantic content.
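To make the mechanism concrete, the following is a minimal PyTorch sketch of the idea as the abstract describes it, not the paper's implementation: all class and function names here (`FrozenBackbone`, `Llm2VecGenSketch`, `distill_loss`, the choice of `k` special tokens and mean pooling) are our own illustrative assumptions. A frozen backbone processes the input plus `k` trainable special tokens appended at the end; the hidden states at those `k` positions are pooled into a fixed-length embedding, which is trained to match a teacher embedding via a cosine distillation loss.

```python
import torch
import torch.nn as nn


class FrozenBackbone(nn.Module):
    """Stand-in for a frozen LLM: an embedding table plus a tiny Transformer."""

    def __init__(self, vocab_size: int, d_model: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        # The backbone stays frozen, as in the paper; only the new tokens train.
        for p in self.parameters():
            p.requires_grad = False


class Llm2VecGenSketch(nn.Module):
    """Appends k trainable special tokens and pools their hidden states."""

    def __init__(self, backbone: FrozenBackbone, num_special: int = 4):
        super().__init__()
        self.backbone = backbone
        d = backbone.embed.embedding_dim
        # The only trainable parameters: embeddings of the appended tokens.
        self.special = nn.Parameter(torch.randn(num_special, d) * 0.02)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        b = input_ids.size(0)
        tok = self.backbone.embed(input_ids)                       # (B, T, D) frozen
        spec = self.special.unsqueeze(0).expand(b, -1, -1)         # (B, k, D) trainable
        h = self.backbone.encoder(torch.cat([tok, spec], dim=1))   # (B, T+k, D)
        # Fixed-length representation: pooled states at the k special positions.
        return h[:, -self.special.size(0):, :].mean(dim=1)         # (B, D)


def distill_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
    """Cosine-distance distillation toward the teacher's embedding."""
    return (1 - nn.functional.cosine_similarity(student_emb, teacher_emb, dim=-1)).mean()
```

In a real setup the teacher embedding would come from an unsupervised embedding model and the training signal would also involve the LLM's own completion of the query; here a random tensor stands in for the teacher target. Gradients flow through the frozen encoder back to `self.special` only, so a backward pass updates nothing in the backbone.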