

Llama-Embed-Nemotron-8B: A Universal Text Embedding Model for Multilingual and Cross-Lingual Tasks

November 10, 2025
Authors: Yauhen Babakhin, Radek Osmulski, Ronay Ak, Gabriel Moreira, Mengyao Xu, Benedikt Schifferer, Bo Liu, Even Oldridge
cs.AI

Abstract

We introduce llama-embed-nemotron-8b, an open-weights text embedding model that achieves state-of-the-art performance on the Multilingual Massive Text Embedding Benchmark (MMTEB) leaderboard as of October 21, 2025. While recent models show strong performance, their training data or methodologies are often not fully disclosed. We aim to address this by developing a fully open-source model, publicly releasing its weights and detailed ablation studies, and planning to share the curated training datasets. Our model demonstrates superior performance across all major embedding tasks -- including retrieval, classification, and semantic textual similarity (STS) -- and excels in challenging multilingual scenarios, such as low-resource languages and cross-lingual setups. This state-of-the-art performance is driven by a novel data mix of 16.1 million query-document pairs, split between 7.7 million samples from public datasets and 8.4 million synthetically generated examples from various open-weight LLMs. One of our key contributions is a detailed ablation study analyzing core design choices, including a comparison of contrastive loss implementations, an evaluation of synthetic data generation (SDG) strategies, and the impact of model merging. llama-embed-nemotron-8b is an instruction-aware model, supporting user-defined instructions to enhance performance for specific use cases. This combination of top-tier performance, broad applicability, and user-driven flexibility enables it to serve as a universal text embedding solution.
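The abstract compares contrastive loss implementations without giving a formula. For orientation, a common baseline that such comparisons typically include is the in-batch InfoNCE loss, where each query's paired document is the positive and the other documents in the batch serve as negatives. The sketch below is illustrative only and is not claimed to be the paper's exact implementation; the temperature value is an assumption.

```python
import numpy as np

def info_nce_loss(query_emb, doc_emb, temperature=0.05):
    """In-batch InfoNCE contrastive loss (illustrative sketch).

    query_emb, doc_emb: (batch, dim) arrays where row i of doc_emb is the
    positive document for row i of query_emb; all other rows in the batch
    act as in-batch negatives. Temperature 0.05 is a typical, assumed value.
    """
    # L2-normalize so the dot product is cosine similarity
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature           # (batch, batch) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # positives sit on the diagonal; minimize their negative log-likelihood
    return -np.mean(np.diag(log_probs))
```

With perfectly aligned query/document pairs the loss approaches zero, while mismatched pairs drive it up, which is the gradient signal that pulls paired embeddings together and pushes in-batch negatives apart.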
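The abstract also ablates the impact of model merging without describing the method. A minimal sketch of the most common approach, (optionally weighted) averaging of checkpoint parameters, is shown below; this is a generic illustration, not the paper's exact procedure, and uses plain floats where real checkpoints would hold tensors of identical shape.

```python
def merge_models(state_dicts, weights=None):
    """Merge checkpoints by weighted parameter averaging (illustrative sketch).

    state_dicts: list of dicts mapping parameter name -> value; all dicts
    must share the same keys. weights defaults to a uniform average.
    """
    n = len(state_dicts)
    if weights is None:
        weights = [1.0 / n] * n
    assert abs(sum(weights) - 1.0) < 1e-9, "merge weights should sum to 1"
    return {
        name: sum(w * sd[name] for w, sd in zip(weights, state_dicts))
        for name in state_dicts[0]
    }
```

In practice merging is applied to fine-tuned variants of the same base model, where the averaged weights often retain the strengths of each individual run.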