Llama-Embed-Nemotron-8B: A Universal Text Embedding Model for Multilingual and Cross-Lingual Tasks

November 10, 2025
Authors: Yauhen Babakhin, Radek Osmulski, Ronay Ak, Gabriel Moreira, Mengyao Xu, Benedikt Schifferer, Bo Liu, Even Oldridge
cs.AI

Abstract

We introduce llama-embed-nemotron-8b, an open-weights text embedding model that achieves state-of-the-art performance on the Multilingual Massive Text Embedding Benchmark (MMTEB) leaderboard as of October 21, 2025. While recent models show strong performance, their training data or methodologies are often not fully disclosed. We aim to address this by developing a fully open-source model, publicly releasing its weights and detailed ablation studies, and planning to share the curated training datasets. Our model demonstrates superior performance across all major embedding tasks, including retrieval, classification, and semantic textual similarity (STS), and excels in challenging multilingual scenarios such as low-resource languages and cross-lingual setups. This state-of-the-art performance is driven by a novel data mix of 16.1 million query-document pairs, split between 7.7 million samples from public datasets and 8.4 million synthetically generated examples from various open-weight LLMs. One of our key contributions is a detailed ablation study analyzing core design choices, including a comparison of contrastive loss implementations, an evaluation of synthetic data generation (SDG) strategies, and the impact of model merging. llama-embed-nemotron-8b is an instruction-aware model that supports user-defined instructions to enhance performance for specific use cases. This combination of top-tier performance, broad applicability, and user-driven flexibility enables it to serve as a universal text embedding solution.
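
The abstract mentions a comparison of contrastive loss implementations but does not say which variants were compared. For orientation only, the following is a minimal sketch of one widely used formulation, InfoNCE with in-batch negatives, applied to query-document pairs. It is not the paper's implementation: the function names, the temperature value, and the toy 2-d vectors are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(queries, documents, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives (illustrative only).

    queries[i] and documents[i] form a positive pair; every other
    document in the batch acts as a negative for query i.
    The temperature default is an assumed value, not taken from the paper.
    """
    losses = []
    for i, q in enumerate(queries):
        logits = [cosine(q, d) / temperature for d in documents]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy against the index of the positive document.
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Toy "embeddings" (hypothetical data): two queries, two documents.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each query is closest to its own document
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each query is closest to the wrong document

loss_aligned = info_nce_loss(queries, aligned)
loss_swapped = info_nce_loss(queries, swapped)
```

In this toy setup the loss is near zero when each query is most similar to its paired document and large when the pairing is swapped, which is the behavior a contrastive objective of this kind is meant to reward.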