SemBridge: Language Transfer for Sparse Encoders via a Multilingual Semantic Bridge

Abstract

Sparse encoders achieve high-precision retrieval by representing term importance over a vocabulary space, but their English-centric structure is a critical obstacle to transferring them to non-English languages. To overcome this structural limitation, we propose SemBridge, a novel embedding initialization method that adapts sparse encoders across languages by leveraging multilingual bridge models. SemBridge uses multilingual dense embeddings as a bridge to establish semantic alignments between the source and target vocabularies. Rather than relying on all source tokens, it selects a small set of semantically related source-language tokens to initialize each target-language token, which filters out semantic noise and reconstructs each target token as a precise linear combination of its core synonyms. This accelerates convergence during fine-tuning and improves training efficiency. Extensive experiments across five languages and four sparse architectures show that, compared to existing baselines, SemBridge achieves superior zero-shot retrieval performance and consistently improves retrieval performance after fine-tuning. These results validate SemBridge as a practical solution for deploying high-performance sparse retrieval systems in diverse linguistic environments.
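
The initialization idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how source tokens are selected or weighted, so the sketch assumes top-k cosine similarity in the bridge space and softmax weights over those k neighbors. The function name `init_target_embeddings` and the parameters `k` and `temperature` are hypothetical.

```python
import numpy as np


def init_target_embeddings(bridge_src, bridge_tgt, src_emb, k=5, temperature=0.1):
    """Initialize target-token embeddings from a few related source tokens.

    bridge_src: (n_src, d_bridge) bridge-model embeddings of source tokens
    bridge_tgt: (n_tgt, d_bridge) bridge-model embeddings of target tokens
    src_emb:    (n_src, d_model)  sparse-encoder embeddings of source tokens
    Returns an (n_tgt, d_model) array of initial target embeddings.

    Assumption (not from the paper): neighbors are chosen by cosine
    similarity and weighted by a softmax over the top-k similarities.
    """
    # Cosine similarity between every target and source token in bridge space.
    s = bridge_src / np.linalg.norm(bridge_src, axis=1, keepdims=True)
    t = bridge_tgt / np.linalg.norm(bridge_tgt, axis=1, keepdims=True)
    sim = t @ s.T  # (n_tgt, n_src)

    # Keep only the k most similar source tokens per target token.
    k = min(k, s.shape[0])
    top_idx = np.argpartition(-sim, k - 1, axis=1)[:, :k]
    top_sim = np.take_along_axis(sim, top_idx, axis=1)

    # Softmax over the selected neighbors gives non-negative weights summing to 1.
    logits = top_sim / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Each target embedding is a linear combination of its k source embeddings.
    return np.einsum("tk,tkd->td", weights, src_emb[top_idx])
```

Restricting the combination to k neighbors is what discards the contribution of unrelated source tokens; with `k=1` each target token simply copies the embedding of its nearest source token.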