SemBridge: Language Transfer in Sparse Encoders via Multilingual Semantic Bridges
May 25, 2026
Authors: Seongtae Hong, Youngjoon Jang, Jia-Heui Ju, Hyeonseok Moon, Heuiseok Lim
cs.AI
Abstract
Sparse encoders achieve high-precision retrieval by representing term importance over a vocabulary space, but their English-centric structure is a critical obstacle to transferring them to non-English languages. To overcome this structural limitation, we propose SemBridge, a novel embedding initialization method that adapts sparse encoders across languages by leveraging multilingual bridge models. SemBridge uses multilingual dense embeddings as a bridge to establish semantic alignments between the source and target vocabularies. Rather than relying on all source tokens, it selects a small set of semantically related source-language tokens to initialize each target-language token, which filters out semantic noise and reconstructs each target token as a precise linear combination of core synonyms. This accelerates convergence during fine-tuning and improves training efficiency. Extensive experiments across five languages and four sparse architectures show that, compared with existing baselines, SemBridge achieves superior zero-shot retrieval performance and consistently improves retrieval performance after fine-tuning. These results validate SemBridge as a practical solution for deploying high-performance sparse retrieval systems in diverse linguistic environments.
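The initialization idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the source tokens are selected or how the combination weights are computed, so the top-k cosine-similarity selection, the softmax weighting, and the names `init_target_embeddings`, `k`, and `temperature` are all assumptions made for illustration. The sketch assumes every source and target token already has a vector from some multilingual dense "bridge" model.

```python
import numpy as np


def init_target_embeddings(bridge_src, bridge_tgt, src_emb, k=3, temperature=0.1):
    """Initialize target-token embeddings from a few related source tokens.

    bridge_src: (S, b) bridge-model vectors for the source vocabulary
    bridge_tgt: (T, b) bridge-model vectors for the target vocabulary
    src_emb:    (S, d) embeddings of the source (English) sparse encoder
    Returns (tgt_emb, top_idx, weights) with shapes (T, d), (T, k), (T, k).
    """
    # Cosine similarity between every target token and every source token.
    src_n = bridge_src / np.linalg.norm(bridge_src, axis=1, keepdims=True)
    tgt_n = bridge_tgt / np.linalg.norm(bridge_tgt, axis=1, keepdims=True)
    sims = tgt_n @ src_n.T  # (T, S)

    # Keep only the k most similar source tokens per target token
    # (assumed selection rule; all other source tokens get zero weight).
    top_idx = np.argpartition(-sims, k - 1, axis=1)[:, :k]  # (T, k)
    top_sims = np.take_along_axis(sims, top_idx, axis=1)    # (T, k)

    # Assumed weighting: softmax over the selected similarities.
    logits = top_sims / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Each target embedding is a linear combination of k source embeddings.
    tgt_emb = np.einsum("tk,tkd->td", weights, src_emb[top_idx])
    return tgt_emb, top_idx, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    bridge_src = rng.normal(size=(50, 16))  # toy stand-in for bridge vectors
    bridge_tgt = rng.normal(size=(8, 16))
    src_emb = rng.normal(size=(50, 32))     # toy stand-in for encoder embeddings
    tgt_emb, top_idx, weights = init_target_embeddings(bridge_src, bridge_tgt, src_emb)
    print(tgt_emb.shape, top_idx.shape, weights.shape)
```

Restricting each target token to k source neighbours is what distinguishes this from a dense combination over the whole source vocabulary; a smaller `temperature` concentrates the weight on the single closest neighbour.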