SemBridge: Language Transfer for Sparse Encoders via a Multilingual Semantic Bridge

Abstract

Sparse encoders achieve high-precision retrieval by representing term importance over a vocabulary space, yet their English-centric structure is a critical obstacle to transferring them to non-English languages. To overcome this structural limitation, we propose SemBridge, a novel embedding initialization method for cross-lingual adaptation of sparse encoders. SemBridge uses the dense embeddings of a multilingual bridge model to establish semantic alignments between the source and target vocabularies. Rather than relying on all source tokens, it selects a small set of semantically related source-language tokens to initialize each target-language token, which filters out semantic noise and reconstructs each target token as a precise linear combination of its core synonyms. This initialization accelerates convergence during fine-tuning and improves training efficiency. Extensive experiments across five languages and four sparse architectures show that, compared to existing baselines, SemBridge achieves superior zero-shot retrieval performance and consistently improves retrieval performance after fine-tuning. These results validate SemBridge as a practical solution for deploying high-performance sparse retrieval systems in diverse linguistic environments.
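
The initialization idea described above can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, the number of selected tokens, or how the combination weights are computed, so the cosine similarity, the parameter `k`, the softmax weighting with `temperature`, and the function names below are all assumptions made for illustration, not the authors' implementation. Plain Python lists stand in for real embedding matrices.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def init_target_embedding(target_bridge_vec, source_bridge_vecs,
                          source_encoder_embs, k=2, temperature=0.1):
    """Initialize one target token from its k most similar source tokens.

    target_bridge_vec:   bridge-model vector of the target token
    source_bridge_vecs:  bridge-model vectors of all source tokens
    source_encoder_embs: sparse-encoder embeddings of the same source tokens
    k, temperature:      hypothetical hyperparameters (assumed, not from the paper)
    """
    sims = [cosine(target_bridge_vec, s) for s in source_bridge_vecs]
    # Keep only the k nearest source tokens; all others are treated as noise.
    top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    # Softmax over the selected similarities (assumed weighting scheme).
    max_sim = max(sims[i] for i in top)
    exps = [math.exp((sims[i] - max_sim) / temperature) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The target embedding is a weighted sum of the selected source embeddings.
    dim = len(source_encoder_embs[0])
    emb = [sum(w * source_encoder_embs[i][d] for w, i in zip(weights, top))
           for d in range(dim)]
    return emb, top, weights

# Toy data: three source tokens, 2-d bridge space, 3-d encoder space.
source_bridge = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
source_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
target_bridge = [1.0, 0.05]

emb, top, weights = init_target_embedding(target_bridge, source_bridge, source_embs)
print(top, weights, emb)
```

In this toy example the third source token is far from the target in the bridge space, so it is excluded and contributes nothing to the initialized embedding. A real setup would operate on the full vocabulary matrices of the bridge model and the sparse encoder.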