SemBridge: Language Transfer in Sparse Encoders Using Multilingual Semantic Bridges

Abstract

Sparse encoders achieve high-precision retrieval by representing term importance over a vocabulary space, but their English-centric structure is a major obstacle to transferring them to non-English languages. To overcome this structural limitation, we propose SemBridge, an embedding initialization method for cross-lingual adaptation of sparse encoders that leverages multilingual bridge models. SemBridge uses multilingual dense embeddings as a bridge to establish semantic alignments between the source and target vocabularies. Rather than relying on all source tokens, it selects a small set of semantically related source-language tokens and uses them to initialize each target-language token. This filters out semantic noise and reconstructs each target token as a linear combination of its core synonyms, which accelerates convergence during fine-tuning and improves training efficiency. Extensive experiments across five languages and four sparse architectures show that SemBridge achieves superior zero-shot retrieval performance and consistently improves retrieval performance after fine-tuning compared with existing baselines. These results support SemBridge as a practical solution for deploying high-performance sparse retrieval systems in diverse linguistic environments.
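
The initialization idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `init_target_embeddings`, the use of cosine similarity in the bridge space, the softmax weighting, and the parameters `k` and `temperature` are all assumptions made for illustration, since the abstract states only that a small set of related source tokens is selected and linearly combined.

```python
import numpy as np


def init_target_embeddings(bridge_src, bridge_tgt, src_emb, k=3, temperature=0.1):
    """Initialize each target-token embedding from its k nearest source tokens.

    bridge_src: (S, d_b) bridge-model embeddings of the source-vocabulary tokens
    bridge_tgt: (T, d_b) bridge-model embeddings of the target-vocabulary tokens
    src_emb:    (S, d)   token embeddings of the source sparse encoder
    Returns a (T, d) array of initial target-token embeddings.

    Assumption: cosine similarity plus softmax weights are illustrative
    choices; the source text does not specify the weighting scheme.
    """
    # Cosine similarity between every target token and every source token.
    src_n = bridge_src / np.linalg.norm(bridge_src, axis=1, keepdims=True)
    tgt_n = bridge_tgt / np.linalg.norm(bridge_tgt, axis=1, keepdims=True)
    sim = tgt_n @ src_n.T  # shape (T, S)

    # Keep only the k most similar source tokens per target token.
    top = np.argpartition(-sim, k - 1, axis=1)[:, :k]  # shape (T, k)
    top_sim = np.take_along_axis(sim, top, axis=1)     # shape (T, k)

    # Turn the selected similarities into weights that sum to 1 per row.
    logits = top_sim / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)

    # Each target embedding is a weighted sum of its selected source embeddings.
    return np.einsum("tk,tkd->td", weights, src_emb[top])
```

Restricting the combination to `k` neighbours is what distinguishes this from averaging over the whole source vocabulary: weakly related tokens contribute nothing at all, rather than a small amount of noise each. With `k=1` the sketch reduces to copying the embedding of the single nearest source token.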