Dual-Alignment Pre-training for Cross-lingual Sentence Embedding

May 16, 2023
Authors: Ziheng Li, Shaohan Huang, Zihan Zhang, Zhi-Hong Deng, Qiang Lou, Haizhen Huang, Jian Jiao, Furu Wei, Weiwei Deng, Qi Zhang
cs.AI

Abstract

Recent studies have shown that dual encoder models trained with the sentence-level translation ranking task are effective methods for cross-lingual sentence embedding. However, our research indicates that token-level alignment is also crucial in multilingual scenarios, which has not been fully explored previously. Based on our findings, we propose a dual-alignment pre-training (DAP) framework for cross-lingual sentence embedding that incorporates both sentence-level and token-level alignment. To achieve this, we introduce a novel representation translation learning (RTL) task, where the model learns to use one-side contextualized token representation to reconstruct its translation counterpart. This reconstruction objective encourages the model to embed translation information into the token representation. Compared to other token-level alignment methods such as translation language modeling, RTL is more suitable for dual encoder architectures and is computationally efficient. Extensive experiments on three sentence-level cross-lingual benchmarks demonstrate that our approach can significantly improve sentence embedding. Our code is available at https://github.com/ChillingDream/DAP.
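To make the two training signals concrete, below is a minimal PyTorch sketch of how a sentence-level translation ranking loss and an RTL-style token-level reconstruction loss might be combined on top of a shared dual encoder. The encoder interface, the single-layer RTL head, mean pooling, the temperature, and the equal loss weighting are all illustrative assumptions rather than the paper's exact implementation; the authors' code in the repository linked above is authoritative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DAPSketch(nn.Module):
    """Illustrative sketch of dual-alignment pre-training (DAP).

    Assumes `encoder` is a shared multilingual encoder that, when called on a
    batch of token ids, returns contextualized token representations of shape
    (batch, seq_len, hidden). Padding masks are omitted for brevity, and source
    and target sequences are assumed padded to the same length.
    """

    def __init__(self, encoder, hidden_size, vocab_size):
        super().__init__()
        self.encoder = encoder
        # Hypothetical RTL head: one transformer layer over the source-side token
        # representations, followed by a projection onto the target vocabulary.
        self.rtl_layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=8, batch_first=True
        )
        self.rtl_proj = nn.Linear(hidden_size, vocab_size)

    def forward(self, src_ids, tgt_ids, temperature=0.05):
        src = self.encoder(src_ids)  # (B, L, H) source token representations
        tgt = self.encoder(tgt_ids)  # (B, L, H) target token representations

        # Sentence-level alignment: translation ranking with in-batch negatives
        # over mean-pooled, L2-normalized sentence embeddings.
        src_sent = F.normalize(src.mean(dim=1), dim=-1)  # (B, H)
        tgt_sent = F.normalize(tgt.mean(dim=1), dim=-1)  # (B, H)
        logits = src_sent @ tgt_sent.t() / temperature   # (B, B) similarities
        labels = torch.arange(logits.size(0), device=logits.device)
        ranking_loss = F.cross_entropy(logits, labels)

        # Token-level alignment (RTL): reconstruct the translation's tokens from
        # the one-side (source) contextualized representations alone, pushing
        # translation information into the token embeddings.
        recon = self.rtl_proj(self.rtl_layer(src))       # (B, L, V)
        rtl_loss = F.cross_entropy(
            recon.reshape(-1, recon.size(-1)), tgt_ids.reshape(-1)
        )

        # Equal weighting of the two objectives is an assumption of this sketch.
        return ranking_loss + rtl_loss
```

Because the RTL head reads only the already-computed source representations, this token-level objective adds little cost on top of the dual encoder's forward passes, which is consistent with the abstract's claim that RTL is cheaper than alternatives like translation language modeling that require a joint forward pass over concatenated sentence pairs.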