Dual-Alignment Pre-training for Cross-lingual Sentence Embedding

Abstract

Recent studies have shown that dual encoder models trained with the sentence-level translation ranking task are effective for cross-lingual sentence embedding. However, our research indicates that token-level alignment is also crucial in multilingual scenarios, and that it has not been fully explored previously. Based on these findings, we propose a dual-alignment pre-training (DAP) framework for cross-lingual sentence embedding that incorporates both sentence-level and token-level alignment. To achieve this, we introduce a novel representation translation learning (RTL) task, in which the model learns to use the contextualized token representations of one side to reconstruct its translation counterpart. This reconstruction objective encourages the model to embed translation information into the token representations. Compared to other token-level alignment methods such as translation language modeling, RTL is better suited to dual encoder architectures and is computationally efficient. Extensive experiments on three sentence-level cross-lingual benchmarks demonstrate that our approach can significantly improve sentence embeddings. Our code is available at https://github.com/ChillingDream/DAP.
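
To make the sentence-level translation ranking task concrete, the following is a minimal sketch of one common formulation: a bidirectional softmax over in-batch negatives. It is not taken from the DAP repository. The function names, the cosine similarity, and the `scale` value are all assumptions made for illustration, and the RTL objective is not shown because the abstract does not specify its form.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def translation_ranking_loss(src, tgt, scale=20.0):
    """Hypothetical in-batch translation ranking loss.

    src, tgt: arrays of shape (batch, dim); row i of tgt is assumed to be
    the translation of row i of src. Every other row in the batch serves
    as a negative. `scale` is an assumed similarity multiplier.
    """
    src = l2_normalize(src)
    tgt = l2_normalize(tgt)
    logits = scale * (src @ tgt.T)            # (batch, batch) similarity matrix
    idx = np.arange(len(src))
    # Source-to-target: each source sentence should rank its own translation first.
    forward = -log_softmax(logits)[idx, idx].mean()
    # Target-to-source: the same objective in the opposite direction.
    backward = -log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (forward + backward)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    src = rng.normal(size=(4, 8))
    tgt = src + 0.01 * rng.normal(size=(4, 8))   # near-perfect "translations"
    print("aligned pairs :", translation_ranking_loss(src, tgt))
    print("shuffled pairs:", translation_ranking_loss(src, tgt[::-1]))
```

Under these assumptions, correctly paired embeddings yield a much lower loss than mismatched pairs. This is the sentence-level signal that DAP combines with a token-level objective.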
