Spatial Speech Translation: Translating Across Space With Binaural Hearables

Abstract

Imagine being in a crowded space where people speak a different language, and having hearables that transform the auditory space into your native language while preserving the spatial cues for all speakers. We introduce spatial speech translation, a novel concept for hearables that translate the speech of speakers in the wearer's environment while maintaining the direction and unique voice characteristics of each speaker in the binaural output. To achieve this, we tackle several technical challenges spanning blind source separation, localization, real-time expressive translation, and binaural rendering that preserves the speaker directions in the translated audio, all while achieving real-time inference on Apple M2 silicon. Our proof-of-concept evaluation with a prototype binaural headset shows that, whereas existing models fail in the presence of interference, our system achieves a BLEU score of up to 22.01 when translating between languages, despite strong interference from other speakers in the environment. User studies further confirm the system's effectiveness in spatially rendering the translated speech in previously unseen real-world reverberant environments. Taking a step back, this work marks the first step towards integrating spatial perception into speech translation.