Spatial Speech Translation: Translating Across Space With Binaural Hearables

Abstract

Imagine being in a crowded space where people speak a different language and having hearables that transform the auditory space into your native language while preserving the spatial cues for all speakers. We introduce spatial speech translation, a novel concept for hearables that translate speakers in the wearer's environment while maintaining the direction and unique voice characteristics of each speaker in the binaural output. To achieve this, we tackle several technical challenges spanning blind source separation, localization, real-time expressive translation, and binaural rendering to preserve the speaker directions in the translated audio, while achieving real-time inference on Apple M2 silicon. Our proof-of-concept evaluation with a prototype binaural headset shows that, unlike existing models, which fail in the presence of interference, our system achieves a BLEU score of up to 22.01 when translating between languages, despite strong interference from other speakers in the environment. User studies further confirm the system's effectiveness in spatially rendering the translated speech in previously unseen real-world reverberant environments. More broadly, this work marks the first step towards integrating spatial perception into speech translation.