3D-Speaker: A Large-Scale Multi-Device, Multi-Distance, and Multi-Dialect Corpus for Speech Representation Disentanglement

Abstract

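Separating unrelated information within speech utterances is an important research topic in the speech research community. Speech-related tasks each aim to extract a different speech representation while minimizing the influence of other, unrelated information. This paper introduces a large-scale speech corpus intended to facilitate research on speech representation disentanglement. 3D-Speaker contains more than 10,000 speakers; each speaker is recorded simultaneously by multiple Devices positioned at different Distances, and some speakers use multiple Dialects. These controlled combinations of multi-dimensional audio data produce a matrix of varied speech representation entanglements, which motivates new methods for resolving them. The multi-domain nature of 3D-Speaker also makes it a suitable resource for evaluating large universal speech models and for experimenting with out-of-domain learning and self-supervised learning methods. https://3dspeaker.github.io/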

English

Disentangling uncorrelated information in speech utterances is a crucial research topic within the speech community. Different speech-related tasks focus on extracting distinct speech representations while minimizing the effects of other uncorrelated information. We present a large-scale speech corpus to facilitate research on speech representation disentanglement. 3D-Speaker contains over 10,000 speakers, each of whom is simultaneously recorded by multiple Devices located at different Distances, and some speakers speak multiple Dialects. The controlled combinations of multi-dimensional audio data yield a matrix of diverse speech representation entanglements, thereby motivating intriguing methods to untangle them. The multi-domain nature of 3D-Speaker also makes it a suitable resource for evaluating large universal speech models and for experimenting with methods of out-of-domain learning and self-supervised learning. https://3dspeaker.github.io/
