SonicSim: A customizable simulation platform for speech processing in moving sound source scenarios
October 2, 2024
Authors: Kai Li, Wendi Sang, Chang Zeng, Runxuan Yang, Guo Chen, Xiaolin Hu
cs.AI
Abstract
The systematic evaluation of speech separation and enhancement models under
moving sound source conditions typically requires extensive data comprising
diverse scenarios. However, real-world datasets often contain insufficient data
to meet the training and evaluation requirements of models. Although synthetic
datasets offer a larger volume of data, their acoustic simulations lack
realism. Consequently, neither real-world nor synthetic datasets effectively
fulfill practical needs. To address these issues, we introduce SonicSim, a
synthetic toolkit designed to generate highly customizable data for moving
sound sources. SonicSim is developed on the embodied AI simulation platform
Habitat-sim and supports adjustments at three levels (scene, microphone, and
source), thereby generating more diverse synthetic data. Leveraging SonicSim,
we constructed a moving sound source benchmark dataset, SonicSet, using
LibriSpeech, the Freesound Dataset 50k (FSD50K), and the Free Music Archive
(FMA), together with 90 scenes from Matterport3D, to evaluate speech
separation and enhancement models.
Additionally, to quantify the differences between synthetic and real-world
data, we randomly selected 5 hours of reverberation-free raw audio from the
SonicSet validation set and recorded a real-world speech separation dataset,
which was then compared with the corresponding synthetic data. Similarly, we
used the real-world speech enhancement dataset RealMAN to assess the
acoustic gap between other synthetic datasets and SonicSet for
speech enhancement. The results indicate that the synthetic data generated by
SonicSim can effectively generalize to real-world scenarios. Demo and code are
publicly available at https://cslikai.cn/SonicSim/.
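The multi-level adjustments described above (scene-level, microphone-level, and source-level) amount to a layered configuration plus a motion path for each moving source. The minimal Python sketch below shows one way such a configuration could be organized; every class, field, and path here, as well as the trajectory resampler, is an illustrative assumption, not the actual SonicSim or Habitat-sim API.

```python
# Illustrative sketch only: these classes and names are assumptions made for
# exposition and are NOT the actual SonicSim or Habitat-sim API.
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]  # (x, y, z) position in metres

@dataclass
class SceneConfig:
    """Scene-level: which 3D environment (e.g. a Matterport3D room) to load."""
    scene_id: str

@dataclass
class MicrophoneConfig:
    """Microphone-level: receiver positions, e.g. a fixed two-mic array."""
    positions: List[Vec3]

@dataclass
class SourceConfig:
    """Source-level: a source signal plus the waypoints of its motion path."""
    audio_path: str
    waypoints: List[Vec3]  # at least two waypoints define a moving source

def resample_trajectory(waypoints: List[Vec3], n_steps: int) -> List[Vec3]:
    """Linearly resample a waypoint path into n_steps positions (one per
    simulation frame), so the source moves smoothly through the scene."""
    segments = len(waypoints) - 1
    out: List[Vec3] = []
    for i in range(n_steps):
        t = i / max(n_steps - 1, 1) * segments  # progress along the full path
        k = min(int(t), segments - 1)           # index of the current segment
        u = t - k                               # fraction within that segment
        a, b = waypoints[k], waypoints[k + 1]
        out.append((a[0] + u * (b[0] - a[0]),
                    a[1] + u * (b[1] - a[1]),
                    a[2] + u * (b[2] - a[2])))
    return out

# Example: a speech source walking past a two-microphone receiver.
scene = SceneConfig(scene_id="example_matterport3d_scene")  # hypothetical id
mics = MicrophoneConfig(positions=[(0.0, 1.5, 0.0), (0.2, 1.5, 0.0)])
speech = SourceConfig(
    audio_path="librispeech/example_utterance.flac",        # hypothetical path
    waypoints=[(-3.0, 1.6, 2.0), (0.0, 1.6, 0.5), (3.0, 1.6, 2.0)],
)
positions = resample_trajectory(speech.waypoints, n_steps=100)
print(len(positions), positions[0], positions[-1])
```

Separating the three levels this way makes it cheap to vary one factor at a time (for example, sweeping microphone layouts within a fixed scene), which is presumably how a toolkit of this kind produces the diverse conditions the abstract describes.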