

MOSPA: Human Motion Generation Driven by Spatial Audio

July 16, 2025
作者: Shuyang Xu, Zhiyang Dou, Mingyi Shi, Liang Pan, Leo Ho, Jingbo Wang, Yuan Liu, Cheng Lin, Yuexin Ma, Wenping Wang, Taku Komura
cs.AI

Abstract

Enabling virtual humans to dynamically and realistically respond to diverse auditory stimuli remains a key challenge in character animation, demanding the integration of perceptual modeling and motion synthesis. Despite its significance, this task remains largely unexplored. Previous works have primarily focused on mapping modalities such as speech, audio, and music to human motion; however, these models typically overlook the impact of the spatial features encoded in spatial audio signals on human motion. To bridge this gap and enable high-quality modeling of human movements in response to spatial audio, we introduce the first comprehensive Spatial Audio-Driven Human Motion (SAM) dataset, which contains diverse, high-quality spatial audio and motion data. For benchmarking, we develop a simple yet effective diffusion-based generative framework for human MOtion generation driven by SPatial Audio, termed MOSPA, which faithfully captures the relationship between body motion and spatial audio through an effective fusion mechanism. Once trained, MOSPA can generate diverse, realistic human motions conditioned on varying spatial audio inputs. We perform a thorough investigation of the proposed dataset and conduct extensive benchmarking experiments, where our method achieves state-of-the-art performance on this task. Our model and dataset will be open-sourced upon acceptance. Please refer to our supplementary video for more details.
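The abstract describes MOSPA only at a high level (a diffusion-based motion generator conditioned on spatial audio through a fusion mechanism). Below is a minimal, hypothetical sketch of that general idea: a transformer denoiser that predicts clean motion from noised motion, with per-frame spatial-audio features fused into the motion tokens. All module names, feature dimensions (263-D motion, 128-D audio), and the additive fusion choice are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch (NOT the authors' implementation) of a diffusion denoiser for
# human motion conditioned on spatial-audio features via additive fusion.
import torch
import torch.nn as nn

class SpatialAudioMotionDenoiser(nn.Module):
    def __init__(self, motion_dim=263, audio_dim=128, latent_dim=256, n_layers=4):
        super().__init__()
        self.motion_proj = nn.Linear(motion_dim, latent_dim)
        # audio_dim is an assumed size for per-frame spatial audio features
        # (e.g., binaural energy / direction cues)
        self.audio_proj = nn.Linear(audio_dim, latent_dim)
        self.time_embed = nn.Sequential(
            nn.Linear(1, latent_dim), nn.SiLU(), nn.Linear(latent_dim, latent_dim)
        )
        layer = nn.TransformerEncoderLayer(d_model=latent_dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out = nn.Linear(latent_dim, motion_dim)

    def forward(self, noisy_motion, t, audio_feat):
        # noisy_motion: (B, T, motion_dim); t: (B,) diffusion step; audio_feat: (B, T, audio_dim)
        x = self.motion_proj(noisy_motion)
        # simple additive fusion of the spatial-audio embedding with each motion token
        x = x + self.audio_proj(audio_feat)
        t_emb = self.time_embed(t.float().unsqueeze(-1)).unsqueeze(1)  # (B, 1, latent_dim)
        x = self.backbone(x + t_emb)
        return self.out(x)  # predicted clean motion

# Usage sketch: 120 frames of noised motion and per-frame spatial audio features.
model = SpatialAudioMotionDenoiser()
noisy = torch.randn(2, 120, 263)
audio = torch.randn(2, 120, 128)
t = torch.randint(0, 1000, (2,))
print(model(noisy, t, audio).shape)  # torch.Size([2, 120, 263])
```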