MOSPA: Human Motion Generation Driven by Spatial Audio
July 16, 2025
作者: Shuyang Xu, Zhiyang Dou, Mingyi Shi, Liang Pan, Leo Ho, Jingbo Wang, Yuan Liu, Cheng Lin, Yuexin Ma, Wenping Wang, Taku Komura
cs.AI
Abstract
Enabling virtual humans to dynamically and realistically respond to diverse
auditory stimuli remains a key challenge in character animation, demanding the
integration of perceptual modeling and motion synthesis. Despite its
significance, this task remains largely unexplored. Most previous works have
primarily focused on mapping modalities like speech, audio, and music to
generate human motion. However, these models typically overlook the impact of
spatial features encoded in spatial audio signals on human motion. To bridge
this gap and enable high-quality modeling of human movements in response to
spatial audio, we introduce the first comprehensive Spatial Audio-Driven Human
Motion (SAM) dataset, which contains diverse and high-quality spatial audio and
motion data. For benchmarking, we develop a simple yet effective
diffusion-based generative framework for human MOtion generation driven by
SPatial Audio, termed MOSPA, which faithfully captures the relationship between
body motion and spatial audio through an effective fusion mechanism. Once
trained, MOSPA can generate diverse and realistic human motions conditioned on
varying spatial audio inputs. We perform a thorough investigation of the
proposed dataset and conduct extensive experiments for benchmarking, where our
method achieves state-of-the-art performance on this task. Our model and
dataset will be open-sourced upon acceptance. Please refer to our supplementary
video for more details.
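
The abstract describes MOSPA only at a high level: a diffusion-based generative framework that fuses spatial audio features with body motion. As a rough illustration of what such a conditional denoiser could look like, the minimal sketch below uses a generic transformer backbone with additive per-frame fusion of audio and motion features. The class name `SpatialAudioMotionDenoiser`, the feature dimensions, the fusion scheme, and the x0-prediction objective are assumptions made for illustration, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of a conditional diffusion denoiser for
# spatial-audio-driven motion generation. Architecture details are illustrative.
import math
import torch
import torch.nn as nn


def timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Standard sinusoidal embedding of diffusion timesteps."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)


class SpatialAudioMotionDenoiser(nn.Module):
    """Predicts the clean motion x0 from a noised motion sequence, conditioned on
    frame-aligned spatial audio features (hypothetical feature layout)."""

    def __init__(self, motion_dim=263, audio_dim=128, d_model=512, n_layers=8, n_heads=8):
        super().__init__()
        self.motion_in = nn.Linear(motion_dim, d_model)
        self.audio_in = nn.Linear(audio_dim, d_model)
        self.t_embed = nn.Sequential(
            nn.Linear(d_model, d_model), nn.SiLU(), nn.Linear(d_model, d_model)
        )
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=2048, batch_first=True)
        self.backbone = nn.TransformerEncoder(enc_layer, n_layers)
        self.motion_out = nn.Linear(d_model, motion_dim)

    def forward(self, x_t, t, audio):
        # x_t:   (B, T, motion_dim)  noised motion
        # t:     (B,)                diffusion timestep
        # audio: (B, T, audio_dim)   frame-aligned spatial audio features
        h = self.motion_in(x_t) + self.audio_in(audio)  # simple additive fusion per frame
        h = h + self.t_embed(timestep_embedding(t, h.shape[-1]))[:, None, :]
        return self.motion_out(self.backbone(h))        # predicted clean motion x0


# Toy usage: one training step with an x0-prediction objective and a
# placeholder noise schedule.
model = SpatialAudioMotionDenoiser()
x0 = torch.randn(2, 120, 263)       # ground-truth motion (2 clips, 120 frames)
audio = torch.randn(2, 120, 128)    # spatial audio features
t = torch.randint(0, 1000, (2,))
alpha_bar = torch.rand(2, 1, 1)     # stand-in for the cumulative noise schedule
x_t = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * torch.randn_like(x0)
loss = nn.functional.mse_loss(model(x_t, t, audio), x0)
loss.backward()
```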