Dense Motion Captioning
November 7, 2025
Authors: Shiyao Xu, Benedetta Liberatori, Gül Varol, Paolo Rota
cs.AI
Abstract
Recent advances in 3D human motion and language integration have primarily
focused on text-to-motion generation, leaving the task of motion understanding
relatively unexplored. We introduce Dense Motion Captioning, a novel task that
aims to temporally localize and caption actions within 3D human motion
sequences. Current datasets fall short in providing detailed temporal
annotations and predominantly consist of short sequences featuring few actions.
To overcome these limitations, we present the Complex Motion Dataset (CompMo),
the first large-scale dataset featuring richly annotated, complex motion
sequences with precise temporal boundaries. Built through a carefully designed
data generation pipeline, CompMo includes 60,000 motion sequences, each
composed of two to ten actions, accurately
annotated with their temporal extents. We further present DEMO, a model that
integrates a large language model with a simple motion adapter, trained to
generate dense, temporally grounded captions. Our experiments show that DEMO
substantially outperforms existing methods on CompMo as well as on adapted
benchmarks, establishing a robust baseline for future research in 3D motion
understanding and captioning.
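
For illustration, here is a minimal sketch of what a CompMo-style densely annotated sample might look like: a motion sequence paired with a list of temporally localized captions. The field names, units, and the 263-dimensional pose features are assumptions for illustration, not the dataset's actual schema:

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class Segment:
    """One temporally grounded caption: an action interval plus its description."""
    start: float  # start of the action (e.g., in seconds)
    end: float    # end of the action
    caption: str  # free-form description of the action

@dataclass
class MotionSample:
    """A CompMo-style example: a 3D motion sequence with 2-10 localized actions."""
    motion: np.ndarray                                  # (T, D) pose features over T timesteps
    segments: list[Segment] = field(default_factory=list)

# A hypothetical dense-captioning target for one sequence:
sample = MotionSample(
    motion=np.zeros((210, 263)),  # 263-dim HumanML3D-style features (an assumption)
    segments=[
        Segment(0.0, 2.1, "a person walks forward"),
        Segment(2.1, 4.8, "the person crouches and picks something up"),
        Segment(4.8, 7.0, "the person turns around and sits down"),
    ],
)
```

Dense motion captioning then amounts to predicting `segments` from `motion`, jointly localizing each action in time and describing it.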
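Likewise, the stated DEMO design (a large language model coupled with a "simple motion adapter") suggests a projection from motion features into the LLM's embedding space. Below is a minimal sketch under that assumption; the linear adapter, the 263-dimensional input, and the 4096-dimensional LLM hidden size are illustrative guesses, not the paper's actual configuration:

```python
import torch
import torch.nn as nn

class MotionAdapter(nn.Module):
    """Hypothetical adapter: maps per-frame motion features to LLM-sized
    embeddings so the sequence can be fed to the LLM as soft prefix tokens."""
    def __init__(self, motion_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(motion_dim, llm_dim)

    def forward(self, motion: torch.Tensor) -> torch.Tensor:
        # motion: (B, T, motion_dim) -> (B, T, llm_dim)
        return self.proj(motion)

adapter = MotionAdapter(motion_dim=263, llm_dim=4096)  # dimensions are assumptions
motion = torch.randn(1, 120, 263)                      # 120 frames of pose features
motion_tokens = adapter(motion)                        # (1, 120, 4096) soft tokens

# In a setup like this, the motion tokens would be concatenated with text
# embeddings, and the LLM trained to emit captions interleaved with temporal
# anchors (the exact output format is an assumption, not specified here).
```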