
Dense Motion Captioning

November 7, 2025
Authors: Shiyao Xu, Benedetta Liberatori, Gül Varol, Paolo Rota
cs.AI

Abstract

Recent advances in 3D human motion and language integration have primarily focused on text-to-motion generation, leaving the task of motion understanding relatively unexplored. We introduce Dense Motion Captioning, a novel task that aims to temporally localize and caption actions within 3D human motion sequences. Current datasets fall short in providing detailed temporal annotations and predominantly consist of short sequences featuring few actions. To overcome these limitations, we present the Complex Motion Dataset (CompMo), the first large-scale dataset featuring richly annotated, complex motion sequences with precise temporal boundaries. Built through a carefully designed data generation pipeline, CompMo includes 60,000 motion sequences, each composed of two to ten actions, each accurately annotated with its temporal extent. We further present DEMO, a model that integrates a large language model with a simple motion adapter, trained to generate dense, temporally grounded captions. Our experiments show that DEMO substantially outperforms existing methods on CompMo as well as on adapted benchmarks, establishing a robust baseline for future research in 3D motion understanding and captioning.
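
The abstract does not specify CompMo's annotation schema. Purely as an illustration of the kind of record a dense-motion-captioning dataset implies (a sequence paired with per-action temporal boundaries and captions), one could model it as below; all class and field names here are hypothetical, not the dataset's actual format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ActionSegment:
    """One temporally localized action within a motion sequence."""
    start_frame: int  # first frame of the action (inclusive)
    end_frame: int    # frame after the action ends (exclusive)
    caption: str      # natural-language description of the action

@dataclass
class MotionAnnotation:
    """Dense-caption annotation for one CompMo-style sequence."""
    sequence_id: str
    num_frames: int
    segments: List[ActionSegment]  # CompMo sequences contain 2 to 10 actions

# Illustrative two-action sequence (values invented for the example).
ann = MotionAnnotation(
    sequence_id="compmo_000001",
    num_frames=240,
    segments=[
        ActionSegment(0, 120, "a person walks forward"),
        ActionSegment(120, 240, "the person sits down on a chair"),
    ],
)
```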
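Likewise, the abstract only states that DEMO couples a large language model with a "simple motion adapter" without describing the adapter. A minimal sketch of one plausible reading is a small MLP that projects per-frame motion features into the LLM's token-embedding space, in the spirit of LLaVA-style adapters; the input dimension 263 (a HumanML3D-style feature size) and LLM width 4096 are assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class MotionAdapter(nn.Module):
    """Hypothetical adapter: maps per-frame motion features into the
    LLM embedding space so the language model can attend to them.
    The paper's actual adapter design is not given in the abstract."""

    def __init__(self, motion_dim: int = 263, llm_dim: int = 4096):
        super().__init__()
        # A simple two-layer MLP projection.
        self.proj = nn.Sequential(
            nn.Linear(motion_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, motion: torch.Tensor) -> torch.Tensor:
        # motion: (batch, frames, motion_dim) -> (batch, frames, llm_dim)
        return self.proj(motion)

adapter = MotionAdapter()
motion_feats = torch.randn(1, 120, 263)   # 120 frames of motion features
motion_tokens = adapter(motion_feats)     # (1, 120, 4096)
# The projected motion tokens would then be interleaved or concatenated
# with text token embeddings before being fed to the LLM, which is trained
# to emit captions together with their temporal boundaries.
```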