Dense Motion Captioning

November 7, 2025
Authors: Shiyao Xu, Benedetta Liberatori, Gül Varol, Paolo Rota
cs.AI

Abstract

Recent advances in 3D human motion and language integration have primarily focused on text-to-motion generation, leaving the task of motion understanding relatively unexplored. We introduce Dense Motion Captioning, a novel task that aims to temporally localize and caption actions within 3D human motion sequences. Current datasets fall short in providing detailed temporal annotations and predominantly consist of short sequences featuring few actions. To overcome these limitations, we present the Complex Motion Dataset (CompMo), the first large-scale dataset featuring richly annotated, complex motion sequences with precise temporal boundaries. Built through a carefully designed data generation pipeline, CompMo includes 60,000 motion sequences, each composed of two to ten actions, accurately annotated with their temporal extents. We further present DEMO, a model that integrates a large language model with a simple motion adapter, trained to generate dense, temporally grounded captions. Our experiments show that DEMO substantially outperforms existing methods on CompMo as well as on adapted benchmarks, establishing a robust baseline for future research in 3D motion understanding and captioning.
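The abstract describes CompMo's annotations (two to ten actions per sequence, each with a temporal extent) and DEMO's architecture (an LLM coupled with a simple motion adapter) only at a high level. The sketch below illustrates both ideas under stated assumptions: `ActionSegment`, `MotionAdapter`, and all dimensions (a 263-dim motion feature, as in HumanML3D-style representations, and a 4096-dim LLM embedding) are hypothetical choices for illustration, not details from the paper.

```python
from dataclasses import dataclass

import torch
import torch.nn as nn


@dataclass
class ActionSegment:
    """One temporally grounded action within a motion sequence
    (hypothetical schema; the paper's actual format is not specified here)."""
    start: float   # segment start, in seconds
    end: float     # segment end, in seconds
    caption: str   # free-form action description


# A CompMo-style sequence pairs a 3D motion with 2-10 such segments, e.g.:
annotation = [
    ActionSegment(0.0, 2.4, "walks forward"),
    ActionSegment(2.4, 5.1, "sits down on a chair"),
]


class MotionAdapter(nn.Module):
    """Minimal adapter in the spirit of 'LLM + simple motion adapter':
    a linear projection from per-frame motion features into the LLM's
    token-embedding space. Dimensions are illustrative assumptions."""

    def __init__(self, motion_dim: int = 263, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(motion_dim, llm_dim)

    def forward(self, motion: torch.Tensor) -> torch.Tensor:
        # motion: (batch, frames, motion_dim) -> (batch, frames, llm_dim).
        # The projected frames act as "motion tokens" that a decoder-only
        # LLM can attend to before emitting dense, time-stamped captions.
        return self.proj(motion)


# Example: embed a 120-frame motion clip as LLM-space tokens.
tokens = MotionAdapter()(torch.randn(1, 120, 263))
print(tokens.shape)  # torch.Size([1, 120, 4096])
```

A linear projection is the simplest adapter consistent with the abstract's wording; the actual model may use a more elaborate mapping and interleave motion and text tokens differently.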