MotionLLM: Understanding Human Behaviors from Human Motions and Videos
May 30, 2024
Authors: Ling-Hao Chen, Shunlin Lu, Ailing Zeng, Hao Zhang, Benyou Wang, Ruimao Zhang, Lei Zhang
cs.AI
Abstract
This study delves into the realm of multi-modality (i.e., video and motion
modalities) human behavior understanding by leveraging the powerful
capabilities of Large Language Models (LLMs). Diverging from recent LLMs
designed for video-only or motion-only understanding, we argue that
understanding human behavior necessitates joint modeling from both videos and
motion sequences (e.g., SMPL sequences) to capture nuanced body part dynamics
and semantics effectively. In light of this, we present MotionLLM, a
straightforward yet effective framework for human motion understanding,
captioning, and reasoning. Specifically, MotionLLM adopts a unified
video-motion training strategy that leverages the complementary advantages of
existing coarse video-text data and fine-grained motion-text data to glean rich
spatial-temporal insights. Furthermore, we collect a substantial dataset,
MoVid, comprising diverse videos, motions, captions, and instructions.
Additionally, we propose the MoVid-Bench, with carefully curated manual annotations,
for better evaluation of human behavior understanding on video and motion.
Extensive experiments show the superiority of MotionLLM in captioning,
spatial-temporal comprehension, and reasoning ability.
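The abstract describes a unified video-motion training strategy in which coarse video-text data and fine-grained motion-text data (e.g., SMPL sequences) share one language backbone. The sketch below illustrates one plausible way such a setup could be wired together; the encoder modules, projector layers, and the Hugging Face-style causal LM interface are assumptions made for illustration, not the paper's released implementation.

```python
# A rough, hypothetical sketch of the unified video-motion training idea
# described in the abstract -- NOT the authors' released implementation.
# Coarse video-text pairs and fine-grained motion-text pairs are projected
# into the language model's embedding space and trained with the same
# next-token prediction objective.
import torch
import torch.nn as nn


class UnifiedVideoMotionSketch(nn.Module):
    """All module names and interfaces here are assumed for illustration."""

    def __init__(self, llm, video_encoder, motion_encoder, d_llm):
        super().__init__()
        self.llm = llm                        # causal LM (frozen or LoRA-tuned)
        self.video_encoder = video_encoder    # pretrained video backbone
        self.motion_encoder = motion_encoder  # encoder over SMPL motion sequences
        # Lightweight projectors map each modality into the LLM token space.
        self.video_proj = nn.Linear(video_encoder.out_dim, d_llm)
        self.motion_proj = nn.Linear(motion_encoder.out_dim, d_llm)

    def forward(self, batch):
        # Each batch carries either video or motion features plus text tokens,
        # so the two data sources can be interleaved during training.
        if batch["modality"] == "video":
            feats = self.video_proj(self.video_encoder(batch["video"]))
        else:
            feats = self.motion_proj(self.motion_encoder(batch["motion"]))

        # Prepend modality embeddings to the text embeddings; mask their
        # positions in the labels so loss is computed only on the answer text.
        text_emb = self.llm.get_input_embeddings()(batch["input_ids"])
        inputs = torch.cat([feats, text_emb], dim=1)
        ignore = torch.full(feats.shape[:2], -100,
                            dtype=batch["labels"].dtype, device=feats.device)
        labels = torch.cat([ignore, batch["labels"]], dim=1)
        return self.llm(inputs_embeds=inputs, labels=labels).loss
```

In a training loop under these assumptions, batches from the video-text and motion-text corpora would simply be alternated or mixed, so the shared LLM sees both coarse temporal context from videos and fine-grained body-part dynamics from motion sequences.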