MotionLLM: Understanding Human Behaviors from Human Motions and Videos

May 30, 2024
Authors: Ling-Hao Chen, Shunlin Lu, Ailing Zeng, Hao Zhang, Benyou Wang, Ruimao Zhang, Lei Zhang
cs.AI

Abstract

This study delves into the realm of multi-modality (i.e., video and motion modalities) human behavior understanding by leveraging the powerful capabilities of Large Language Models (LLMs). Diverging from recent LLMs designed for video-only or motion-only understanding, we argue that understanding human behavior necessitates joint modeling from both videos and motion sequences (e.g., SMPL sequences) to capture nuanced body part dynamics and semantics effectively. In light of this, we present MotionLLM, a straightforward yet effective framework for human motion understanding, captioning, and reasoning. Specifically, MotionLLM adopts a unified video-motion training strategy that leverages the complementary advantages of existing coarse video-text data and fine-grained motion-text data to glean rich spatial-temporal insights. Furthermore, we collect a substantial dataset, MoVid, comprising diverse videos, motions, captions, and instructions. Additionally, we propose MoVid-Bench, with careful manual annotations, for better evaluation of human behavior understanding on video and motion. Extensive experiments show the superiority of MotionLLM in captioning, spatial-temporal comprehension, and reasoning.
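The abstract describes the architecture only at a high level. The sketch below is a hypothetical, minimal PyTorch illustration of what a unified video-motion instruction model of this kind could look like: per-modality projectors map video features and SMPL-style motion features into a shared token space, which a single language backbone consumes together with text tokens, so one model can be trained on both coarse video-text and fine-grained motion-text pairs. All module names, feature dimensions (e.g., 263-dim motion features), and the small Transformer standing in for the LLM are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a unified video-motion language model; not MotionLLM's code.
import torch
import torch.nn as nn


class ModalityProjector(nn.Module):
    """Maps encoder features of one modality into the shared LLM token space."""

    def __init__(self, in_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_tokens, in_dim) -> (batch, num_tokens, llm_dim)
        return self.proj(feats)


class UnifiedVideoMotionModel(nn.Module):
    """Toy stand-in for a video/motion-conditioned language model.

    Video and motion features are projected into one token space and prepended
    to the text tokens; a small Transformer replaces the real LLM backbone to
    keep the sketch self-contained (a real system would use a causal decoder).
    """

    def __init__(self, video_dim=1024, motion_dim=263, llm_dim=512, vocab=32000):
        super().__init__()
        self.video_proj = ModalityProjector(video_dim, llm_dim)
        self.motion_proj = ModalityProjector(motion_dim, llm_dim)
        self.token_emb = nn.Embedding(vocab, llm_dim)
        block = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, num_layers=2)
        self.lm_head = nn.Linear(llm_dim, vocab)

    def forward(self, text_ids, video_feats=None, motion_feats=None):
        parts = []
        if video_feats is not None:
            parts.append(self.video_proj(video_feats))
        if motion_feats is not None:
            parts.append(self.motion_proj(motion_feats))
        parts.append(self.token_emb(text_ids))
        tokens = torch.cat(parts, dim=1)   # (batch, total_tokens, llm_dim)
        hidden = self.backbone(tokens)
        return self.lm_head(hidden)        # per-token vocabulary logits


if __name__ == "__main__":
    model = UnifiedVideoMotionModel()
    text = torch.randint(0, 32000, (2, 16))   # dummy instruction tokens
    video = torch.randn(2, 8, 1024)            # 8 pooled video-frame features
    motion = torch.randn(2, 32, 263)           # 32 SMPL-style motion frames
    logits = model(text, video_feats=video, motion_feats=motion)
    print(logits.shape)                        # torch.Size([2, 56, 32000])
```

Sharing one backbone across both conditioning paths is what would let coarse video-text supervision and fine-grained motion-text supervision complement each other, in the spirit of the unified training strategy the abstract describes.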
