LLaVAction: evaluating and training multi-modal large language models for action recognition
March 24, 2025
Authors: Shaokai Ye, Haozhe Qi, Alexander Mathis, Mackenzie W. Mathis
cs.AI
Abstract
Understanding human behavior requires measuring behavioral actions. Due to
its complexity, behavior is best mapped onto a rich, semantic structure such as
language. Recently developed multi-modal large language models (MLLMs)
are promising candidates for a wide range of action understanding tasks. In
this work, we focus on evaluating and then improving MLLMs to perform action
recognition. We reformulate EPIC-KITCHENS-100, one of the largest and most
challenging egocentric action datasets, into the form of video multiple question
answering (EPIC-KITCHENS-100-MQA). We show that when we sample difficult
incorrect answers as distractors, leading MLLMs struggle to recognize the
correct actions. We propose a series of methods that greatly improve the MLLMs'
ability to perform action recognition, achieving state-of-the-art results on the
EPIC-KITCHENS-100 validation set and outperforming GPT-4o by 21 points
in accuracy on EPIC-KITCHENS-100-MQA. Lastly, we show improvements on other
action-related video benchmarks such as EgoSchema, PerceptionTest,
LongVideoBench, VideoMME and MVBench, suggesting that MLLMs are a promising
path forward for complex action tasks. Code and models are available at:
https://github.com/AdaptiveMotorControlLab/LLaVAction.