LLaVAction: evaluating and training multi-modal large language models for action recognition

Abstract

Understanding human behavior requires measuring behavioral actions. Because behavior is complex, it is best mapped onto a rich semantic structure such as language. Recently developed multi-modal large language models (MLLMs) are promising candidates for a wide range of action understanding tasks. In this work, we focus on evaluating and then improving MLLMs for action recognition. We reformulate EPIC-KITCHENS-100, one of the largest and most challenging egocentric action datasets, into the form of video multiple question answering (EPIC-KITCHENS-100-MQA). We show that when difficult incorrect answers are sampled as distractors, leading MLLMs struggle to recognize the correct actions. We propose a series of methods that greatly improve the MLLMs' ability to perform action recognition, achieving state-of-the-art results on the EPIC-KITCHENS-100 validation set and outperforming GPT-4o by 21 points in accuracy on EPIC-KITCHENS-100-MQA. Lastly, we show improvements on other action-related video benchmarks such as EgoSchema, PerceptionTest, LongVideoBench, VideoMME, and MVBench, suggesting that MLLMs are a promising path forward for complex action tasks. Code and models are available at: https://github.com/AdaptiveMotorControlLab/LLaVAction.
