LLaVAction: evaluating and training multi-modal large language models for action recognition

Abstract

Understanding human behavior requires measuring behavioral actions. Due to its complexity, behavior is best mapped onto a rich semantic structure such as language. Recently developed multi-modal large language models (MLLMs) are promising candidates for a wide range of action understanding tasks. In this work, we focus on evaluating and then improving MLLMs for action recognition. We reformulate EPIC-KITCHENS-100, one of the largest and most challenging egocentric action datasets, into the form of video multiple question answering (EPIC-KITCHENS-100-MQA). We show that when we sample difficult incorrect answers as distractors, leading MLLMs struggle to recognize the correct actions. We propose a series of methods that greatly improve the MLLMs' ability to perform action recognition, achieving state-of-the-art results on the EPIC-KITCHENS-100 validation set and outperforming GPT-4o by 21 points in accuracy on EPIC-KITCHENS-100-MQA. Lastly, we show improvements on other action-related video benchmarks such as EgoSchema, PerceptionTest, LongVideoBench, VideoMME, and MVBench, suggesting that MLLMs are a promising path forward for complex action tasks. Code and models are available at: https://github.com/AdaptiveMotorControlLab/LLaVAction.
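To make the multiple-choice reformulation concrete, the following is a minimal sketch of how one action label might be turned into a question with hard distractors. It is an illustration under stated assumptions, not the procedure used in the paper: the function name `build_mcq`, the `candidate_scores` input (confusability scores from some hypothetical reference model), and the rule of taking the highest-scoring incorrect labels as distractors are all assumed here.

```python
import random


def build_mcq(correct, candidate_scores, num_options=5, seed=0):
    """Build one multiple-choice question for a ground-truth action label.

    candidate_scores maps each action label to a score from a hypothetical
    reference model; a higher score is assumed to mean "more confusable".
    """
    # Rank the incorrect labels so the most confusable ones come first.
    wrong = sorted(
        (label for label in candidate_scores if label != correct),
        key=lambda label: candidate_scores[label],
        reverse=True,
    )
    # Keep the hardest incorrect labels as distractors (assumed rule).
    distractors = wrong[: num_options - 1]

    # Shuffle so the position of the correct answer carries no signal.
    options = distractors + [correct]
    random.Random(seed).shuffle(options)

    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    return {
        "options": {letters[i]: opt for i, opt in enumerate(options)},
        "answer": letters[options.index(correct)],
    }
```

Calling `build_mcq("cut onion", scores, num_options=3)` with a dictionary of made-up scores returns a lettered option set and the letter of the correct action. Replacing the ranking step with a uniform random draw would give an easier variant of the question, which is one way to contrast random and difficult distractors.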
