AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?

Abstract

Can we better anticipate an actor's future actions (e.g. mix eggs) by knowing what commonly happens after their current action (e.g. crack eggs)? What if we also know the actor's longer-term goal (e.g. making egg fried rice)? The long-term action anticipation (LTA) task aims to predict an actor's future behavior from video observations in the form of verb and noun sequences, and it is crucial for human-machine interaction. We propose to formulate the LTA task from two perspectives: a bottom-up approach that predicts the next actions autoregressively by modeling temporal dynamics, and a top-down approach that infers the actor's goal and plans the procedure needed to accomplish it. We hypothesize that large language models (LLMs), which have been pretrained on procedural text data (e.g. recipes, how-tos), have the potential to help LTA from both perspectives: they can supply prior knowledge about the possible next actions, and they can infer the goal from the observed part of a procedure. To leverage LLMs, we propose a two-stage framework, AntGPT. It first recognizes the actions already performed in the observed videos, and then asks an LLM either to predict the future actions via conditioned generation, or to infer the goal and plan the whole procedure via chain-of-thought prompting. Empirical results on the Ego4D LTA v1 and v2 benchmarks, EPIC-Kitchens-55, and EGTEA GAZE+ demonstrate the effectiveness of our proposed approach. AntGPT achieves state-of-the-art performance on all of the above benchmarks, and qualitative analysis shows that it can successfully infer the goal and thus perform goal-conditioned "counterfactual" prediction. Code and model will be released at https://brown-palm.github.io/AntGPT
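
As a rough illustration of the two-stage idea described in the abstract, the Python sketch below turns already-recognized (verb, noun) actions into a text prompt, queries a language model, and parses the reply back into a verb-noun sequence. It is not the authors' implementation. The prompt wording, the function names (build_prompt, infer_goal, parse_actions, anticipate), and the toy_llm stand-in are all assumptions made for illustration, and the action recognition stage is assumed to have already produced the observed (verb, noun) pairs.

```python
from typing import Callable, List, Optional, Tuple

Action = Tuple[str, str]  # (verb, noun), e.g. ("crack", "eggs")


def format_actions(actions: List[Action]) -> str:
    """Render recognized actions as comma-separated 'verb noun' text."""
    return ", ".join(f"{verb} {noun}" for verb, noun in actions)


def build_prompt(observed: List[Action], num_future: int,
                 goal: Optional[str] = None) -> str:
    """Hypothetical prompt format; the goal line is only added top-down."""
    lines = []
    if goal is not None:
        lines.append(f"Goal: {goal}")
    lines.append(f"Observed actions: {format_actions(observed)}")
    lines.append(
        f"Predict the next {num_future} actions as comma-separated "
        "'verb noun' pairs:"
    )
    return "\n".join(lines)


def infer_goal(observed: List[Action], llm: Callable[[str], str]) -> str:
    """Top-down step: ask the model what the actor is trying to achieve."""
    prompt = ("Infer the actor's goal from these actions: "
              + format_actions(observed))
    return llm(prompt).strip()


def parse_actions(text: str, num_future: int) -> List[Action]:
    """Parse 'verb noun, verb noun' text; skip chunks without a noun."""
    actions: List[Action] = []
    for chunk in text.split(","):
        words = chunk.strip().lower().split()
        if len(words) >= 2:
            actions.append((words[0], " ".join(words[1:])))
    return actions[:num_future]


def anticipate(observed: List[Action], llm: Callable[[str], str],
               num_future: int, top_down: bool = False) -> List[Action]:
    """Bottom-up: predict directly. Top-down: infer a goal first."""
    goal = infer_goal(observed, llm) if top_down else None
    prompt = build_prompt(observed, num_future, goal)
    return parse_actions(llm(prompt), num_future)


def toy_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; returns canned text for the demo."""
    if prompt.startswith("Infer"):
        return "make egg fried rice"
    return "mix eggs, heat pan, pour eggs"


if __name__ == "__main__":
    observed = [("crack", "eggs")]
    print(anticipate(observed, toy_llm, num_future=2))
    print(anticipate(observed, toy_llm, num_future=3, top_down=True))
```

In this sketch, swapping toy_llm for a call to an actual model is the only change needed to try either variant. The top-down path differs from the bottom-up one only in that an inferred goal is prepended to the prompt, which is also where a manually chosen goal could be substituted to explore goal-conditioned predictions.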
