AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?
July 31, 2023
Authors: Qi Zhao, Ce Zhang, Shijie Wang, Changcheng Fu, Nakul Agarwal, Kwonjoon Lee, Chen Sun
cs.AI
Abstract
Can we better anticipate an actor's future actions (e.g. mix eggs) by knowing
what commonly happens after his/her current action (e.g. crack eggs)? What if
we also know the longer-term goal of the actor (e.g. making egg fried rice)?
The long-term action anticipation (LTA) task aims to predict an actor's future
behavior from video observations in the form of verb and noun sequences, and it
is crucial for human-machine interaction. We propose to formulate the LTA task
from two perspectives: a bottom-up approach that predicts the next actions
autoregressively by modeling temporal dynamics; and a top-down approach that
infers the goal of the actor and plans the needed procedure to accomplish the
goal. We hypothesize that large language models (LLMs), which have been
pretrained on procedure text data (e.g. recipes, how-tos), have the potential
to help LTA from both perspectives: they can provide prior knowledge of the
possible next actions, and they can infer the goal given the observed part of a
procedure. To leverage LLMs, we propose a two-stage
framework, AntGPT. It first recognizes the actions already performed in the
observed videos and then asks an LLM to predict the future actions via
conditioned generation, or to infer the goal and plan the whole procedure by
chain-of-thought prompting. Empirical results on the Ego4D LTA v1 and v2
benchmarks, EPIC-Kitchens-55, as well as EGTEA GAZE+ demonstrate the
effectiveness of our proposed approach. AntGPT achieves state-of-the-art
performance on all of the above benchmarks, and it can successfully infer the
goal, enabling goal-conditioned "counterfactual" prediction, as shown in our
qualitative analysis. Code and models will be released at
https://brown-palm.github.io/AntGPT
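
The abstract describes a two-stage pipeline: a recognition stage that turns the observed video into a sequence of (verb, noun) actions, and a language-model stage that either continues that sequence autoregressively (bottom-up) or first infers the actor's goal and then plans toward it via chain-of-thought prompting (top-down). Below is a minimal sketch of how the second stage could be prompted; the `query_llm` wrapper, the prompt wording, and the toy `fake_llm` are illustrative assumptions, not the authors' actual implementation.

```python
from typing import Callable, List, Tuple

Action = Tuple[str, str]  # a (verb, noun) pair, as in the Ego4D LTA setup


def format_actions(actions: List[Action]) -> str:
    """Turn recognized (verb, noun) pairs into a comma-separated string."""
    return ", ".join(f"{verb} {noun}" for verb, noun in actions)


def bottom_up_prompt(observed: List[Action], num_future: int) -> str:
    """Bottom-up: ask the LLM to autoregressively continue the action sequence."""
    return (
        f"Observed actions: {format_actions(observed)}.\n"
        f"Predict the next {num_future} actions as 'verb noun' pairs, one per line."
    )


def top_down_prompt(observed: List[Action], num_future: int) -> str:
    """Top-down: chain-of-thought style prompt that first infers the goal,
    then plans the remaining procedure toward that goal."""
    return (
        f"Observed actions: {format_actions(observed)}.\n"
        "Step 1: Infer the actor's long-term goal.\n"
        f"Step 2: Given that goal, list the next {num_future} actions "
        "as 'verb noun' pairs, one per line."
    )


def anticipate(
    observed: List[Action],
    num_future: int,
    query_llm: Callable[[str], str],  # hypothetical LLM wrapper (e.g. an API call)
    top_down: bool = True,
) -> List[str]:
    """Second stage of the pipeline: prompt the LLM and parse its reply."""
    prompt = (
        top_down_prompt(observed, num_future)
        if top_down
        else bottom_up_prompt(observed, num_future)
    )
    reply = query_llm(prompt)
    return [line.strip() for line in reply.splitlines() if line.strip()]


if __name__ == "__main__":
    # Toy stand-in for an LLM so the sketch runs without any API access.
    def fake_llm(prompt: str) -> str:
        return "mix egg\npour egg\nstir rice"

    observed = [("crack", "egg"), ("beat", "egg")]
    print(anticipate(observed, num_future=3, query_llm=fake_llm))
```

In the paper's actual setting the recognized actions come from a video model and the future actions are scored against the Ego4D verb/noun vocabularies; the sketch only illustrates how the bottom-up and top-down prompting strategies differ.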