AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?
July 31, 2023
Authors: Qi Zhao, Ce Zhang, Shijie Wang, Changcheng Fu, Nakul Agarwal, Kwonjoon Lee, Chen Sun
cs.AI
Abstract
Can we better anticipate an actor's future actions (e.g. mix eggs) by knowing
what commonly happens after his/her current action (e.g. crack eggs)? What if
we also know the longer-term goal of the actor (e.g. making egg fried rice)?
The long-term action anticipation (LTA) task aims to predict an actor's future
behavior from video observations in the form of verb and noun sequences, and it
is crucial for human-machine interaction. We propose to formulate the LTA task
from two perspectives: a bottom-up approach that predicts the next actions
autoregressively by modeling temporal dynamics; and a top-down approach that
infers the goal of the actor and plans the needed procedure to accomplish the
goal. We hypothesize that large language models (LLMs), which have been
pretrained on procedure text data (e.g. recipes, how-tos), have the potential
to help LTA from both perspectives: they can provide prior knowledge of the
possible next actions, and they can infer the goal given the observed part of a
procedure. To leverage LLMs, we propose a two-stage
framework, AntGPT. It first recognizes the actions already performed in the
observed videos and then asks an LLM to predict the future actions via
conditioned generation, or to infer the goal and plan the whole procedure by
chain-of-thought prompting. Empirical results on the Ego4D LTA v1 and v2
benchmarks, EPIC-Kitchens-55, as well as EGTEA GAZE+ demonstrate the
effectiveness of our proposed approach. AntGPT achieves state-of-the-art
performance on all of the above benchmarks, and it can successfully infer the
goal, enabling goal-conditioned "counterfactual" prediction, as shown in our
qualitative analysis. Code and models will be released at
https://brown-palm.github.io/AntGPT
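
The abstract describes a two-stage pipeline: a recognition stage that turns the observed video into a sequence of (verb, noun) actions, and a language-model stage that either continues that sequence autoregressively (bottom-up) or first infers the actor's goal and then plans toward it via chain-of-thought prompting (top-down). Below is a minimal sketch of how the second stage could be prompted; the `query_llm` wrapper, the prompt wording, and the toy `fake_llm` are illustrative assumptions, not the authors' actual implementation.

```python
from typing import Callable, List, Tuple

Action = Tuple[str, str]  # a (verb, noun) pair, as in the Ego4D LTA setup


def format_actions(actions: List[Action]) -> str:
    """Turn recognized (verb, noun) pairs into a comma-separated string."""
    return ", ".join(f"{verb} {noun}" for verb, noun in actions)


def bottom_up_prompt(observed: List[Action], num_future: int) -> str:
    """Bottom-up: ask the LLM to autoregressively continue the action sequence."""
    return (
        f"Observed actions: {format_actions(observed)}.\n"
        f"Predict the next {num_future} actions as 'verb noun' pairs, one per line."
    )


def top_down_prompt(observed: List[Action], num_future: int) -> str:
    """Top-down: chain-of-thought style prompt that first infers the goal,
    then plans the remaining procedure toward that goal."""
    return (
        f"Observed actions: {format_actions(observed)}.\n"
        "Step 1: Infer the actor's long-term goal.\n"
        f"Step 2: Given that goal, list the next {num_future} actions "
        "as 'verb noun' pairs, one per line."
    )


def anticipate(
    observed: List[Action],
    num_future: int,
    query_llm: Callable[[str], str],  # hypothetical LLM wrapper (e.g. an API call)
    top_down: bool = True,
) -> List[str]:
    """Second stage of the pipeline: prompt the LLM and parse its reply."""
    prompt = (
        top_down_prompt(observed, num_future)
        if top_down
        else bottom_up_prompt(observed, num_future)
    )
    reply = query_llm(prompt)
    return [line.strip() for line in reply.splitlines() if line.strip()]


if __name__ == "__main__":
    # Toy stand-in for an LLM so the sketch runs without any API access.
    def fake_llm(prompt: str) -> str:
        return "mix egg\npour egg\nstir rice"

    observed = [("crack", "egg"), ("beat", "egg")]
    print(anticipate(observed, num_future=3, query_llm=fake_llm))
```

In the paper's actual setting the recognized actions come from a video model and the future actions are scored against the Ego4D verb/noun vocabularies; the sketch only illustrates how the bottom-up and top-down prompting strategies differ.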