AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?

Abstract

Can we better anticipate an actor's future actions (e.g., mix eggs) by knowing what commonly happens after their current action (e.g., crack eggs)? What if we also know the actor's longer-term goal (e.g., making egg fried rice)? The long-term action anticipation (LTA) task aims to predict an actor's future behavior from video observations in the form of verb and noun sequences, and it is crucial for human-machine interaction. We propose to formulate the LTA task from two perspectives: a bottom-up approach that predicts the next actions autoregressively by modeling temporal dynamics, and a top-down approach that infers the actor's goal and plans the procedure needed to accomplish it. We hypothesize that large language models (LLMs), which have been pretrained on procedural text data (e.g., recipes, how-tos), have the potential to help LTA from both perspectives: they can provide prior knowledge about the possible next actions, and they can infer the goal given the observed part of a procedure. To leverage LLMs, we propose a two-stage framework, AntGPT. It first recognizes the actions already performed in the observed videos and then asks an LLM either to predict the future actions via conditioned generation, or to infer the goal and plan the whole procedure via chain-of-thought prompting. Empirical results on the Ego4D LTA v1 and v2 benchmarks, EPIC-Kitchens-55, and EGTEA GAZE+ demonstrate the effectiveness of the proposed approach. AntGPT achieves state-of-the-art performance on all of these benchmarks, and qualitative analysis shows that it can successfully infer the goal and thus perform goal-conditioned "counterfactual" prediction. Code and model will be released at https://brown-palm.github.io/AntGPT
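The two-stage idea described in the abstract (recognize the observed actions, then ask an LLM to continue the sequence) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the prompt wording, the function names (`build_prompt`, `parse_actions`, `anticipate`) and the `stub_llm` stand-in are all assumptions made for illustration, and the action recognition stage is assumed to have already produced verb-noun pairs.

```python
from typing import Callable, List, Tuple

Action = Tuple[str, str]  # (verb, noun), e.g. ("crack", "egg")


def build_prompt(observed: List[Action], num_future: int) -> str:
    """Serialize recognized actions into a text prompt (hypothetical wording)."""
    history = ", ".join(f"{verb} {noun}" for verb, noun in observed)
    return (
        f"Observed actions: {history}\n"
        f"Predict the next {num_future} actions as 'verb noun' pairs, "
        "separated by commas:"
    )


def parse_actions(text: str, num_future: int) -> List[Action]:
    """Parse 'verb noun, verb noun, ...' into pairs, skipping malformed items."""
    actions: List[Action] = []
    for item in text.split(","):
        parts = item.strip().split(maxsplit=1)
        if len(parts) == 2:  # keep only items that have both a verb and a noun
            actions.append((parts[0], parts[1]))
    return actions[:num_future]


def anticipate(
    observed: List[Action], num_future: int, llm: Callable[[str], str]
) -> List[Action]:
    """Bottom-up style anticipation: prompt the LLM and parse its continuation."""
    return parse_actions(llm(build_prompt(observed, num_future)), num_future)


def stub_llm(prompt: str) -> str:
    """Placeholder for a real LLM call; returns a fixed string for illustration."""
    return "mix egg, pour oil, oops, stir rice"


if __name__ == "__main__":
    print(anticipate([("crack", "egg")], 3, stub_llm))
```

In this sketch the LLM is passed in as a plain callable, so the stub can be swapped for any text-generation backend. A top-down variant in the spirit of the abstract would first ask the model to name the likely goal and then include that goal in the prompt, but the exact prompts used by AntGPT are not given in this text.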
