AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?

Abstract

Can we better anticipate an actor's future actions (e.g. mix eggs) by knowing what commonly happens after their current action (e.g. crack eggs)? What if we also know the longer-term goal of the actor (e.g. making egg fried rice)? The long-term action anticipation (LTA) task aims to predict an actor's future behavior from video observations in the form of verb and noun sequences, and it is crucial for human-machine interaction. We propose to formulate the LTA task from two perspectives: a bottom-up approach that predicts the next actions autoregressively by modeling temporal dynamics, and a top-down approach that infers the goal of the actor and plans the procedure needed to accomplish the goal. We hypothesize that large language models (LLMs), which have been pretrained on procedure text data (e.g. recipes, how-tos), have the potential to help LTA from both perspectives: they can provide prior knowledge on the possible next actions, and they can infer the goal given the observed part of a procedure. To leverage LLMs, we propose a two-stage framework, AntGPT. It first recognizes the actions already performed in the observed videos, and then asks an LLM either to predict the future actions via conditioned generation or to infer the goal and plan the whole procedure via chain-of-thought prompting. Empirical results on the Ego4D LTA v1 and v2 benchmarks, EPIC-Kitchens-55, and EGTEA GAZE+ demonstrate the effectiveness of our proposed approach. AntGPT achieves state-of-the-art performance on all of the above benchmarks, and qualitative analysis shows that it can successfully infer the goal and thus perform goal-conditioned "counterfactual" prediction. Code and model will be released at https://brown-palm.github.io/AntGPT
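The second stage described in the abstract (prompting an LLM with already-recognized actions, optionally conditioned on a goal) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the prompt wording, and the comma-separated output format are all assumptions made for illustration, stage one (action recognition from video) is assumed to have already produced a list of verb-noun pairs, and the LLM is replaced by a stub that returns a fixed string.

```python
from typing import Callable, List, Optional, Tuple

Action = Tuple[str, str]  # (verb, noun), e.g. ("crack", "egg")


def format_actions(actions: List[Action]) -> str:
    """Render verb-noun pairs as a comma-separated string."""
    return ", ".join(f"{verb} {noun}" for verb, noun in actions)


def build_prompt(observed: List[Action], num_future: int,
                 goal: Optional[str] = None) -> str:
    """Build a text prompt from recognized actions (hypothetical format).

    If a goal is given, it is prepended so the prediction is goal-conditioned.
    """
    lines = []
    if goal is not None:
        lines.append(f"Goal: {goal}")
    lines.append(f"Observed actions: {format_actions(observed)}")
    lines.append(f"Next {num_future} actions:")
    return "\n".join(lines)


def parse_actions(text: str, num_future: int) -> List[Action]:
    """Parse 'verb noun, verb noun, ...' into at most num_future pairs."""
    actions: List[Action] = []
    for chunk in text.split(","):
        words = chunk.strip().split()
        if len(words) >= 2:  # skip malformed chunks without a noun
            actions.append((words[0], " ".join(words[1:])))
    return actions[:num_future]


def anticipate(observed: List[Action], num_future: int,
               llm: Callable[[str], str],
               goal: Optional[str] = None) -> List[Action]:
    """Ask the LLM (any callable from prompt to text) for future actions."""
    prompt = build_prompt(observed, num_future, goal)
    return parse_actions(llm(prompt), num_future)


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; always returns the same continuation."""
    return "mix egg, heat pan, pour oil"


if __name__ == "__main__":
    observed = [("take", "egg"), ("crack", "egg")]
    print(build_prompt(observed, 2, goal="make egg fried rice"))
    print(anticipate(observed, 2, fake_llm, goal="make egg fried rice"))
```

Passing the LLM in as a plain callable keeps the sketch independent of any particular model or API; changing the `goal` argument while keeping the observed actions fixed is one simple way to probe the goal-conditioned "counterfactual" behavior the abstract mentions.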
