VideoAgent: Long-form Video Understanding with Large Language Model as Agent

Abstract

Long-form video understanding is a significant challenge in computer vision, demanding models capable of reasoning over long multi-modal sequences. Motivated by the human cognitive process for long-form video understanding, we emphasize interactive reasoning and planning over the ability to process lengthy visual inputs. We introduce a novel agent-based system, VideoAgent, that employs a large language model as a central agent to iteratively identify and compile crucial information to answer a question, with vision-language foundation models serving as tools to translate and retrieve visual information. Evaluated on the challenging EgoSchema and NExT-QA benchmarks, VideoAgent achieves 54.1% and 71.3% zero-shot accuracy, respectively, using only 8.4 and 8.2 frames on average. These results demonstrate the superior effectiveness and efficiency of our method over current state-of-the-art methods, highlighting the potential of agent-based approaches in advancing long-form video understanding.
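
The iterative loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `run_agent`, `caption_frame`, `propose_answer`, and `retrieve_frames`, the uniform initial sampling, and the round limit are all assumptions made for illustration. The toy stubs stand in for a real large language model and real vision-language models.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class AgentState:
    """Frames inspected so far, stored as frame index -> text description."""
    captions: Dict[int, str] = field(default_factory=dict)


def uniform_sample(num_frames: int, k: int) -> List[int]:
    """Pick up to k roughly evenly spaced frame indices (assumed starting point)."""
    if k <= 1:
        return [0]
    step = (num_frames - 1) / (k - 1)
    return sorted({round(i * step) for i in range(k)})


def run_agent(
    question: str,
    num_frames: int,
    caption_frame: Callable[[int], str],  # hypothetical tool: frame -> text
    propose_answer: Callable[[str, Dict[int, str]], Tuple[str, bool]],  # hypothetical LLM call
    retrieve_frames: Callable[[str, Dict[int, str]], List[int]],  # hypothetical retrieval tool
    initial_frames: int = 5,
    max_rounds: int = 3,
) -> Tuple[str, AgentState]:
    """Gather frame descriptions until the agent reports a confident answer."""
    state = AgentState()
    for idx in uniform_sample(num_frames, initial_frames):
        state.captions[idx] = caption_frame(idx)

    answer = ""
    for round_idx in range(max_rounds):
        # The central agent answers from the text gathered so far and
        # reports whether that evidence is sufficient.
        answer, confident = propose_answer(question, state.captions)
        if confident or round_idx == max_rounds - 1:
            break
        # Otherwise ask the retrieval tool for frames not yet inspected.
        new = [i for i in retrieve_frames(question, state.captions)
               if i not in state.captions]
        if not new:
            break
        for idx in new:
            state.captions[idx] = caption_frame(idx)
    return answer, state


# Toy stand-ins so the sketch runs without any model.
def toy_caption(idx: int) -> str:
    return "a person opens the fridge" if idx == 42 else "an empty kitchen"


def toy_answer(question: str, captions: Dict[int, str]) -> Tuple[str, bool]:
    found = any("fridge" in text for text in captions.values())
    return ("opens the fridge" if found else "unknown", found)


def toy_retrieve(question: str, captions: Dict[int, str]) -> List[int]:
    return [42]


answer, state = run_agent(
    "What does the person do?", 100, toy_caption, toy_answer, toy_retrieve
)
print(answer, sorted(state.captions))
```

In this toy run the first round lacks the relevant frame, so the loop retrieves one more frame and stops as soon as the answer is marked confident. This mirrors the idea of answering from a small number of frames rather than processing the whole video.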
