Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing

Abstract

Multi-modal large language models (MLLMs) have advanced general-purpose video understanding but struggle with long, high-resolution videos: they process every pixel equally in their vision transformers (ViTs) or LLMs despite significant spatiotemporal redundancy. We introduce AutoGaze, a lightweight module that removes redundant patches before they are processed by a ViT or an MLLM. Trained with next-token prediction and reinforcement learning, AutoGaze autoregressively selects a minimal set of multi-scale patches that can reconstruct the video within a user-specified error threshold, eliminating redundancy while preserving information. Empirically, AutoGaze reduces visual tokens by 4x-100x and accelerates ViTs and MLLMs by up to 19x. This makes it possible to scale MLLMs to 1K-frame, 4K-resolution videos and yields superior results on video benchmarks (e.g., 67.0% on VideoMME). Furthermore, we introduce HLVid, the first high-resolution, long-form video QA benchmark, consisting of 5-minute 4K-resolution videos. On HLVid, an MLLM scaled with AutoGaze improves over the baseline by 10.1% and outperforms the previous best MLLM by 4.5%. Project page: https://autogaze.github.io/.
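
To make the selection criterion concrete, the following is a minimal sketch of choosing patches until a video can be reconstructed within an error threshold. It is not AutoGaze itself: the learned autoregressive policy is replaced by a plain greedy rule, only a single patch scale is used, and the names `select_patches`, `patch_sq_error`, `make_frame`, `p`, and `max_mse` are hypothetical and introduced only for illustration. It assumes grayscale frames stored as nested lists, frame sizes divisible by `p`, and a non-negative `max_mse`.

```python
def patch_sq_error(frame, recon, r0, c0, p):
    """Sum of squared differences over one p x p patch."""
    return sum(
        (frame[r][c] - recon[r][c]) ** 2
        for r in range(r0, r0 + p)
        for c in range(c0, c0 + p)
    )


def select_patches(frames, p, max_mse):
    """Greedy, single-scale stand-in for gaze selection (not the learned model).

    For each frame, start from the reconstruction carried over from earlier
    frames and copy in the worst-reconstructed patches until the frame's
    mean squared error is <= max_mse.
    Returns one list of (row, col) patch origins per frame.
    """
    h, w = len(frames[0]), len(frames[0][0])
    recon = [[0.0] * w for _ in range(h)]  # nothing has been seen yet
    selected = []
    for frame in frames:
        errors = {
            (r0, c0): patch_sq_error(frame, recon, r0, c0, p)
            for r0 in range(0, h, p)
            for c0 in range(0, w, p)
        }
        chosen = []
        while sum(errors.values()) / (h * w) > max_mse:
            r0, c0 = max(errors, key=errors.get)  # worst patch first
            for r in range(r0, r0 + p):
                for c in range(c0, c0 + p):
                    recon[r][c] = frame[r][c]
            errors[(r0, c0)] = 0.0
            chosen.append((r0, c0))
        selected.append(chosen)
    return selected


def make_frame(h, w, bright):
    """Toy frame: 1.0 at the given (row, col) positions, 0.0 elsewhere."""
    return [[1.0 if (r, c) in bright else 0.0 for c in range(w)] for r in range(h)]


static = make_frame(4, 4, {(0, 0)})
moved = make_frame(4, 4, {(0, 0), (3, 3)})
picks = select_patches([static, static, moved], p=2, max_mse=0.0)
kept = sum(len(frame_picks) for frame_picks in picks)
print(picks, kept)
```

In this toy clip, the repeated second frame needs no patches and the third frame needs only the patch that changed, so 2 of the 12 patches are kept. Raising `max_mse` trades reconstruction fidelity for fewer selected patches, which mirrors the role of the user-specified error threshold described in the abstract.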
