
Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing

March 12, 2026
Authors: Baifeng Shi, Stephanie Fu, Long Lian, Hanrong Ye, David Eigen, Aaron Reite, Boyi Li, Jan Kautz, Song Han, David M. Chan, Pavlo Molchanov, Trevor Darrell, Hongxu Yin
cs.AI

Abstract

Multi-modal large language models (MLLMs) have advanced general-purpose video understanding but struggle with long, high-resolution videos -- they process every pixel equally in their vision transformers (ViTs) or LLMs despite significant spatiotemporal redundancy. We introduce AutoGaze, a lightweight module that removes redundant patches before they are processed by a ViT or an MLLM. Trained with next-token prediction and reinforcement learning, AutoGaze autoregressively selects a minimal set of multi-scale patches that can reconstruct the video within a user-specified error threshold, eliminating redundancy while preserving information. Empirically, AutoGaze reduces visual tokens by 4x-100x and accelerates ViTs and MLLMs by up to 19x, enabling MLLMs to scale to 1K-frame 4K-resolution videos and achieving superior results on video benchmarks (e.g., 67.0% on VideoMME). Furthermore, we introduce HLVid: the first high-resolution, long-form video QA benchmark with 5-minute 4K-resolution videos, where an MLLM scaled with AutoGaze improves over the baseline by 10.1% and outperforms the previous best MLLM by 4.5%. Project page: https://autogaze.github.io/.
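The selection principle described above -- keep adding patches until the remaining video can be reconstructed within an error threshold -- can be illustrated with a toy greedy sketch. Note this is not the paper's method: AutoGaze is a learned module trained with next-token prediction and reinforcement learning; the `select_patches` function, the mean-fill baseline reconstruction, and the MSE stopping criterion here are illustrative assumptions only.

```python
import numpy as np

def select_patches(video, err_thresh, patch=4):
    """Toy greedy analogue of autoregressive patch selection:
    repeatedly pick the patch that currently contributes the most
    reconstruction error, until mean squared error <= err_thresh.
    `video` is a (T, H, W) array; returns selected patch coords
    and the resulting reconstruction."""
    T, H, W = video.shape
    # Non-overlapping patch grid over all frames.
    coords = [(t, y, x) for t in range(T)
              for y in range(0, H, patch)
              for x in range(0, W, patch)]
    # Coarse baseline: fill everything with the global mean.
    recon = np.full_like(video, video.mean())
    selected = []

    def mse():
        return float(((video - recon) ** 2).mean())

    while mse() > err_thresh and len(selected) < len(coords):
        # Patch with the largest remaining squared error.
        best = max(
            (c for c in coords if c not in selected),
            key=lambda c: float(
                ((video[c[0], c[1]:c[1] + patch, c[2]:c[2] + patch]
                  - recon[c[0], c[1]:c[1] + patch, c[2]:c[2] + patch]) ** 2).sum()
            ),
        )
        t, y, x = best
        # "Attend" to this patch: copy it into the reconstruction.
        recon[t, y:y + patch, x:x + patch] = video[t, y:y + patch, x:x + patch]
        selected.append(best)
    return selected, recon

# A redundant video (mostly flat background, one bright region) needs
# only a few patches to meet the threshold -- the redundancy is skipped.
video = np.zeros((2, 8, 8))
video[0, 0:4, 0:4] = 1.0
selected, recon = select_patches(video, err_thresh=0.01, patch=4)
print(len(selected), "of", 8, "patches kept")
```

In this toy setup the loop halts after a handful of patches because the flat background contributes almost no reconstruction error, mirroring the abstract's claim that spatiotemporal redundancy lets most patches be dropped.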