QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension
March 11, 2025
Authors: Yongdong Luo, Wang Chen, Xiawu Zheng, Weizhong Huang, Shukang Yin, Haojia Lin, Chaoyou Fu, Jinfa Huang, Jiayi Ji, Jiebo Luo, Rongrong Ji
cs.AI
Abstract
Recent advances in long video understanding typically mitigate visual
redundancy through visual token pruning based on attention distribution.
However, while existing methods employ post-hoc low-response token pruning in
decoder layers, they overlook the input-level semantic correlation between
visual tokens and instructions (query). In this paper, we propose QuoTA, an
ante-hoc, training-free module that extends existing large video-language
models (LVLMs) for visual token assignment based on query-oriented frame-level
importance assessment. Query-oriented token selection is crucial as it
aligns visual processing with task-specific requirements, optimizing token
budget utilization while preserving semantically relevant content.
Specifically, (i) QuoTA strategically allocates frame-level importance scores
based on query relevance, enabling one-time visual token assignment before
cross-modal interactions in decoder layers, (ii) we decouple the query through
Chain-of-Thought reasoning to facilitate more precise LVLM-based frame
importance scoring, and (iii) QuoTA offers plug-and-play functionality that
extends to existing LVLMs. Extensive experimental results demonstrate that
implementing QuoTA with LLaVA-Video-7B yields an average performance
improvement of 3.2% across six benchmarks (including Video-MME and MLVU) while
operating within the same visual token budget as the baseline. Code is
open-sourced at https://github.com/MAC-AutoML/QuoTA.
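To make the allocation step concrete: the core idea is a one-time, query-aware split of a fixed visual-token budget across frames before any cross-modal interaction in the decoder. The Python sketch below is a minimal, hypothetical rendering of that step; assign_token_budget, its signature, and the example scores are illustrative assumptions, not the authors' released implementation (in QuoTA itself, the per-frame scores come from an LVLM prompted with the CoT-decoupled query).

```python
import numpy as np

def assign_token_budget(frame_scores, total_budget, min_per_frame=1):
    # Hypothetical sketch: spread a fixed visual-token budget over frames
    # in proportion to query-oriented importance scores, once, ante-hoc,
    # before decoding begins.
    scores = np.asarray(frame_scores, dtype=float)
    n = len(scores)
    total = scores.sum()
    # Normalize scores into weights; fall back to uniform if all zero.
    weights = scores / total if total > 0 else np.full(n, 1.0 / n)
    # Reserve a per-frame floor so no frame is dropped entirely, then
    # allocate the remainder proportionally to the weights.
    spare = total_budget - min_per_frame * n
    budget = np.floor(weights * spare).astype(int) + min_per_frame
    # Hand tokens lost to flooring to the highest-scoring frames, so the
    # allocation sums exactly to the budget.
    for idx in np.argsort(-weights)[: total_budget - budget.sum()]:
        budget[idx] += 1
    return budget

# Illustrative scores a CoT-prompted LVLM might assign to five frames.
print(assign_token_budget([0.1, 0.8, 0.3, 0.9, 0.05], total_budget=100))
# -> per-frame token counts summing to exactly 100
```

Allocating once at the input is what distinguishes this scheme from the post-hoc attention-based pruning inside decoder layers that the abstract critiques, which discards low-response tokens without considering the query's semantics.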