QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension
March 11, 2025
Authors: Yongdong Luo, Wang Chen, Xiawu Zheng, Weizhong Huang, Shukang Yin, Haojia Lin, Chaoyou Fu, Jinfa Huang, Jiayi Ji, Jiebo Luo, Rongrong Ji
cs.AI
Abstract
Recent advances in long video understanding typically mitigate visual
redundancy through visual token pruning based on attention distribution.
However, while existing methods employ post-hoc low-response token pruning in
decoder layers, they overlook the input-level semantic correlation between
visual tokens and instructions (query). In this paper, we propose QuoTA, an
ante-hoc, training-free module that extends existing large video-language
models (LVLMs) for visual token assignment based on query-oriented frame-level
importance assessment. Query-oriented token selection is crucial as it
aligns visual processing with task-specific requirements, optimizing token
budget utilization while preserving semantically relevant content.
Specifically, (i) QuoTA strategically allocates frame-level importance scores
based on query relevance, enabling one-time visual token assignment before
cross-modal interactions in decoder layers, (ii) we decouple the query through
Chain-of-Thought reasoning to facilitate more precise LVLM-based frame
importance scoring, and (iii) QuoTA offers plug-and-play functionality that
extends to existing LVLMs. Extensive experimental results demonstrate that
implementing QuoTA with LLaVA-Video-7B yields an average performance
improvement of 3.2% across six benchmarks (including Video-MME and MLVU) while
operating within the same visual token budget as the baseline. Code is
open-sourced at https://github.com/MAC-AutoML/QuoTA.
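To make the allocation step concrete: the core idea is a one-time, query-aware split of a fixed visual-token budget across frames before any cross-modal interaction in the decoder. The Python sketch below is a minimal, hypothetical rendering of that step; assign_token_budget, its signature, and the example scores are illustrative assumptions, not the authors' released implementation (in QuoTA itself, the per-frame scores come from an LVLM prompted with the CoT-decoupled query).

```python
import numpy as np

def assign_token_budget(frame_scores, total_budget, min_per_frame=1):
    # Hypothetical sketch: spread a fixed visual-token budget over frames
    # in proportion to query-oriented importance scores, once, ante-hoc,
    # before decoding begins.
    scores = np.asarray(frame_scores, dtype=float)
    n = len(scores)
    total = scores.sum()
    # Normalize scores into weights; fall back to uniform if all zero.
    weights = scores / total if total > 0 else np.full(n, 1.0 / n)
    # Reserve a per-frame floor so no frame is dropped entirely, then
    # allocate the remainder proportionally to the weights.
    spare = total_budget - min_per_frame * n
    budget = np.floor(weights * spare).astype(int) + min_per_frame
    # Hand tokens lost to flooring to the highest-scoring frames, so the
    # allocation sums exactly to the budget.
    for idx in np.argsort(-weights)[: total_budget - budget.sum()]:
        budget[idx] += 1
    return budget

# Illustrative scores a CoT-prompted LVLM might assign to five frames.
print(assign_token_budget([0.1, 0.8, 0.3, 0.9, 0.05], total_budget=100))
# -> per-frame token counts summing to exactly 100
```

Allocating once at the input is what distinguishes this scheme from the post-hoc attention-based pruning inside decoder layers that the abstract critiques, which discards low-response tokens without considering the query's semantics.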