ChatPaper.ai


AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding

March 30, 2026
Authors: Haozhe Qi, Kevin Qu, Mahdi Rad, Rui Wang, Alexander Mathis, Marc Pollefeys
cs.AI

Abstract

Long video understanding remains challenging for Multi-modal Large Language Models (MLLMs) due to high memory costs and context-length limits. Prior approaches mitigate this by scoring and selecting frames/tokens within short clips, but they lack a principled mechanism to (i) compare relevance across distant video clips and (ii) stop processing once sufficient evidence has been gathered. We propose AdaptToken, a training-free framework that turns an MLLM's self-uncertainty into a global control signal for long-video token selection. AdaptToken splits a video into groups, extracts cross-modal attention to rank tokens within each group, and uses the model's response entropy to estimate each group's prompt relevance. This entropy signal enables a global token budget allocation across groups and further supports early stopping (AdaptToken-Lite), skipping the remaining groups when the model becomes sufficiently certain. Across four long-video benchmarks (VideoMME, LongVideoBench, LVBench, and MLVU) and multiple base MLLMs (7B-72B), AdaptToken consistently improves accuracy (e.g., +6.7 on average over Qwen2.5-VL 7B) and continues to benefit from extremely long inputs (up to 10K frames), while AdaptToken-Lite reduces inference time by about half with comparable performance. Project page: https://haozheqi.github.io/adapt-token
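The pipeline in the abstract (entropy-based group relevance, global token-budget allocation, and early stopping) can be sketched as below. This is a minimal illustration, not the paper's implementation: the function names, the inverse-entropy relevance proxy, and the single stopping threshold are all assumptions for the sake of the example.

```python
import math

def response_entropy(probs):
    """Shannon entropy of a model's answer distribution (lower = more certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def allocate_budget(group_entropies, total_budget):
    """Split a global token budget across video groups.

    Assumes relevance can be proxied by inverse entropy (a confident
    response to a group suggests the group is prompt-relevant); the
    paper's exact weighting scheme may differ.
    """
    relevance = [1.0 / (h + 1e-6) for h in group_entropies]
    total = sum(relevance)
    return [round(total_budget * r / total) for r in relevance]

def early_stop_scan(group_entropies, threshold):
    """AdaptToken-Lite-style early stop: process groups in order and
    stop once the model's uncertainty falls below the threshold.
    Returns the indices of the groups actually processed."""
    processed = []
    for i, h in enumerate(group_entropies):
        processed.append(i)
        if h < threshold:
            break
    return processed
```

For example, with per-group entropies `[2.0, 1.0, 1.0]` and a 400-token budget, the more certain (lower-entropy) groups receive larger shares, and a scan with entropies `[1.5, 0.4, 1.2]` against threshold 0.5 stops after the second group, skipping the rest.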
PDF · April 1, 2026