ChatPaper.ai


AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding

March 30, 2026
Authors: Haozhe Qi, Kevin Qu, Mahdi Rad, Rui Wang, Alexander Mathis, Marc Pollefeys
cs.AI

Abstract

Long video understanding remains challenging for Multi-modal Large Language Models (MLLMs) due to high memory costs and context-length limits. Prior approaches mitigate this by scoring and selecting frames/tokens within short clips, but they lack a principled mechanism to (i) compare relevance across distant video clips and (ii) stop processing once sufficient evidence has been gathered. We propose AdaptToken, a training-free framework that turns an MLLM's self-uncertainty into a global control signal for long-video token selection. AdaptToken splits a video into groups, extracts cross-modal attention to rank tokens within each group, and uses the model's response entropy to estimate each group's prompt relevance. This entropy signal enables a global token budget allocation across groups and further supports early stopping (AdaptToken-Lite), skipping the remaining groups when the model becomes sufficiently certain. Across four long-video benchmarks (VideoMME, LongVideoBench, LVBench, and MLVU) and multiple base MLLMs (7B-72B), AdaptToken consistently improves accuracy (e.g., +6.7 on average over Qwen2.5-VL 7B) and continues to benefit from extremely long inputs (up to 10K frames), while AdaptToken-Lite reduces inference time by about half with comparable performance. Project page: https://haozheqi.github.io/adapt-token
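The pipeline in the abstract (entropy-based group relevance, global token-budget allocation, and early stopping) can be sketched as below. This is a minimal illustration, not the paper's implementation: the function names, the inverse-entropy relevance proxy, and the single stopping threshold are all assumptions for the sake of the example.

```python
import math

def response_entropy(probs):
    """Shannon entropy of a model's answer distribution (lower = more certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def allocate_budget(group_entropies, total_budget):
    """Split a global token budget across video groups.

    Assumes relevance can be proxied by inverse entropy (a confident
    response to a group suggests the group is prompt-relevant); the
    paper's exact weighting scheme may differ.
    """
    relevance = [1.0 / (h + 1e-6) for h in group_entropies]
    total = sum(relevance)
    return [round(total_budget * r / total) for r in relevance]

def early_stop_scan(group_entropies, threshold):
    """AdaptToken-Lite-style early stop: process groups in order and
    stop once the model's uncertainty falls below the threshold.
    Returns the indices of the groups actually processed."""
    processed = []
    for i, h in enumerate(group_entropies):
        processed.append(i)
        if h < threshold:
            break
    return processed
```

For example, with per-group entropies `[2.0, 1.0, 1.0]` and a 400-token budget, the more certain (lower-entropy) groups receive larger shares, and a scan with entropies `[1.5, 0.4, 1.2]` against threshold 0.5 stops after the second group, skipping the rest.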
PDF · April 1, 2026