

AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding

March 30, 2026
Authors: Haozhe Qi, Kevin Qu, Mahdi Rad, Rui Wang, Alexander Mathis, Marc Pollefeys
cs.AI

Abstract

Long video understanding remains challenging for Multi-modal Large Language Models (MLLMs) due to high memory costs and context-length limits. Prior approaches mitigate this by scoring and selecting frames/tokens within short clips, but they lack a principled mechanism to (i) compare relevance across distant video clips and (ii) stop processing once sufficient evidence has been gathered. We propose AdaptToken, a training-free framework that turns an MLLM's self-uncertainty into a global control signal for long-video token selection. AdaptToken splits a video into groups, extracts cross-modal attention to rank tokens within each group, and uses the model's response entropy to estimate each group's prompt relevance. This entropy signal enables a global token budget allocation across groups and further supports early stopping (AdaptToken-Lite), skipping the remaining groups when the model becomes sufficiently certain. Across four long-video benchmarks (VideoMME, LongVideoBench, LVBench, and MLVU) and multiple base MLLMs (7B-72B), AdaptToken consistently improves accuracy (e.g., +6.7 on average over Qwen2.5-VL 7B) and continues to benefit from extremely long inputs (up to 10K frames), while AdaptToken-Lite reduces inference time by about half with comparable performance. Project page: https://haozheqi.github.io/adapt-token
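The control loop described in the abstract (per-group entropy scoring, global token-budget allocation, and AdaptToken-Lite early stopping) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the softmax-over-negative-entropy weighting, and the stopping threshold are all assumptions made for clarity.

```python
import math

def response_entropy(probs):
    """Shannon entropy of the model's answer distribution.

    Lower entropy means the model is more certain, which AdaptToken
    treats as a signal that the current group is prompt-relevant.
    """
    return -sum(p * math.log(p) for p in probs if p > 0)

def allocate_budget(group_entropies, total_budget):
    """Distribute a global token budget across video groups.

    Groups where the model is more confident (lower entropy) receive
    larger budgets. The softmax over negative entropy is an
    illustrative weighting choice, not the paper's exact scheme.
    """
    weights = [math.exp(-h) for h in group_entropies]
    z = sum(weights)
    return [round(total_budget * w / z) for w in weights]

def early_stop_index(group_entropies, threshold=0.5):
    """AdaptToken-Lite-style early stopping (hypothetical threshold).

    Returns how many groups to process before skipping the rest:
    once a group's response entropy falls below the threshold, the
    model is deemed sufficiently certain and processing halts.
    """
    for i, h in enumerate(group_entropies):
        if h < threshold:
            return i + 1  # groups [0..i] processed; remainder skipped
    return len(group_entropies)
```

For example, with per-group entropies `[1.2, 0.8, 0.3, 0.9]` and a 1,000-token budget, the lowest-entropy (most relevant) group receives the largest share, and `early_stop_index` halts after the third group, mirroring how AdaptToken-Lite skips remaining groups to roughly halve inference time.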
April 1, 2026