ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning
March 30, 2026
Authors: Huanxuan Liao, Zhongtao Jiang, Yupu Hao, Yuqiao Tan, Shizhu He, Jun Zhao, Kun Xu, Kang Liu
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) achieve stronger visual understanding by scaling input fidelity, yet the resulting visual token growth makes jointly sustaining high spatial resolution and long temporal context prohibitive. We argue that the bottleneck lies not in how post-encoding representations are compressed but in the volume of pixels the encoder receives, and address it with ResAdapt, an input-side adaptation framework that learns how much visual budget each frame should receive before encoding. ResAdapt couples a lightweight Allocator with an unchanged MLLM backbone, so the backbone retains its native visual-token interface while receiving an operator-transformed input. We formulate allocation as a contextual bandit and train the Allocator with Cost-Aware Policy Optimization (CAPO), which converts sparse rollout feedback into a stable accuracy-cost learning signal. Across budget-controlled video QA, temporal grounding, and image reasoning tasks, ResAdapt improves low-budget operating points and often lies on or near the efficiency-accuracy frontier, with the clearest gains on reasoning-intensive benchmarks under aggressive compression. Notably, ResAdapt supports up to 16x more frames at the same visual budget while delivering over 15% performance gain. Code is available at https://github.com/Xnhyacinth/ResAdapt.
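To make the contextual-bandit framing concrete, the following is a minimal toy sketch of per-frame budget allocation with a cost-aware reward. All specifics here are assumptions for illustration, not the paper's CAPO algorithm: the candidate resolutions, the epsilon-greedy policy, and the linear accuracy-minus-cost reward are placeholders standing in for the learned Allocator and its training signal.

```python
import random

# Assumed per-frame resolution choices (the bandit's arms); the real
# operator set in ResAdapt is not specified in this abstract.
RESOLUTIONS = [64, 128, 256, 448]

def cost_aware_reward(accuracy, tokens_used, token_budget, lam=0.5):
    """Toy accuracy-cost signal: rollout accuracy minus a penalty
    proportional to the fraction of the visual budget consumed.
    The tradeoff weight `lam` is a made-up hyperparameter."""
    return accuracy - lam * (tokens_used / token_budget)

class EpsilonGreedyAllocator:
    """Toy contextual-bandit allocator: one running mean reward per
    (context, resolution) pair, epsilon-greedy arm selection."""

    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        self.values = {}   # (context, arm) -> running mean reward
        self.counts = {}   # (context, arm) -> number of updates

    def choose(self, context):
        # Explore with probability epsilon, else exploit best estimate.
        if random.random() < self.epsilon:
            return random.choice(RESOLUTIONS)
        return max(RESOLUTIONS,
                   key=lambda a: self.values.get((context, a), 0.0))

    def update(self, context, arm, reward):
        # Incremental mean update from sparse rollout feedback.
        key = (context, arm)
        n = self.counts.get(key, 0) + 1
        self.counts[key] = n
        mean = self.values.get(key, 0.0)
        self.values[key] = mean + (reward - mean) / n
```

Under this sketch, a frame whose context repeatedly yields correct answers at a low resolution accumulates higher reward estimates for the cheap arm, steering future allocations toward it.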