ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning

Abstract

Multimodal Large Language Models (MLLMs) achieve stronger visual understanding by scaling input fidelity, yet the resulting growth in visual tokens makes it prohibitive to sustain high spatial resolution and long temporal context at the same time. We argue that the bottleneck lies not in how post-encoding representations are compressed but in the volume of pixels the encoder receives, and address it with ResAdapt, an input-side adaptation framework that learns how much visual budget each frame should receive before encoding. ResAdapt couples a lightweight Allocator with an unchanged MLLM backbone, so the backbone retains its native visual-token interface while receiving an operator-transformed input. We formulate allocation as a contextual bandit and train the Allocator with Cost-Aware Policy Optimization (CAPO), which converts sparse rollout feedback into a stable accuracy-cost learning signal. Across budget-controlled video QA, temporal grounding, and image reasoning tasks, ResAdapt improves low-budget operating points and often lies on or near the efficiency-accuracy frontier, with the clearest gains on reasoning-intensive benchmarks under aggressive compression. Notably, ResAdapt supports up to 16x more frames at the same visual budget while delivering a performance gain of over 15%. Code is available at https://github.com/Xnhyacinth/ResAdapt.
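The trade-off between spatial resolution and frame count under a fixed visual budget can be illustrated with a small calculation. The sketch below is not taken from the paper: it assumes a hypothetical patch-based encoder with 14x14-pixel patches and a 2x2 patch merge, so that one visual token covers a 28x28-pixel cell, and it applies a single uniform scale to every frame rather than a learned per-frame allocation. The function names `visual_tokens` and `frames_within_budget` are illustrative only.

```python
import math


def visual_tokens(height: int, width: int, patch: int = 14, merge: int = 2) -> int:
    """Token count for one frame under an ASSUMED encoder layout.

    Assumption (not from the paper): each token covers a
    (patch * merge) x (patch * merge) pixel cell.
    """
    cell = patch * merge
    return math.ceil(height / cell) * math.ceil(width / cell)


def frames_within_budget(budget: int, height: int, width: int, scale: float) -> int:
    """Number of frames that fit in `budget` tokens when every frame is
    resized by `scale` per side before encoding (uniform scaling, a
    simplification of per-frame allocation)."""
    h = max(1, round(height * scale))
    w = max(1, round(width * scale))
    return budget // visual_tokens(h, w)


if __name__ == "__main__":
    budget = 4096  # hypothetical total visual-token budget
    full = frames_within_budget(budget, 448, 448, scale=1.0)
    quarter = frames_within_budget(budget, 448, 448, scale=0.25)
    print(full, quarter, quarter // full)
```

Under these assumptions, a 448x448 frame costs 256 tokens and a 112x112 frame costs 16, so reducing each side by a factor of 4 lets 16 times as many frames fit in the same budget. This is one way a 16x figure can arise arithmetically; the abstract does not state how its own figure is obtained.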
