Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

Abstract

Multimodal Large Language Models (MLLMs) require high-resolution visual inputs for fine-grained tasks such as document understanding and dense scene perception. However, current global resolution scaling paradigms indiscriminately flood the quadratic self-attention mechanism with visually redundant tokens, severely bottlenecking inference throughput while ignoring spatial sparsity and query intent. To overcome this, we propose Q-Zoom, a query-aware adaptive high-resolution perception framework that operates in an efficient coarse-to-fine manner. First, a lightweight Dynamic Gating Network safely bypasses high-resolution processing when coarse global features suffice. Second, for queries demanding fine-grained perception, a Self-Distilled Region Proposal Network (SD-RPN) precisely localizes the task-relevant Region-of-Interest (RoI) directly from intermediate feature spaces. To optimize these modules efficiently, the gating network uses a consistency-aware generation strategy to derive deterministic routing labels, while the SD-RPN is trained with a fully self-supervised distillation paradigm. A continuous spatio-temporal alignment scheme and targeted fine-tuning then seamlessly fuse the dense local RoI with the coarse global layout. Extensive experiments demonstrate that Q-Zoom establishes a dominant Pareto frontier. Using Qwen2.5-VL-7B as the primary testbed, Q-Zoom accelerates inference by 2.52× on Document & OCR benchmarks and by 4.39× in High-Resolution scenarios while matching the baseline's peak accuracy. Furthermore, when configured for maximum perceptual fidelity, Q-Zoom surpasses the baseline's peak performance by 1.1% and 8.1% on these respective benchmarks. These robust improvements transfer seamlessly to Qwen3-VL, LLaVA, and emerging RL-based thinking-with-image models. The project page is available at https://yuhengsss.github.io/Q-Zoom/.
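
The coarse-to-fine control flow described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `route` and `roi_from_heatmap`, the scalar `gate_score`, the relevance `heatmap`, and both thresholds are hypothetical stand-ins for the outputs of the Dynamic Gating Network and the SD-RPN, and the bounding-box rule is an assumed simplification of RoI localization.

```python
from typing import List, Optional, Tuple

# Inclusive box in grid cells: (row_min, col_min, row_max, col_max).
Box = Tuple[int, int, int, int]


def roi_from_heatmap(heatmap: List[List[float]], threshold: float) -> Optional[Box]:
    """Return the tightest box around all cells scoring >= threshold, or None.

    Assumption: the region proposal step yields a per-cell relevance map;
    thresholding it is a stand-in for the real localization procedure.
    """
    hits = [
        (r, c)
        for r, row in enumerate(heatmap)
        for c, value in enumerate(row)
        if value >= threshold
    ]
    if not hits:
        return None
    rows = [r for r, _ in hits]
    cols = [c for _, c in hits]
    return (min(rows), min(cols), max(rows), max(cols))


def route(
    gate_score: float,
    heatmap: List[List[float]],
    gate_threshold: float = 0.5,  # hypothetical value
    roi_threshold: float = 0.5,   # hypothetical value
) -> Tuple[str, Optional[Box]]:
    """Pick the processing path for one (image, query) pair.

    gate_score is assumed to be the gate's estimate that the query needs
    fine-grained perception. Below the threshold, only the coarse global
    view is used; otherwise a region is proposed for high-resolution
    re-encoding alongside the coarse view.
    """
    if gate_score < gate_threshold:
        return ("coarse", None)
    box = roi_from_heatmap(heatmap, roi_threshold)
    if box is None:
        # Nothing salient enough to zoom into: fall back to the coarse path.
        return ("coarse", None)
    return ("zoom", box)


heatmap = [
    [0.1, 0.2, 0.1, 0.0],
    [0.1, 0.8, 0.9, 0.1],
    [0.0, 0.7, 0.6, 0.1],
    [0.0, 0.1, 0.1, 0.0],
]
print(route(0.9, heatmap))  # ('zoom', (1, 1, 2, 2))
print(route(0.2, heatmap))  # ('coarse', None)
```

In this sketch the expensive high-resolution path is taken only when both the gate and the region proposal agree that it is needed, which is the source of the throughput savings the abstract attributes to bypassing high-resolution processing.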