
Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

April 8, 2026
Authors: Yuheng Shi, Xiaohuan Pei, Linfeng Wen, Minjing Dong, Chang Xu
cs.AI

Abstract

MLLMs require high-resolution visual inputs for fine-grained tasks like document understanding and dense scene perception. However, current global resolution scaling paradigms indiscriminately flood the quadratic self-attention mechanism with visually redundant tokens, severely bottlenecking inference throughput while ignoring spatial sparsity and query intent. To overcome this, we propose Q-Zoom, a query-aware adaptive high-resolution perception framework that operates in an efficient coarse-to-fine manner. First, a lightweight Dynamic Gating Network safely bypasses high-resolution processing when coarse global features suffice. Second, for queries demanding fine-grained perception, a Self-Distilled Region Proposal Network (SD-RPN) precisely localizes the task-relevant Region-of-Interest (RoI) directly from intermediate feature spaces. To optimize these modules efficiently, the gating network uses a consistency-aware generation strategy to derive deterministic routing labels, while the SD-RPN employs a fully self-supervised distillation paradigm. A continuous spatio-temporal alignment scheme and targeted fine-tuning then seamlessly fuse the dense local RoI with the coarse global layout. Extensive experiments demonstrate that Q-Zoom establishes a dominant Pareto frontier. Using Qwen2.5-VL-7B as a primary testbed, Q-Zoom accelerates inference by 2.52 times on Document & OCR benchmarks and 4.39 times in High-Resolution scenarios while matching the baseline's peak accuracy. Furthermore, when configured for maximum perceptual fidelity, Q-Zoom surpasses the baseline's peak performance by 1.1% and 8.1% on these respective benchmarks. These robust improvements transfer seamlessly to Qwen3-VL, LLaVA, and emerging RL-based thinking-with-image models. Project page is available at https://yuhengsss.github.io/Q-Zoom/.
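The coarse-to-fine routing described above can be sketched in a few lines. This is a toy illustration only, under assumptions not stated in the abstract: the function names (`gating_network`, `region_proposal`, `q_zoom_forward`), the variance-threshold gate, and the argmax RoI heuristic are all hypothetical stand-ins for the paper's learned Dynamic Gating Network and SD-RPN, not its actual implementation.

```python
import numpy as np

def gating_network(global_features: np.ndarray) -> bool:
    """Lightweight gate: True means coarse global features suffice.
    Approximated here by a feature-variance threshold (assumption)."""
    return float(global_features.var()) < 0.5

def region_proposal(image_patches: np.ndarray) -> tuple:
    """Stand-in for the Self-Distilled RPN: pick the patch with the
    highest mean activation as the task-relevant RoI (toy heuristic)."""
    idx = int(np.argmax(image_patches.mean(axis=-1)))
    return (idx, idx + 1)  # (start, end) patch range

def q_zoom_forward(image_patches: np.ndarray) -> str:
    global_feats = image_patches.mean(axis=0, keepdims=True)  # coarse pass
    if gating_network(global_feats):
        return "answer-from-coarse"            # bypass high-res branch
    start, end = region_proposal(image_patches)  # localize the RoI
    dense = image_patches[start:end]             # dense tokens for RoI only
    # Fuse the dense local RoI with the coarse global layout for the LLM.
    fused = np.concatenate([global_feats, dense], axis=0)
    return f"answer-from-fused({fused.shape[0]} tokens)"
```

The efficiency argument is visible even in this sketch: the quadratic self-attention cost is paid only over `global_feats` plus the small RoI slice, rather than over all high-resolution tokens.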