Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception
February 12, 2026
Authors: Lai Wei, Liangbo He, Jun Lan, Lingzhong Dong, Yutong Cai, Siyuan Li, Huijia Zhu, Weiqiang Wang, Linghe Kong, Yue Wang, Zhuosheng Zhang, Weiran Huang
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) excel at broad visual understanding but still struggle with fine-grained perception, where the decisive evidence is small and easily overwhelmed by global context. Recent "Thinking-with-Images" methods alleviate this by iteratively zooming into regions of interest during inference, but they incur high latency due to repeated tool calls and visual re-encoding. To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM. In particular, we first zoom into micro-cropped regions to let strong teacher models generate high-quality VQA data, and then distill this region-grounded supervision back to the full image. After training on such data, the smaller student model improves its "single-glance" fine-grained perception without tool use. To rigorously evaluate this capability, we further present ZoomBench, a hybrid-annotated benchmark of 845 VQA items spanning six fine-grained perceptual dimensions, together with a dual-view protocol that quantifies the global–regional "zooming gap". Experiments show that our models achieve leading performance across multiple fine-grained perception benchmarks, and also improve general multimodal cognition on tasks such as visual reasoning and GUI agents. We further discuss when "Thinking-with-Images" is necessary versus when its gains can be distilled into a single forward pass. Our code is available at https://github.com/inclusionAI/Zooming-without-Zooming.
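To make the data-construction step concrete, the following is a minimal Python sketch of the region-to-image loop described in the abstract: zoom into micro-cropped regions so a strong teacher can write high-quality QA pairs, then pair that supervision with the full image for the student. It assumes the micro-crop boxes and a teacher VQA callable are supplied externally; the names `Sample`, `teacher_vqa`, and `zoom_factor` are illustrative and not part of the released implementation.

```python
# Illustrative sketch of Region-to-Image Distillation data generation
# (not the authors' released code).
from dataclasses import dataclass
from typing import Callable, List, Tuple

from PIL import Image


@dataclass
class Sample:
    image: Image.Image  # the original full image shown to the student
    question: str       # question grounded in a small region
    answer: str         # answer produced by the teacher on the zoomed crop


def build_region_to_image_data(
    image: Image.Image,
    regions: List[Tuple[int, int, int, int]],            # (left, top, right, bottom) micro-crops
    teacher_vqa: Callable[[Image.Image], Tuple[str, str]],  # crop -> (question, answer)
    zoom_factor: int = 4,
) -> List[Sample]:
    """Zoom into each micro-cropped region so the teacher can generate a
    high-quality QA pair, then attach that QA to the *full* image so the
    student learns to answer it in a single glance, without tool use."""
    samples = []
    for box in regions:
        crop = image.crop(box)
        # Up-sample the crop so fine details are legible to the teacher model.
        zoomed = crop.resize(
            (crop.width * zoom_factor, crop.height * zoom_factor),
            Image.BICUBIC,
        )
        question, answer = teacher_vqa(zoomed)
        # Key step: the region-grounded supervision is distilled back onto
        # the full image, not the crop.
        samples.append(Sample(image=image, question=question, answer=answer))
    return samples
```

The design choice this sketch highlights is that the zoomed crop is only ever seen by the teacher; the student is trained on the full image paired with region-grounded QA, which is what removes the need for inference-time zooming.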