

Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception

February 12, 2026
Authors: Lai Wei, Liangbo He, Jun Lan, Lingzhong Dong, Yutong Cai, Siyuan Li, Huijia Zhu, Weiqiang Wang, Linghe Kong, Yue Wang, Zhuosheng Zhang, Weiran Huang
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) excel at broad visual understanding but still struggle with fine-grained perception, where decisive evidence is small and easily overwhelmed by global context. Recent "Thinking-with-Images" methods alleviate this by iteratively zooming in and out of regions of interest during inference, but incur high latency due to repeated tool calls and visual re-encoding. To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM. In particular, we first zoom in to micro-cropped regions to let strong teacher models generate high-quality VQA data, and then distill this region-grounded supervision back to the full image. After training on such data, the smaller student model improves "single-glance" fine-grained perception without tool use. To rigorously evaluate this capability, we further present ZoomBench, a hybrid-annotated benchmark of 845 VQA samples spanning six fine-grained perceptual dimensions, together with a dual-view protocol that quantifies the global-regional "zooming gap". Experiments show that our models achieve leading performance across multiple fine-grained perception benchmarks, and also improve general multimodal cognition on benchmarks such as visual reasoning and GUI agents. We further discuss when "Thinking-with-Images" is necessary versus when its gains can be distilled into a single forward pass. Our code is available at https://github.com/inclusionAI/Zooming-without-Zooming.
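
The abstract describes a two-stage recipe (a teacher writes QA pairs about enlarged micro-crops, and that supervision is attached back to the full image) plus a dual-view "zooming gap" metric. The sketch below illustrates both ideas under stated assumptions: it uses Pillow for image handling, and the callables `teacher_vqa` and `student_answer`, the bounding-box format, and the 4x upscale factor are illustrative placeholders rather than the released pipeline.

```python
# Hypothetical sketch of Region-to-Image Distillation data generation and the
# dual-view "zooming gap" evaluation, as described in the abstract. Names and
# parameters are assumptions, not the authors' released implementation.
from dataclasses import dataclass
from typing import Callable, List, Tuple

from PIL import Image


@dataclass
class RegionVQASample:
    image_path: str                      # full image the QA pair is distilled back to
    bbox: Tuple[int, int, int, int]      # (left, top, right, bottom) of the micro-crop
    question: str
    answer: str


def make_region_grounded_sample(
    image_path: str,
    bbox: Tuple[int, int, int, int],
    teacher_vqa: Callable[[Image.Image], Tuple[str, str]],
    upscale: int = 4,
) -> RegionVQASample:
    """Zoom in to a micro-cropped region, let a strong teacher write a QA pair
    about it, then attach that QA pair to the *full* image for training."""
    with Image.open(image_path) as img:
        crop = img.crop(bbox)
        # Enlarge the tiny region so fine details are legible to the teacher.
        zoomed = crop.resize((crop.width * upscale, crop.height * upscale), Image.BICUBIC)
        question, answer = teacher_vqa(zoomed)
    # The student is trained on the full image: region-grounded supervision is
    # distilled back to the global view, so no tool call is needed at inference.
    return RegionVQASample(image_path, bbox, question, answer)


def zooming_gap(
    samples: List[RegionVQASample],
    student_answer: Callable[[Image.Image, str], str],
) -> float:
    """Dual-view protocol: accuracy when the model sees only the region crop
    minus accuracy on the full image. A large positive gap indicates the model
    still benefits from explicit zooming."""
    full_correct = region_correct = 0
    for s in samples:
        with Image.open(s.image_path) as img:
            full_pred = student_answer(img, s.question)
            region_pred = student_answer(img.crop(s.bbox), s.question)
        full_correct += int(full_pred.strip().lower() == s.answer.strip().lower())
        region_correct += int(region_pred.strip().lower() == s.answer.strip().lower())
    n = max(len(samples), 1)
    return region_correct / n - full_correct / n
```

In this reading, distillation happens purely through data: the student never receives crops at training or inference time, so the gains of agentic zooming are absorbed into a single forward pass over the full image.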