Agentic Jigsaw Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models
October 1, 2025
Authors: Yu Zeng, Wenxuan Huang, Shiting Huang, Xikun Bao, Yukun Qi, Yiming Zhao, Qiuchen Wang, Lin Chen, Zehui Chen, Huaian Chen, Wanli Ouyang, Feng Zhao
cs.AI
Abstract
Although current large Vision-Language Models (VLMs) have advanced in
multimodal understanding and reasoning, their fundamental perceptual and
reasoning abilities remain limited. Specifically, even on simple jigsaw tasks,
existing VLMs perform near randomly, revealing deficiencies in core perception
and reasoning capabilities. While high-quality vision-language data can enhance
these capabilities, its scarcity and limited scalability impose significant
constraints. To address this, we propose AGILE (Agentic jiGsaw Interaction
Learning for Enhancing visual perception and reasoning in VLMs). AGILE
formulates jigsaw solving as an interactive process, enabling the model to
progressively engage with the environment. At each step, the model generates
executable code to perform an action based on the current state, while the
environment provides fine-grained visual feedback to guide task completion.
Through this iterative cycle of observation and interaction, the model
incrementally improves its perceptual and reasoning capabilities via
exploration and feedback. Experimental results show that AGILE not only
substantially boosts performance on jigsaw tasks of varying complexity (e.g.,
increasing accuracy from 9.5% to 82.8% under the 2 × 2 setting) but also
demonstrates strong generalization across 9 general vision tasks, achieving an
average improvement of 3.1%. These results indicate notable enhancements in
both perceptual and reasoning abilities. This work opens a new avenue for
advancing reasoning and generalization in multimodal models and provides an
efficient, scalable solution to the scarcity of multimodal reinforcement
learning data. The code and datasets are available at
https://github.com/yuzeng0-0/AGILE.
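To make the interaction loop the abstract describes more concrete, the sketch below mimics one episode of agentic jigsaw solving: the agent observes the current state, proposes an action, and the environment executes it and returns feedback until the puzzle is assembled. Everything here is an illustrative assumption; `JigsawEnv`, `propose_action`, and the swap-based action space are hypothetical stand-ins, not AGILE's actual interface (in the real system the VLM generates executable code from rendered images; see the repository above).

```python
# Minimal sketch of an observe-act-feedback jigsaw loop.
# JigsawEnv, propose_action, and the swap action space are illustrative
# assumptions, not AGILE's actual API (see the linked repository).
import random
from dataclasses import dataclass, field

@dataclass
class JigsawEnv:
    """Toy 2x2 jigsaw: the state is a permutation of piece indices 0..3."""
    state: list = field(default_factory=lambda: random.sample(range(4), 4))

    def observe(self):
        # AGILE's environment returns fine-grained *visual* feedback;
        # this toy version exposes the raw permutation instead.
        return list(self.state)

    def step(self, i, j):
        # Execute a swap action and report whether the puzzle is solved.
        self.state[i], self.state[j] = self.state[j], self.state[i]
        return self.observe(), self.state == sorted(self.state)

def propose_action(observation):
    """Stand-in for the VLM policy: pick a swap that fixes one misplaced
    piece. The actual model instead generates executable code from the image."""
    for pos, piece in enumerate(observation):
        if piece != pos:                      # first misplaced piece
            return pos, observation.index(pos)
    return None                               # already solved

env = JigsawEnv()
obs = env.observe()
for step in range(8):                         # bounded interaction budget
    action = propose_action(obs)
    if action is None:
        break
    obs, solved = env.step(*action)
    print(f"step {step}: swap {action} -> state={obs}, solved={solved}")
```

In the paper's setting, the observation is a rendered image of the scrambled pieces and the feedback is visual; the toy permutation state above only stands in for that loop structure so the iterative observe-act-feedback cycle is easy to see.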