V-Thinker: Interactive Thinking with Images
November 6, 2025
Authors: Runqi Qiao, Qiuna Tan, Minghan Yang, Guanting Dong, Peiqing Yang, Shiqiang Lang, Enhui Wan, Xiaowan Wang, Yida Xu, Lan Yang, Chong Sun, Chen Li, Honggang Zhang
cs.AI
Abstract
Empowering Large Multimodal Models (LMMs) to deeply integrate image interaction with long-horizon reasoning capabilities remains a long-standing challenge in this field. Recent advances in vision-centric reasoning explore a promising "Thinking with Images" paradigm for LMMs, marking a shift from image-assisted reasoning to image-interactive thinking. While this milestone enables models to focus on fine-grained image regions, progress remains constrained by limited visual tool spaces and task-specific workflow designs. To bridge this gap, we present V-Thinker, a general-purpose multimodal reasoning assistant that enables interactive, vision-centric thinking through end-to-end reinforcement learning. V-Thinker comprises two key components: (1) a Data Evolution Flywheel that automatically synthesizes, evolves, and verifies interactive reasoning datasets across three dimensions: diversity, quality, and difficulty; and (2) a Visual Progressive Training Curriculum that first aligns perception via point-level supervision, then integrates interactive reasoning through a two-stage reinforcement learning framework. Furthermore, we introduce VTBench, an expert-verified benchmark targeting vision-centric interactive reasoning tasks. Extensive experiments demonstrate that V-Thinker consistently outperforms strong LMM-based baselines in both general and interactive reasoning scenarios, providing valuable insights for advancing image-interactive reasoning applications.
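
As a rough illustration of the Visual Progressive Training Curriculum described above, the following minimal Python sketch lays out a staged training driver: perception alignment via point-level supervision followed by a two-stage reinforcement-learning phase. All names and stage boundaries here (Stage, align_perception, run_curriculum, and so on) are hypothetical placeholders inferred from the abstract, not the authors' released code or API.

```python
# Minimal, hypothetical sketch of the staged training pipeline suggested by the
# abstract: perception alignment via point-level supervision, then a two-stage
# reinforcement-learning phase for interactive reasoning. Names and stage
# boundaries are placeholders, not V-Thinker's actual implementation.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str
    objective: str             # what the stage is assumed to optimize
    run: Callable[[], None]    # training routine for the stage


def align_perception() -> None:
    # Stage 1 (assumed): point-level supervision that grounds the model's
    # attention on fine-grained image regions before any RL is applied.
    print("point-level supervision: aligning perception")


def rl_interactive_tools() -> None:
    # RL stage 1 (assumed): learn basic image-interaction operations.
    print("RL stage 1: learning interactive visual operations")


def rl_end_to_end_reasoning() -> None:
    # RL stage 2 (assumed): couple image interaction with long-horizon reasoning.
    print("RL stage 2: end-to-end interactive reasoning")


CURRICULUM: List[Stage] = [
    Stage("perception-alignment", "point-level supervision", align_perception),
    Stage("rl-phase-1", "interactive tool use", rl_interactive_tools),
    Stage("rl-phase-2", "long-horizon interactive reasoning", rl_end_to_end_reasoning),
]


def run_curriculum(stages: List[Stage]) -> None:
    # Stages run strictly in order, mirroring the "progressive" design.
    for stage in stages:
        print(f"--- {stage.name}: {stage.objective} ---")
        stage.run()


if __name__ == "__main__":
    run_curriculum(CURRICULUM)
```

The point of the sketch is only the ordering constraint: perception is aligned before any reinforcement learning, and the RL phase itself is split into two stages, as stated in the abstract.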