Monet: Reasoning in Latent Visual Space Beyond Images and Language
November 26, 2025
Authors: Qixun Wang, Yang Shi, Yifei Wang, Yuanxing Zhang, Pengfei Wan, Kun Gai, Xianghua Ying, Yisen Wang
cs.AI
Abstract
"图像思维"已成为推进视觉推理的有效范式,它通过将视觉证据注入中间推理步骤,超越了纯文本的思维链模式。然而,现有方法在类人抽象视觉思维方面存在不足,其灵活性从根本上受限于外部工具。本研究提出Monet训练框架,使多模态大语言模型能够通过生成作为中间视觉思维的连续嵌入,直接在潜在视觉空间中进行推理。我们识别出训练MLLMs进行潜在视觉推理的两大核心挑战:潜在视觉对齐的高计算成本与对潜在嵌入监督不足,并通过三阶段基于蒸馏的监督微调流程予以解决。我们进一步揭示了GRPO在潜在推理应用中的局限:它主要增强文本推理而非潜在推理。为此提出VLPO(视觉潜在策略优化),这种强化学习方法将潜在嵌入显式纳入策略梯度更新。为支持SFT,我们构建了Monet-SFT-125K数据集——包含12.5万条真实场景、图表、OCR和几何推理链的高质量图文交错CoT数据集。我们的Monet-7B模型在真实场景感知与推理基准上实现持续提升,并在挑战性抽象视觉推理任务中展现出强大的分布外泛化能力。我们还实证分析了各训练组件的作用,并讨论了早期不成功的尝试,为视觉潜在推理的未来发展提供洞见。模型、数据及代码已开源:https://github.com/NOVAglow646/Monet。
English
"Thinking with images" has emerged as an effective paradigm for advancing visual reasoning, extending beyond text-only chains of thought by injecting visual evidence into intermediate reasoning steps. However, existing methods fall short of human-like abstract visual thinking, as their flexibility is fundamentally limited by external tools. In this work, we introduce Monet, a training framework that enables multimodal large language models (MLLMs) to reason directly within the latent visual space by generating continuous embeddings that function as intermediate visual thoughts. We identify two core challenges in training MLLMs for latent visual reasoning: high computational cost in latent-vision alignment and insufficient supervision over latent embeddings, and address them with a three-stage distillation-based supervised fine-tuning (SFT) pipeline. We further reveal a limitation of applying GRPO to latent reasoning: it primarily enhances text-based reasoning rather than latent reasoning. To overcome this, we propose VLPO (Visual-latent Policy Optimization), a reinforcement learning method that explicitly incorporates latent embeddings into policy gradient updates. To support SFT, we construct Monet-SFT-125K, a high-quality text-image interleaved CoT dataset containing 125K real-world, chart, OCR, and geometry CoTs. Our model, Monet-7B, shows consistent gains across real-world perception and reasoning benchmarks and exhibits strong out-of-distribution generalization on challenging abstract visual reasoning tasks. We also empirically analyze the role of each training component and discuss our early unsuccessful attempts, providing insights for future developments in visual latent reasoning. Our model, data, and code are available at https://github.com/NOVAglow646/Monet.