

Monet: Reasoning in Latent Visual Space Beyond Images and Language

November 26, 2025
Authors: Qixun Wang, Yang Shi, Yifei Wang, Yuanxing Zhang, Pengfei Wan, Kun Gai, Xianghua Ying, Yisen Wang
cs.AI

Abstract

"Thinking with images" has emerged as an effective paradigm for advancing visual reasoning, extending beyond text-only chains of thought by injecting visual evidence into intermediate reasoning steps. However, existing methods fall short of human-like abstract visual thinking, as their flexibility is fundamentally limited by external tools. In this work, we introduce Monet, a training framework that enables multimodal large language models (MLLMs) to reason directly within the latent visual space by generating continuous embeddings that function as intermediate visual thoughts. We identify two core challenges in training MLLMs for latent visual reasoning: high computational cost in latent-vision alignment and insufficient supervision over latent embeddings, and address them with a three-stage distillation-based supervised fine-tuning (SFT) pipeline. We further reveal a limitation of applying GRPO to latent reasoning: it primarily enhances text-based reasoning rather than latent reasoning. To overcome this, we propose VLPO (Visual-latent Policy Optimization), a reinforcement learning method that explicitly incorporates latent embeddings into policy gradient updates. To support SFT, we construct Monet-SFT-125K, a high-quality text-image interleaved CoT dataset containing 125K real-world, chart, OCR, and geometry CoTs. Our model, Monet-7B, shows consistent gains across real-world perception and reasoning benchmarks and exhibits strong out-of-distribution generalization on challenging abstract visual reasoning tasks. We also empirically analyze the role of each training component and discuss our early unsuccessful attempts, providing insights for future developments in visual latent reasoning. Our model, data, and code are available at https://github.com/NOVAglow646/Monet.