Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens
November 24, 2025
Authors: Yiming Qin, Bomin Wei, Jiaxin Ge, Konstantinos Kallidromitis, Stephanie Fu, Trevor Darrell, Xudong Wang
cs.AI
Abstract
Vision-Language Models (VLMs) excel at reasoning in linguistic space but struggle with perceptual understanding that requires dense visual perception, e.g., spatial reasoning and geometric awareness. This limitation stems from the fact that current VLMs have limited mechanisms to capture dense visual information across spatial dimensions. We introduce Chain-of-Visual-Thought (COVT), a framework that enables VLMs to reason not only in words but also through continuous visual tokens, compact latent representations that encode rich perceptual cues. Within a small budget of roughly 20 tokens, COVT distills knowledge from lightweight vision experts, capturing complementary properties such as 2D appearance, 3D geometry, spatial layout, and edge structure. During training, the VLM with COVT autoregressively predicts these visual tokens to reconstruct dense supervision signals (e.g., depth, segmentation, edges, and DINO features). At inference, the model reasons directly in the continuous visual token space, preserving efficiency while optionally decoding dense predictions for interpretability. Across more than ten diverse perception benchmarks, including CV-Bench, MMVP, RealWorldQA, MMStar, WorldMedQA, and HRBench, integrating COVT into strong VLMs such as Qwen2.5-VL and LLaVA consistently improves performance by 3% to 16%, demonstrating that compact continuous visual thinking enables more precise, grounded, and interpretable multimodal intelligence.
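The abstract describes the mechanism only at a high level: the VLM emits a small budget of continuous visual tokens (roughly 20), and lightweight decoders reconstruct dense supervision targets such as depth or edge maps from those tokens during training. The PyTorch sketch below is a minimal illustration of that idea under assumed interfaces; the module names (ContinuousVisualTokenHead, DenseDecoder), dimensions, and loss choices are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the paper's code): a head that turns VLM hidden states into
# ~20 continuous visual tokens, plus lightweight decoders that reconstruct dense
# supervision targets (e.g., depth, edges) from them. All names and sizes here
# are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContinuousVisualTokenHead(nn.Module):
    """Projects VLM hidden states into a compact set of continuous visual tokens."""

    def __init__(self, hidden_dim: int, token_dim: int, num_tokens: int = 20):
        super().__init__()
        # Learned queries attend over the backbone's hidden states to form tokens.
        self.queries = nn.Parameter(torch.randn(num_tokens, hidden_dim))
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(hidden_dim, token_dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) from the backbone VLM.
        batch = hidden_states.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        tokens, _ = self.attn(q, hidden_states, hidden_states)
        return self.proj(tokens)  # (batch, num_tokens, token_dim)


class DenseDecoder(nn.Module):
    """Decodes the visual tokens into a coarse dense map used only for supervision."""

    def __init__(self, token_dim: int, out_channels: int, out_size: int = 64):
        super().__init__()
        self.out_channels, self.out_size = out_channels, out_size
        self.mlp = nn.Sequential(
            nn.Linear(token_dim, token_dim),
            nn.GELU(),
            nn.Linear(token_dim, out_channels * out_size * out_size),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Pool the ~20 tokens, then expand to a dense prediction.
        dense = self.mlp(tokens.mean(dim=1))
        return dense.view(-1, self.out_channels, self.out_size, self.out_size)


if __name__ == "__main__":
    batch, seq_len, hidden_dim, token_dim = 2, 128, 512, 256
    hidden_states = torch.randn(batch, seq_len, hidden_dim)  # stand-in for VLM output

    token_head = ContinuousVisualTokenHead(hidden_dim, token_dim, num_tokens=20)
    depth_decoder = DenseDecoder(token_dim, out_channels=1)  # depth supervision head
    edge_decoder = DenseDecoder(token_dim, out_channels=1)   # edge supervision head

    visual_tokens = token_head(hidden_states)                # (2, 20, 256)

    # Dummy dense targets standing in for expert outputs (e.g., a depth estimator).
    depth_target = torch.rand(batch, 1, 64, 64)
    edge_target = torch.rand(batch, 1, 64, 64)

    loss = (F.mse_loss(depth_decoder(visual_tokens), depth_target)
            + F.mse_loss(edge_decoder(visual_tokens), edge_target))
    loss.backward()
    print("visual tokens:", visual_tokens.shape, "reconstruction loss:", loss.item())
```

At inference, such decoder heads could simply be skipped, so the model reasons in the compact token space and only optionally decodes dense maps for interpretability, as the abstract describes.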