Forest Before Trees: Latent Superposition for Efficient Visual Reasoning
January 11, 2026
Authors: Yubo Wang, Juntian Zhang, Yichen Wu, Yankai Lin, Nils Lukas, Yuhan Liu
cs.AI
Abstract
While Chain-of-Thought empowers Large Vision-Language Models with multi-step reasoning, explicit textual rationales suffer from an information bandwidth bottleneck, where continuous visual details are discarded during discrete tokenization. Recent latent reasoning methods attempt to address this challenge, but often fall prey to premature semantic collapse due to rigid autoregressive objectives. In this paper, we propose Laser, a novel paradigm that reformulates visual deduction via Dynamic Windowed Alignment Learning (DWAL). Instead of forcing a point-wise prediction, Laser aligns the latent state with a dynamic validity window of future semantics. This mechanism enforces a "Forest-before-Trees" cognitive hierarchy, enabling the model to maintain a probabilistic superposition of global features before narrowing down to local details. Crucially, Laser maintains interpretability via decodable trajectories while stabilizing unconstrained learning via Self-Refined Superposition. Extensive experiments on 6 benchmarks demonstrate that Laser achieves state-of-the-art performance among latent reasoning methods, surpassing the strong baseline Monet by 5.03% on average. Notably, it achieves these gains with extreme efficiency, reducing inference tokens by more than 97%, while demonstrating robust generalization to out-of-distribution domains.
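The abstract does not give the exact form of the DWAL objective, so the following is only a minimal illustrative sketch of the contrast it describes: a rigid point-wise latent alignment versus alignment against a window of future semantic targets. The function names, the window size, the temperature tau, and the softmax-weighted aggregation are all assumptions for illustration, not Laser's published implementation.

# Illustrative sketch only; the paper's actual DWAL objective is not given in
# the abstract, so the aggregation, window size, and all names here are
# assumptions, not Laser's published implementation.
import torch
import torch.nn.functional as F

def pointwise_alignment_loss(latent, targets, t):
    # Rigid autoregressive target: the latent state at step t must match the
    # single next semantic embedding, which can force premature commitment.
    return 1.0 - F.cosine_similarity(latent, targets[:, t + 1], dim=-1).mean()

def windowed_alignment_loss(latent, targets, t, window=4, tau=0.1):
    # Hypothetical windowed variant: the latent state is rewarded for matching
    # any embedding inside a window of future semantics, so it can remain a
    # soft superposition ("forest") before committing to one step ("tree").
    future = targets[:, t + 1 : t + 1 + window]                      # (B, W, D)
    sims = F.cosine_similarity(latent.unsqueeze(1), future, dim=-1)  # (B, W)
    weights = F.softmax(sims / tau, dim=1)      # sharper tau -> closer to max
    return 1.0 - (weights * sims).sum(dim=1).mean()

if __name__ == "__main__":
    B, T, D = 2, 16, 64
    latent = torch.randn(B, D)        # latent reasoning state at step t
    targets = torch.randn(B, T, D)    # embeddings of future rationale steps
    print(pointwise_alignment_loss(latent, targets, t=3).item())
    print(windowed_alignment_loss(latent, targets, t=3).item())

In this sketch, setting window = 1 (or letting tau approach 0 with a single dominant match) recovers the point-wise objective as a special case, which is how a windowed target can relax, rather than replace, the standard autoregressive alignment.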