

Latent Chain-of-Thought for Visual Reasoning

October 27, 2025
作者: Guohao Sun, Hang Hua, Jian Wang, Jiebo Luo, Sohail Dianat, Majid Rabbani, Raghuveer Rao, Zhiqiang Tao
cs.AI

Abstract

Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs). However, existing training algorithms such as SFT, PPO, and GRPO may not generalize well across unseen reasoning tasks and rely heavily on a biased reward model. To address this challenge, we reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference. By leveraging diversity-seeking reinforcement learning algorithms, we introduce a novel sparse reward function that provides token-level learning signals and encourages diverse, high-likelihood latent CoTs, overcoming the limitations of deterministic sampling and avoiding reward hacking. Additionally, we implement a Bayesian inference-scaling strategy that replaces costly Best-of-N and Beam Search with marginal-likelihood ranking to efficiently select optimal rationales and answers. We empirically demonstrate that the proposed method enhances state-of-the-art LVLMs on seven reasoning benchmarks in terms of effectiveness, generalization, and interpretability.
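
As a rough sketch of the posterior-inference framing described in the abstract (the notation and the specific estimator below are our own illustration of the standard latent-variable setup, not taken from the paper): let x denote the image-question input, z a latent chain-of-thought, and y the answer. Reasoning is treated as sampling from the posterior over rationales,

p(z \mid x, y) \;\propto\; p_\theta(y \mid x, z)\, p_\theta(z \mid x),

which amortized variational inference approximates with a learned sampler q_\phi(z \mid x). The Bayesian inference-scaling step can then be read as scoring candidate rationale-answer pairs by an importance-sampling estimate of the marginal likelihood,

p_\theta(y \mid x) \;\approx\; \frac{1}{N} \sum_{i=1}^{N} \frac{p_\theta(y \mid x, z_i)\, p_\theta(z_i \mid x)}{q_\phi(z_i \mid x)}, \qquad z_i \sim q_\phi(z \mid x),

so that answers are ranked by model evidence rather than by an external reward model, as Best-of-N would require.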