Latent Chain-of-Thought for Visual Reasoning

Abstract

Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs). However, existing training algorithms such as SFT, PPO, and GRPO often generalize poorly to unseen reasoning tasks and depend heavily on a biased reward model. To address this challenge, we reformulate reasoning in LVLMs as a posterior inference problem and propose a scalable training algorithm based on amortized variational inference. Building on diversity-seeking reinforcement learning algorithms, we introduce a novel sparse reward function that provides token-level learning signals and encourages diverse, high-likelihood latent CoTs, which overcomes the limitations of deterministic sampling and avoids reward hacking. In addition, we implement a Bayesian inference-scaling strategy that replaces the costly Best-of-N and Beam Search procedures with a marginal likelihood, which is used to rank rationales and answers efficiently. Empirically, the proposed method improves state-of-the-art LVLMs on seven reasoning benchmarks in terms of effectiveness, generalization, and interpretability.
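The abstract does not spell out how the marginal likelihood is computed, so the following is only a minimal sketch of one common reading: the marginal likelihood of an answer is estimated by averaging its likelihood over several sampled rationales, and candidate answers are ranked by that estimate. The function names (`log_mean_exp`, `rank_answers_by_marginal`), the input layout, and the toy numbers are assumptions made for illustration; they are not taken from the paper.

```python
import math


def log_mean_exp(log_values):
    """Numerically stable log of the mean of exp(log_values)."""
    m = max(log_values)
    total = sum(math.exp(v - m) for v in log_values)
    return m + math.log(total / len(log_values))


def rank_answers_by_marginal(answer_logps):
    """Rank candidate answers by an estimated log marginal likelihood.

    answer_logps maps each candidate answer to a list of
    log p(answer | input, rationale_i), one entry per sampled rationale
    (hypothetical layout, assumed for this sketch).
    """
    scores = {a: log_mean_exp(lps) for a, lps in answer_logps.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy example: answer "A" is very likely under one rationale only,
# while answer "B" is moderately likely under every rationale.
toy = {
    "A": [math.log(0.90), math.log(0.05), math.log(0.05)],
    "B": [math.log(0.50), math.log(0.50), math.log(0.50)],
}
ranked = rank_answers_by_marginal(toy)
print(ranked[0][0])
```

In this toy case, picking the single highest-scoring sample would favor "A", whereas averaging over rationales favors "B", which is supported consistently across samples.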
