Perceptual Flow Network for Visually Grounded Reasoning
May 4, 2026
Authors: Yangfu Li, Yuning Gong, Hongjian Zhan, Teng Li, Yuanhuiyi Lyu, Tianyi Chen, Qi Liu, Ziyuan Huang, Zhihang Zhong, Dandan Zheng, Yue Lu
cs.AI
Abstract
Despite the success of Large Vision-Language Models (LVLMs), general optimization objectives (e.g., standard MLE) fail to constrain visual trajectories, leading to language bias and hallucination. To mitigate this, current methods introduce geometric priors from visual experts as additional supervision. However, we observe that such supervision is typically suboptimal: it is biased toward geometric precision and offers limited reasoning utility. To bridge this gap, we propose the Perceptual Flow Network (PFlowNet), which eschews rigid alignment with expert priors and achieves interpretable yet more effective visual reasoning. Specifically, PFlowNet decouples perception from reasoning to establish a self-conditioned generation process. On this basis, it integrates multi-dimensional rewards with vicinal geometric shaping via variational reinforcement learning, thereby encouraging reasoning-oriented perceptual behaviors while preserving visual reliability. PFlowNet delivers a provable performance guarantee and competitive empirical results, setting new state-of-the-art records on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).
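The abstract gives no implementation details, but the idea of combining a weighted multi-dimensional task reward with a "vicinal" geometric shaping term (a soft reward that decays with distance from an expert box instead of demanding exact alignment) can be sketched as follows. All function names, the Gaussian kernel, and the weighting scheme here are illustrative assumptions, not the paper's actual formulation:

```python
import math

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def vicinal_shaping(pred_box, expert_box, sigma=0.2):
    """Soft geometric reward: a Gaussian of the center distance to the
    expert box, so nearby (vicinal) predictions still earn credit
    rather than being penalized for missing exact alignment."""
    center = lambda r: ((r[0] + r[2]) / 2.0, (r[1] + r[3]) / 2.0)
    (px, py), (ex, ey) = center(pred_box), center(expert_box)
    d2 = (px - ex) ** 2 + (py - ey) ** 2
    return math.exp(-d2 / (2.0 * sigma ** 2))

def total_reward(task_rewards, weights, pred_box, expert_box, lam=0.5):
    """Weighted sum of multi-dimensional task rewards (e.g., answer
    correctness, format, faithfulness) plus a shaped geometric term."""
    task = sum(w * r for w, r in zip(weights, task_rewards))
    return task + lam * vicinal_shaping(pred_box, expert_box)

# A perfectly aligned prediction receives the full shaping bonus:
box = (0.1, 0.1, 0.5, 0.5)
print(total_reward([1.0, 0.5], [0.6, 0.4], box, box))  # 0.8 + 0.5 = 1.3
```

The point of the shaping term is the contrast the abstract draws: instead of rigidly rewarding only geometric precision against the expert prior, the geometric signal degrades gracefully, leaving room for the policy to optimize reasoning-oriented rewards.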