Perceptual Flow Network for Visually Grounded Reasoning

Abstract

Despite the success of Large Vision-Language Models (LVLMs), general optimization objectives (e.g., standard MLE) fail to constrain visual trajectories, which leads to language bias and hallucination. To mitigate this, current methods introduce geometric priors from visual experts as additional supervision. However, we observe that such supervision is typically suboptimal: it is biased toward geometric precision and offers limited utility for reasoning. To bridge this gap, we propose the Perceptual Flow Network (PFlowNet), which eschews rigid alignment with expert priors and achieves visual reasoning that is both interpretable and more effective. Specifically, PFlowNet decouples perception from reasoning to establish a self-conditioned generation process. Building on this, it integrates multi-dimensional rewards with vicinal geometric shaping via variational reinforcement learning, thereby encouraging reasoning-oriented perceptual behaviors while preserving visual reliability. PFlowNet comes with a provable performance guarantee and delivers competitive empirical results, in particular setting new state-of-the-art records on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).
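
As a rough illustration of what decoupling perception from reasoning can mean, one common way to formalize a perceive-then-reason pipeline is a latent-variable model. The notation below is assumed for illustration and is not taken from the paper: $x$ is the image-question input, $z$ a perceptual trajectory, $y$ the answer, $\pi_\theta$ the model, and $q_\phi$ an auxiliary distribution over $z$. The bound shown is the standard evidence lower bound, which may differ from the objective PFlowNet actually optimizes.

```latex
% Standard MLE: only the answer is supervised,
% so the intermediate perception z is left unconstrained.
\mathcal{L}_{\mathrm{MLE}}(\theta) = -\log \pi_\theta(y \mid x)

% Decoupled view: perceive first (z), then reason conditioned on z.
\pi_\theta(y \mid x) = \sum_{z} \pi_\theta(z \mid x)\, \pi_\theta(y \mid x, z)

% Generic variational lower bound, valid for any q_\phi over z.
\log \pi_\theta(y \mid x) \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x, y)}\big[\log \pi_\theta(y \mid x, z)\big]
  - \mathrm{KL}\big(q_\phi(z \mid x, y) \,\|\, \pi_\theta(z \mid x)\big)
```

Under this reading, rewards and geometric shaping would act on how $z$ is sampled, while the answer likelihood ties perception to its usefulness for reasoning.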
