Perceptual Flow Network for Visually Grounded Reasoning

Abstract

Despite the success of Large Vision-Language Models (LVLMs), general optimization objectives (e.g., standard maximum likelihood estimation, MLE) fail to constrain visual trajectories, leading to language bias and hallucination. To mitigate this, current methods introduce geometric priors from visual experts as additional supervision. However, we observe that such supervision is typically suboptimal: it is biased toward geometric precision and offers limited reasoning utility. To bridge this gap, we propose Perceptual Flow Network (PFlowNet), which eschews rigid alignment with expert priors and achieves interpretable yet more effective visual reasoning. Specifically, PFlowNet decouples perception from reasoning to establish a self-conditioned generation process. Building on this, it integrates multi-dimensional rewards with vicinal geometric shaping via variational reinforcement learning, thereby facilitating reasoning-oriented perceptual behaviors while preserving visual reliability. PFlowNet delivers a provable performance guarantee and competitive empirical results, notably setting new state-of-the-art (SOTA) records on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).
