Learning Adaptive Reasoning Paths for Efficient Visual Reasoning

Abstract

Visual reasoning models (VRMs) have recently shown strong cross-modal reasoning capabilities by integrating visual perception with language reasoning. However, they often overthink, producing unnecessarily long reasoning chains regardless of the task. We attribute this issue to Reasoning Path Redundancy in visual reasoning: many visual questions do not require the full reasoning process. To address this, we propose AVR, an adaptive visual reasoning framework that decomposes visual reasoning into three cognitive functions: visual perception, logical reasoning, and answer application. It further enables models to choose dynamically among three response formats: Full Format, Perception-Only Format, and Direct Answer. AVR is trained with FS-GRPO, an adaptation of Group Relative Policy Optimization that encourages the model to select the most efficient reasoning format while preserving correctness. Experiments on multiple vision-language benchmarks show that AVR reduces token usage by 50–90% while maintaining overall accuracy, especially on perception-intensive tasks. These results demonstrate that adaptive visual reasoning can effectively mitigate overthinking in VRMs. Code and data are available at: https://github.com/RunRiotComeOn/AVR.
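
To make the training idea concrete, the following is a minimal sketch of how a format-aware reward could feed into group-relative advantages. It is not the authors' FS-GRPO implementation: the abstract does not specify the reward shaping, so the bonus values, the format keys, and the function names `shaped_reward` and `group_relative_advantages` are all assumptions made for illustration. Only the within-group standardization of rewards reflects the general GRPO recipe.

```python
from statistics import mean, pstdev

# Hypothetical efficiency bonuses per response format. The abstract does not
# state the actual FS-GRPO reward shaping, so these values are assumptions.
FORMAT_BONUS = {"full": 0.0, "perception_only": 0.1, "direct_answer": 0.2}


def shaped_reward(is_correct: bool, response_format: str) -> float:
    """Reward correctness first; add a format bonus only for correct answers."""
    if not is_correct:
        return 0.0
    return 1.0 + FORMAT_BONUS[response_format]


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Standardize rewards within one sampled group of responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0.0:
        # All responses scored the same, so none is preferred over another.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


# One group of sampled responses to the same question: (correct?, format).
samples = [
    (True, "full"),
    (True, "direct_answer"),
    (False, "direct_answer"),
    (True, "perception_only"),
]
rewards = [shaped_reward(correct, fmt) for correct, fmt in samples]
advantages = group_relative_advantages(rewards)
```

Under these assumed values, a correct Direct Answer receives the largest advantage in the group and an incorrect one the smallest, so a shorter format is favored only when it does not cost correctness.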
