Learning Self-Correction in Vision-Language Models via Rollout Augmentation

Abstract

Self-correction is essential for solving complex reasoning problems in vision-language models (VLMs). However, existing reinforcement learning (RL) methods struggle to learn it: effective self-correction behaviors emerge only rarely, which makes the learning signal extremely sparse. To address this challenge, we propose correction-specific rollouts (Octopus), an RL rollout augmentation framework that synthesizes dense self-correction examples by recombining existing rollouts. This augmentation improves sample efficiency through rollout reuse and, at the same time, stabilizes RL optimization through balanced supervision. Furthermore, we introduce a response-masking strategy that decouples self-correction from direct reasoning, avoiding signal conflicts and enabling both behaviors to be learned effectively. Building on this, we introduce Octopus-8B, a reasoning VLM with controllable self-correction capability. Across 7 benchmarks, it achieves state-of-the-art (SoTA) performance among open-source VLMs, outperforming the best RLVR baseline by 1.0 points while requiring only 0.72× the training time per step.
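
To make the recombination idea concrete, the following is a minimal sketch and not the Octopus implementation. The abstract does not specify how rollouts are paired, how rewards are assigned, or which tokens the response mask covers, so everything here is an assumption made for illustration: the names Rollout, CorrectionExample, recombine and mask_first_attempt are hypothetical, every ordered pair of distinct rollouts becomes one two-attempt example, the reward depends only on the second attempt, and the mask simply excludes the first attempt from the loss.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import List, Tuple


@dataclass(frozen=True)
class Rollout:
    """One sampled response plus its verifier outcome (hypothetical structure)."""
    text: str
    correct: bool


@dataclass(frozen=True)
class CorrectionExample:
    """A synthetic two-attempt sample built from two existing rollouts."""
    first_attempt: str
    second_attempt: str
    reward: float
    # (train on first attempt?, train on second attempt?)
    loss_mask: Tuple[bool, bool]


def recombine(rollouts: List[Rollout],
              mask_first_attempt: bool = True) -> List[CorrectionExample]:
    """Pair every rollout with every other rollout as (first, second) attempt.

    Assumptions (not taken from the paper): the reward is 1.0 when the
    second attempt is correct and 0.0 otherwise, and masking removes only
    the first attempt from the training loss.
    """
    examples = []
    for first, second in permutations(rollouts, 2):
        examples.append(CorrectionExample(
            first_attempt=first.text,
            second_attempt=second.text,
            reward=1.0 if second.correct else 0.0,
            loss_mask=(not mask_first_attempt, True),
        ))
    return examples


# Three existing rollouts for one prompt, only one of them correct.
rollouts = [
    Rollout("answer: 12", correct=False),
    Rollout("answer: 15", correct=True),
    Rollout("answer: 9", correct=False),
]
examples = recombine(rollouts)
print(len(examples), sum(e.reward for e in examples))
```

Under these assumptions, n rollouts yield n(n-1) two-attempt examples without any additional sampling, which is one way rollout reuse could make the self-correction signal denser. Excluding the first attempt from the loss keeps its tokens from being reinforced or penalized by the reward of the corrected answer.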
