Self-Evolving Visual Questioner

Abstract

Vision-language models (VLMs) are typically trained as passive answerers, while their ability to actively ask diverse, non-trivial, visual-centric, and grounded questions remains underexplored. The performance of existing visual questioners is bottlenecked by the availability of high-quality training data or the cost of curating it. We show that a VLM can continuously improve itself as a visual questioner without any external supervision. We propose a self-evolving framework that uses the VLM itself as both proposer and filter to produce harder, more informative, and more visual-centric questions, while maintaining exploration diversity to avoid training collapse. These questions are then used to train the VLM in both questioner and answerer modes. To evaluate the questioner, we introduce an agentic protocol that assesses questions along perception, reasoning, and diversity dimensions. Experiments across various backbone VLMs show that our method substantially improves question quality and expands the difficulty boundary of autonomous question generation. Under the same budget, our self-supervision is more effective than training on the static source data. Moreover, the self-evolving questioner remains a competitive or even better answerer.
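
The propose-then-filter loop described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`self_evolve_round`, `toy_propose`, `toy_score`), the score threshold, and the exact-match diversity check are all hypothetical stand-ins chosen so that the example runs without a model. In an actual system the proposer and the filter would both be calls to the same VLM, and the kept pairs would be used to fine-tune it before the next round.

```python
from typing import Callable, List, Tuple


def self_evolve_round(
    images: List[str],
    propose: Callable[[str, int], List[str]],
    score: Callable[[str, str], float],
    n_candidates: int = 4,
    min_score: float = 0.7,
) -> List[Tuple[str, str]]:
    """Run one hypothetical propose-then-filter round.

    Returns the (image, question) pairs that pass the filter. A real
    system would fine-tune the VLM on them before the next round.
    """
    kept: List[Tuple[str, str]] = []
    for image in images:
        seen = set()  # normalized questions already kept for this image
        for question in propose(image, n_candidates):
            key = " ".join(question.lower().split())
            if key in seen:
                continue  # crude diversity guard: drop repeated questions
            if score(image, question) < min_score:
                continue  # filter: drop questions judged too easy
            seen.add(key)
            kept.append((image, question))
    return kept


# Toy stand-ins (assumptions) so the sketch runs without a model.
def toy_propose(image: str, n: int) -> List[str]:
    pool = [
        "What is in the image?",
        "How many objects are left of the red cup?",
        "how many objects are left of the red cup?",
        "Which object casts the longest shadow, and why?",
    ]
    return pool[:n]


def toy_score(image: str, question: str) -> float:
    # Pretend that longer questions are harder; capped at 1.0.
    return min(len(question.split()) / 8.0, 1.0)


selected = self_evolve_round(["img_0.png"], toy_propose, toy_score)
for image, question in selected:
    print(image, "|", question)
```

With the toy proposer and scorer, the short generic question is filtered out and the case-variant duplicate is dropped, leaving two questions. The length-based score is only a placeholder for whatever difficulty and grounding signal the filtering VLM would provide.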