Self-Evolving Visual Questioner

Abstract

Vision-language models (VLMs) are typically trained as passive answerers, while their ability to actively ask diverse, non-trivial, visual-centric, and grounded questions remains underexplored. The performance of existing visual questioners is bottlenecked by the availability of high-quality training data or by the cost of curating it. We show that a VLM can continuously improve itself as a visual questioner without any external supervision. We propose a self-evolving framework in which the VLM itself serves as both proposer and filter, producing harder, more informative, and more visual-centric questions while preserving exploration diversity to avoid training collapse. These questions are then used to train the VLM in both its questioner and answerer modes. To evaluate the questioner, we introduce an agentic protocol that assesses questions along perception, reasoning, and diversity dimensions. Experiments across various backbone VLMs show that our method substantially improves the quality of autonomously generated questions and considerably expands their difficulty boundary. Under the same budget, our self-supervision is more effective than training on the static source data. Moreover, the self-evolving questioner remains a competitive or even better answerer.
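As a rough illustration of the propose-then-filter idea summarized above, the following Python sketch shows one round of candidate generation, filtering, and de-duplication. It is only a toy under stated assumptions: the abstract does not specify how the proposer, the filter, or the diversity mechanism work, so `evolve_round`, `propose`, `keep`, `is_duplicate`, and `n_candidates` are all hypothetical names, and the callables stand in for calls to the same VLM.

```python
from typing import Callable, List, Tuple


def evolve_round(
    images: List[str],
    propose: Callable[[str, int], List[str]],   # hypothetical: VLM in proposer mode
    keep: Callable[[str, str], bool],           # hypothetical: VLM in filter mode
    is_duplicate: Callable[[str, str], bool],   # hypothetical: diversity check
    n_candidates: int = 4,
) -> List[Tuple[str, str]]:
    """Run one propose-then-filter round and return (image, question) pairs.

    The returned pairs are what a later training step would consume; the
    training step itself is outside the scope of this sketch.
    """
    selected: List[Tuple[str, str]] = []
    for image in images:
        kept_for_image: List[str] = []
        for question in propose(image, n_candidates):
            # Drop candidates the filter rejects (for example, trivial ones).
            if not keep(image, question):
                continue
            # Drop near-duplicates so the kept set for this image stays diverse.
            if any(is_duplicate(question, q) for q in kept_for_image):
                continue
            kept_for_image.append(question)
        selected.extend((image, q) for q in kept_for_image)
    return selected
```

In this sketch the three callables are injected rather than hard-coded, so the same loop can be exercised with simple stand-in functions before any model is attached.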