Self-Evolving Visual Questioner

Abstract

Vision-language models (VLMs) are typically trained as passive answerers, while their ability to actively ask diverse, non-trivial, visual-centric, and grounded questions remains underexplored. The performance of existing visual questioners is bottlenecked by the availability of high-quality training data or by the cost of curating it. We show that a VLM can continuously improve itself as a visual questioner without any external supervision. We propose a self-evolving framework that uses the VLM itself as both a proposer and a filter to produce harder, more informative, and more visual-centric questions, while maintaining exploration diversity to avoid training collapse. These questions are then used to train the VLM in both questioner and answerer modes. To evaluate the questioner, we introduce an agentic protocol that assesses questions along perception, reasoning, and diversity dimensions. Experiments across various backbone VLMs show that our method substantially improves the quality of autonomous question generation and expands its difficulty boundary. Under the same budget, our self-supervision is more effective than training on the static source data. Moreover, the self-evolving questioner remains a competitive or even better answerer.
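
As a rough illustration of the propose-then-filter idea described above, the following sketch runs one selection round for a single image. It is a toy under stated assumptions, not the method used in this work: the names `select_questions`, `toy_propose`, and `toy_score`, the difficulty score in [0, 1], both thresholds, and the word-overlap diversity check are all hypothetical, and the two stand-in functions replace what would be two roles played by the same VLM.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Candidate:
    question: str
    difficulty: float  # hypothetical score in [0, 1] from the filter role


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the lowercase word sets of two questions."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def select_questions(
    image_id: str,
    propose: Callable[[str], List[str]],
    score: Callable[[str, str], float],
    min_difficulty: float = 0.5,  # assumed threshold, for illustration only
    max_overlap: float = 0.6,     # assumed threshold, for illustration only
) -> List[Candidate]:
    """Propose questions for one image, then keep hard, mutually distinct ones."""
    scored = [Candidate(q, score(image_id, q)) for q in propose(image_id)]
    kept: List[Candidate] = []
    # Visit harder questions first so a near-duplicate loses to its harder variant.
    for cand in sorted(scored, key=lambda c: c.difficulty, reverse=True):
        if cand.difficulty < min_difficulty:
            continue  # too easy to be informative
        if any(token_overlap(cand.question, k.question) > max_overlap for k in kept):
            continue  # too similar to a question already kept
        kept.append(cand)
    return kept


# Hypothetical stand-ins for the proposer and filter roles of one model.
TOY_SCORES = {
    "What color is the car?": 0.2,
    "How many people are left of the red car?": 0.8,
    "How many people are right of the red car?": 0.7,
    "Which object casts the longest shadow?": 0.9,
}


def toy_propose(image_id: str) -> List[str]:
    return list(TOY_SCORES)


def toy_score(image_id: str, question: str) -> float:
    return TOY_SCORES[question]


selected = select_questions("img-0", toy_propose, toy_score)
for cand in selected:
    print(f"{cand.difficulty:.1f}  {cand.question}")
```

In this toy run the easy color question is dropped by the difficulty threshold, and the "right of the red car" question is dropped as a near-duplicate of its harder "left of" counterpart. In a self-evolving setup, the surviving questions would then serve as training data for the next round; how that training step is carried out is outside the scope of this sketch.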