VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images

Abstract

Vision-language models (VLMs) still struggle with visual perception tasks such as spatial understanding and viewpoint recognition. One plausible contributing factor is that natural image datasets provide limited supervision for low-level visual skills. This motivates a practical question: can targeted synthetic supervision, generated from only a task keyword such as "Depth Order", address these weaknesses? To investigate this question, we introduce VisionFoundry, a task-aware synthetic data generation pipeline that takes only the task name as input. It uses large language models (LLMs) to generate questions, answers, and text-to-image (T2I) prompts, then synthesizes images with T2I models and verifies consistency with a proprietary VLM, requiring no reference images or human annotation. Using VisionFoundry, we construct VisionFoundry-10K, a synthetic visual question answering (VQA) dataset containing 10k image-question-answer triples spanning 10 tasks. Models trained on VisionFoundry-10K achieve substantial improvements on visual perception benchmarks: +7% on MMVP and +10% on CV-Bench-3D, while preserving broader capabilities and showing favorable scaling behavior as data size increases. Our results suggest that limited task-targeted supervision is an important contributor to this bottleneck and that synthetic supervision is a promising path toward more systematic training for VLMs.
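
The pipeline described in the abstract (task name in, verified image-question-answer triples out) can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: the names `Spec`, `Triple`, `build_dataset`, and the three callables `propose`, `synthesize`, and `verify` are hypothetical stand-ins for the LLM, the T2I model, and the verifying VLM. The stub functions below only simulate those models so the example runs without any external service.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Spec:
    """One LLM-proposed item: a question, its answer, and a T2I prompt."""
    question: str
    answer: str
    t2i_prompt: str


@dataclass
class Triple:
    """One verified image-question-answer training example."""
    image: bytes
    question: str
    answer: str


def build_dataset(
    task_name: str,
    propose: Callable[[str, int], List[Spec]],    # hypothetical LLM wrapper
    synthesize: Callable[[str], bytes],           # hypothetical T2I wrapper
    verify: Callable[[bytes, str, str], bool],    # hypothetical VLM checker
    n_specs: int,
) -> List[Triple]:
    """Generate specs from the task name alone, render them, keep verified ones."""
    kept: List[Triple] = []
    for spec in propose(task_name, n_specs):
        image = synthesize(spec.t2i_prompt)
        # Discard samples whose image is judged inconsistent with the answer.
        if verify(image, spec.question, spec.answer):
            kept.append(Triple(image, spec.question, spec.answer))
    return kept


# --- Stubs that simulate the three models (assumptions, for illustration) ---

def fake_propose(task: str, n: int) -> List[Spec]:
    return [
        Spec(
            question=f"[{task}] Which object is closer, A or B? (#{i})",
            answer="A",
            t2i_prompt=f"object A in front of object B, variant {i}",
        )
        for i in range(n)
    ]


def fake_synthesize(prompt: str) -> bytes:
    # A real T2I model would return image data; here the prompt is echoed.
    return prompt.encode("utf-8")


def fake_verify(image: bytes, question: str, answer: str) -> bool:
    # Simulated check: reject odd-numbered variants to exercise the filter.
    return image[-1:] not in (b"1", b"3", b"5", b"7", b"9")


demo = build_dataset("Depth Order", fake_propose, fake_synthesize, fake_verify, 4)
print(len(demo), "of 4 proposed samples kept")
```

Passing the three stages in as callables keeps the filtering loop independent of any particular model, which is one plausible way to swap LLM, T2I, or verifier backends without touching the loop itself.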
