
VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images

April 10, 2026
Authors: Guanyu Zhou, Yida Yin, Wenhao Chai, Shengbang Tong, Xingyu Fu, Zhuang Liu
cs.AI

Abstract

Vision-language models (VLMs) still struggle with visual perception tasks such as spatial understanding and viewpoint recognition. One plausible contributing factor is that natural image datasets provide limited supervision for low-level visual skills. This motivates a practical question: can targeted synthetic supervision, generated from only a task keyword such as Depth Order, address these weaknesses? To investigate this question, we introduce VisionFoundry, a task-aware synthetic data generation pipeline that takes only the task name as input and uses large language models (LLMs) to generate questions, answers, and text-to-image (T2I) prompts, then synthesizes images with T2I models and verifies consistency with a proprietary VLM, requiring no reference images or human annotation. Using VisionFoundry, we construct VisionFoundry-10K, a synthetic visual question answering (VQA) dataset containing 10k image-question-answer triples spanning 10 tasks. Models trained on VisionFoundry-10K achieve substantial improvements on visual perception benchmarks: +7% on MMVP and +10% on CV-Bench-3D, while preserving broader capabilities and showing favorable scaling behavior as data size increases. Our results suggest that limited task-targeted supervision is an important contributor to this bottleneck and that synthetic supervision is a promising path toward more systematic training for VLMs.
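
The generate-synthesize-verify flow described in the abstract can be sketched roughly as follows. This is only an illustrative outline under stated assumptions, not the authors' implementation: the names `Sample`, `Triple`, `build_dataset`, and the three callables are hypothetical, and the `fake_*` functions are toy stand-ins for the LLM, the T2I model, and the verifying VLM so that the sketch runs without any model access.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    """One LLM-proposed candidate (field names are illustrative assumptions)."""
    question: str
    answer: str
    t2i_prompt: str


@dataclass
class Triple:
    """One verified image-question-answer triple."""
    image: bytes
    question: str
    answer: str


def build_dataset(
    task_name: str,
    propose: Callable[[str, int], List[Sample]],  # stands in for the LLM step
    render: Callable[[str], bytes],               # stands in for the T2I model
    verify: Callable[[bytes, str, str], bool],    # stands in for the VLM check
    n_candidates: int,
) -> List[Triple]:
    """Turn a task name into verified triples; no reference images are used."""
    kept: List[Triple] = []
    for sample in propose(task_name, n_candidates):
        image = render(sample.t2i_prompt)
        # Keep a candidate only if the image is consistent with the Q/A pair.
        if verify(image, sample.question, sample.answer):
            kept.append(Triple(image, sample.question, sample.answer))
    return kept


# --- Toy stand-ins so the sketch runs end to end (all hypothetical) ---

def fake_propose(task: str, n: int) -> List[Sample]:
    samples = []
    for i in range(n):
        answer = "A" if i % 2 == 0 else "B"
        # Corrupt every third prompt so the verifier has something to reject.
        shown = answer if i % 3 != 2 else ("B" if answer == "A" else "A")
        samples.append(Sample(
            question=f"[{task}] Which object is closer to the camera, A or B?",
            answer=answer,
            t2i_prompt=f"object {shown} in front of the other object",
        ))
    return samples


def fake_render(prompt: str) -> bytes:
    # A real T2I model would return pixels; here the "image" is the prompt.
    return prompt.encode("utf-8")


def fake_verify(image: bytes, question: str, answer: str) -> bool:
    # A real VLM would inspect the image; here we just match the text.
    return f"object {answer} in front".encode("utf-8") in image


triples = build_dataset("Depth Order", fake_propose, fake_render, fake_verify, 6)
print(len(triples), "of 6 candidates kept")
```

Passing the three stages in as callables keeps the control flow separate from any particular model choice, which seems consistent with the abstract's claim that only the task name is required as input.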