VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images

April 10, 2026
Authors: Guanyu Zhou, Yida Yin, Wenhao Chai, Shengbang Tong, Xingyu Fu, Zhuang Liu
cs.AI

Abstract

Vision-language models (VLMs) still struggle with visual perception tasks such as spatial understanding and viewpoint recognition. One plausible contributing factor is that natural image datasets provide limited supervision for low-level visual skills. This motivates a practical question: can targeted synthetic supervision, generated from only a task keyword such as "Depth Order", address these weaknesses? To investigate, we introduce VisionFoundry, a task-aware synthetic data generation pipeline that takes only the task name as input. It uses large language models (LLMs) to generate questions, answers, and text-to-image (T2I) prompts, synthesizes the corresponding images with T2I models, and verifies image-answer consistency with a proprietary VLM, requiring no reference images or human annotation. Using VisionFoundry, we construct VisionFoundry-10K, a synthetic visual question answering (VQA) dataset of 10k image-question-answer triples spanning 10 tasks. Models trained on VisionFoundry-10K achieve substantial improvements on visual perception benchmarks (+7% on MMVP and +10% on CV-Bench-3D) while preserving broader capabilities and scaling favorably as data size increases. Our results suggest that a lack of task-targeted supervision is an important contributor to the current perception bottleneck, and that synthetic supervision is a promising path toward more systematic training for VLMs.
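
The abstract describes a three-stage loop: an LLM turns a bare task keyword into a question, its answer, and a matching T2I prompt; a T2I model renders the image; and a verifier VLM filters out inconsistent triples. The sketch below is a schematic reconstruction from the abstract only, not the authors' code: llm_generate, t2i_synthesize, and vlm_verify are hypothetical stand-ins for the LLM, the text-to-image model, and the proprietary verifier.

```python
# Schematic sketch of a VisionFoundry-style generation loop, reconstructed
# from the abstract alone. Every helper here is a hypothetical stand-in
# (not the authors' code): llm_generate mimics the LLM step, t2i_synthesize
# the text-to-image model, and vlm_verify the proprietary verifier VLM.
from dataclasses import dataclass
import random

@dataclass
class Triple:
    image: str      # stands in for a generated image (e.g., a file path)
    question: str
    answer: str

def llm_generate(task_name: str) -> tuple[str, str, str]:
    # Hypothetical LLM step: from only the task keyword, author a
    # question, its ground-truth answer, and a matching T2I prompt.
    question = f"Which object is closer to the camera? (task: {task_name})"
    answer = "the red cube"
    t2i_prompt = "a red cube in front of a blue sphere, photorealistic"
    return question, answer, t2i_prompt

def t2i_synthesize(prompt: str) -> str:
    # Hypothetical T2I step: render the scene the prompt describes.
    return f"image_for::{prompt}"

def vlm_verify(image: str, question: str, answer: str) -> bool:
    # Hypothetical verification step: a VLM checks that the image
    # actually supports the answer; here a coin flip mimics the filter.
    return random.random() > 0.2

def build_triples(task_name: str, n: int) -> list[Triple]:
    """Generate n verified image-question-answer triples for one task."""
    triples: list[Triple] = []
    while len(triples) < n:
        q, a, prompt = llm_generate(task_name)  # step 1: LLM authors Q/A + prompt
        img = t2i_synthesize(prompt)            # step 2: T2I renders the image
        if vlm_verify(img, q, a):               # step 3: consistency filter
            triples.append(Triple(img, q, a))
    return triples

if __name__ == "__main__":
    dataset = build_triples("Depth Order", n=5)
    print(f"kept {len(dataset)} verified triples")
```

Run across 10 task keywords until 10k triples survive verification, a loop of this shape would yield a dataset like VisionFoundry-10K; the discard-and-retry filter is what removes image-answer mismatches without any human annotation.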