VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images

Abstract

Vision-language models (VLMs) still struggle with visual perception tasks such as spatial understanding and viewpoint recognition. One plausible contributing factor is that natural image datasets provide limited supervision for low-level visual skills. This motivates a practical question: can targeted synthetic supervision, generated from only a task keyword such as "Depth Order", address these weaknesses? To investigate, we introduce VisionFoundry, a task-aware synthetic data generation pipeline that takes only the task name as input. It uses large language models (LLMs) to generate questions, answers, and text-to-image (T2I) prompts, synthesizes images with T2I models, and verifies consistency with a proprietary VLM, requiring no reference images or human annotation. Using VisionFoundry, we construct VisionFoundry-10K, a synthetic visual question answering (VQA) dataset containing 10k image-question-answer triples spanning 10 tasks. Models trained on VisionFoundry-10K achieve substantial improvements on visual perception benchmarks (+7% on MMVP and +10% on CV-Bench-3D) while preserving broader capabilities and showing favorable scaling behavior as data size increases. Our results suggest that limited task-targeted supervision is an important contributor to this bottleneck and that synthetic supervision is a promising path toward more systematic training for VLMs.
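The abstract describes the pipeline only at a high level: a task name goes in, an LLM proposes a question, an answer, and a T2I prompt, a T2I model renders the image, and a VLM checks consistency. The following is a minimal sketch of that control flow, not the authors' implementation. The names `Sample`, `build_dataset`, `propose`, `render`, and `is_consistent` are hypothetical, and the toy stubs merely stand in for real model calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Sample:
    """One candidate VQA item. Field names are illustrative assumptions."""
    question: str
    answer: str
    t2i_prompt: str
    image: Optional[bytes] = None


def build_dataset(
    task_name: str,
    n_candidates: int,
    propose: Callable[[str], Sample],         # stands in for the LLM step
    render: Callable[[str], bytes],           # stands in for the T2I model
    is_consistent: Callable[[Sample], bool],  # stands in for the VLM verifier
) -> List[Sample]:
    """Generate candidates from a task name and keep only verified ones."""
    kept: List[Sample] = []
    for _ in range(n_candidates):
        sample = propose(task_name)
        sample.image = render(sample.t2i_prompt)
        if is_consistent(sample):
            kept.append(sample)
    return kept


# Toy stand-ins so the sketch runs without any model access (assumptions).
def toy_propose(task_name: str) -> Sample:
    return Sample(
        question=f"[{task_name}] Which object is closer to the camera?",
        answer="the red cube",
        t2i_prompt="a red cube in front of a blue sphere",
    )


def toy_render(prompt: str) -> bytes:
    # A real pipeline would return image data; here we just encode the prompt.
    return prompt.encode("utf-8")


def toy_verify(sample: Sample) -> bool:
    # A real verifier would query a VLM; here any non-empty image passes.
    return bool(sample.image)


dataset = build_dataset("Depth Order", 3, toy_propose, toy_render, toy_verify)
```

Passing the three stages as callables keeps the loop independent of any particular model API, so a stricter verifier or a different image generator can be swapped in without touching the control flow.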
