VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search

Abstract

Vision-Language Models (VLMs) have made significant progress on many perception-focused tasks; however, their progress on reasoning-focused tasks appears limited by the lack of high-quality, diverse training data. In this work, we aim to address the scarcity of reasoning-focused multimodal datasets. We propose VisualWebInstruct, a novel approach that leverages search engines to create a diverse, high-quality dataset spanning multiple disciplines such as math, physics, finance, and chemistry. Starting with 30,000 meticulously selected seed images, we employ Google Image search to identify websites containing similar images. We collect and process the HTML from over 700K unique URL sources. Through a pipeline of content extraction, filtering, and synthesis, we build a dataset of approximately 900K question-answer pairs, of which 40% are visual QA pairs and the rest are text QA pairs. Models fine-tuned on VisualWebInstruct demonstrate significant performance gains: (1) training from Llava-OV-mid yields absolute gains of 10-20 percentage points across benchmarks, and (2) training from MAmmoTH-VL yields an absolute gain of 5 percentage points. Our best model, MAmmoTH-VL2, shows state-of-the-art performance within the 10B parameter class on MMMU-Pro-std (40.7%), MathVerse (42.6%), and DynaMath (55.7%). These results highlight the effectiveness of our dataset in enhancing VLMs' reasoning capabilities for complex multimodal tasks.
