HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning

Abstract

Vision-language models (VLMs) show strong multimodal capabilities, but they still struggle with fine-grained vision-language reasoning. We find that long chain-of-thought (CoT) reasoning exposes diverse failure modes, including perception, reasoning, knowledge, and hallucination errors, which can compound across intermediate steps. However, most existing vision-language data used for reinforcement learning with verifiable rewards (RLVR) does not involve complex reasoning chains that rely on visual evidence throughout, leaving these weaknesses largely unexposed. We therefore propose HopChain, a scalable framework for synthesizing multi-hop vision-language reasoning data specifically for RLVR training of VLMs. Each synthesized multi-hop query forms a logically dependent chain of instance-grounded hops: earlier hops establish the instances, sets, or conditions that later hops need, and the final answer is a specific, unambiguous number suitable for verifiable rewards. We add the multi-hop data synthesized by HopChain to the original RLVR data used to train Qwen3.5-35B-A3B and Qwen3.5-397B-A17B, and compare against training on the original RLVR data alone across 24 benchmarks spanning STEM and Puzzle, General VQA, Text Recognition and Document Understanding, and Video Understanding. Although this multi-hop data is not synthesized to target any specific benchmark, adding it improves 20 out of 24 benchmarks on both models, indicating broad and generalizable gains. To demonstrate that full chained queries are important, we replace them with half-multi-hop or single-hop variants, which reduces the 24-benchmark average accuracy by 5.3 and 7.0 points, respectively. Multi-hop training also strengthens long-CoT vision-language reasoning, with gains peaking at more than 50 accuracy points in the ultra-long-CoT regime. These experiments establish HopChain as an effective, scalable framework for synthesizing multi-hop data that improves generalizable vision-language reasoning.
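To make the query structure described above concrete, the following is a minimal sketch of how a chain of dependent hops with a numeric final answer might be represented and scored. It is an illustration under stated assumptions, not HopChain's actual data format or reward: the names Hop, MultiHopQuery, is_chained, extract_final_number, and numeric_reward are hypothetical, the example query is invented, and taking the last number in a response as the prediction is an assumed convention.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Hop:
    """One instance-grounded step (hypothetical representation)."""
    instruction: str
    depends_on: Tuple[int, ...] = ()  # indices of earlier hops this hop relies on


@dataclass
class MultiHopQuery:
    """A chain of hops whose final answer is a single number (hypothetical)."""
    hops: List[Hop]
    answer: float

    def is_chained(self) -> bool:
        """True if every hop after the first depends only on earlier hops,
        and on at least one of them."""
        for i, hop in enumerate(self.hops):
            if i > 0 and not hop.depends_on:
                return False  # an independent hop breaks the chain
            if any(d < 0 or d >= i for d in hop.depends_on):
                return False  # a dependency must point to an earlier hop
        return True


def extract_final_number(response: str) -> Optional[float]:
    """Return the last number in the response text, or None if there is none.
    Assumed convention: the model states its final answer last."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", response)
    return float(matches[-1]) if matches else None


def numeric_reward(response: str, gold: float) -> float:
    """Binary verifiable reward: 1.0 on an exact numeric match, else 0.0."""
    pred = extract_final_number(response)
    return 1.0 if pred is not None and pred == gold else 0.0


# Invented example: each hop narrows the set established by the previous one.
query = MultiHopQuery(
    hops=[
        Hop("Find all mugs on the top shelf."),
        Hop("Among those mugs, keep the ones whose handle faces left.", depends_on=(0,)),
        Hop("Count the remaining mugs.", depends_on=(1,)),
    ],
    answer=2,
)
```

Because the final answer is a single number, the reward can be checked mechanically without a learned judge, which is the property the abstract highlights as making such queries suitable for RLVR. A set of unrelated questions about the same image would fail the is_chained check in this sketch, since later hops would not depend on earlier ones.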
