HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning

March 17, 2026
作者: Shenzhi Wang, Shixuan Liu, Jing Zhou, Chang Gao, Xiong-Hui Chen, Binghai Wang, An Yang, Shiji Song, Bowen Yu, Gao Huang, Junyang Lin
cs.AI

Abstract

Vision-language models (VLMs) show strong multimodal capabilities, but they still struggle with fine-grained vision-language reasoning. We find that long chain-of-thought (CoT) reasoning exposes diverse failure modes, including perception, reasoning, knowledge, and hallucination errors, which can compound across intermediate steps. However, most existing vision-language data used for reinforcement learning with verifiable rewards (RLVR) does not involve complex reasoning chains that rely on visual evidence throughout, leaving these weaknesses largely unexposed. We therefore propose HopChain, a scalable framework for synthesizing multi-hop vision-language reasoning data specifically for RLVR training of VLMs. Each synthesized multi-hop query forms a logically dependent chain of instance-grounded hops, where earlier hops establish the instances, sets, or conditions needed for later hops, while the final answer remains a specific, unambiguous number suitable for verifiable rewards. We add the multi-hop data synthesized by HopChain to the original RLVR data used to train Qwen3.5-35B-A3B and Qwen3.5-397B-A17B, and compare against RLVR on the original data alone across 24 benchmarks spanning STEM and Puzzle, General VQA, Text Recognition and Document Understanding, and Video Understanding. Although this multi-hop data is not synthesized to target any specific benchmark, adding it improves 20 out of 24 benchmarks on both models, indicating broad and generalizable gains. To demonstrate the importance of full chained queries, we replace them with half-multi-hop or single-hop variants, which reduces the 24-benchmark average accuracy by 5.3 and 7.0 points, respectively. Multi-hop training also strengthens long-CoT vision-language reasoning, with gains peaking at more than 50 accuracy points in the ultra-long-CoT regime. These experiments establish HopChain as an effective, scalable framework for synthesizing multi-hop data that improves generalizable vision-language reasoning.
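To make the described setup concrete, below is a minimal sketch of how a HopChain-style multi-hop query and its verifiable numeric reward could be represented. All names here (Hop, MultiHopQuery, verifiable_reward) and the exact-match reward rule are illustrative assumptions for this sketch, not the paper's actual data format or reward implementation.

```python
# Illustrative sketch only: a chained multi-hop query whose earlier hops
# ground later hops, ending in a single number that a verifiable reward
# can check. Names and structure are hypothetical, not HopChain's API.
from dataclasses import dataclass
from typing import List
import re


@dataclass
class Hop:
    """One step in the chain; its result grounds the hops that depend on it."""
    question: str          # e.g. "Which players are wearing red jerseys?"
    depends_on: List[int]  # indices of earlier hops this hop builds on


@dataclass
class MultiHopQuery:
    """A logically dependent chain of instance-grounded hops."""
    image_id: str
    hops: List[Hop]
    final_answer: float    # specific, unambiguous number for the verifiable reward


def verifiable_reward(model_output: str, query: MultiHopQuery,
                      tol: float = 1e-6) -> float:
    """Binary reward: 1.0 if the last number in the model's output matches
    the ground-truth final answer within a tolerance, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    if not numbers:
        return 0.0
    return 1.0 if abs(float(numbers[-1]) - query.final_answer) <= tol else 0.0


if __name__ == "__main__":
    query = MultiHopQuery(
        image_id="example_001",
        hops=[
            Hop("Which players are wearing red jerseys?", depends_on=[]),
            Hop("Of those, how many are inside the penalty box?", depends_on=[0]),
        ],
        final_answer=3.0,
    )
    print(verifiable_reward("... so the final count is 3.", query))  # prints 1.0
```

The key property this sketch tries to capture is that each hop is unanswerable without the instances or sets produced by its predecessors, while the chain still terminates in one checkable number, which is what makes the data usable for RLVR-style training.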