HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning
March 17, 2026
Authors: Shenzhi Wang, Shixuan Liu, Jing Zhou, Chang Gao, Xiong-Hui Chen, Binghai Wang, An Yang, Shiji Song, Bowen Yu, Gao Huang, Junyang Lin
cs.AI
Abstract
Vision-language models (VLMs) show strong multimodal capabilities, but they still struggle with fine-grained vision-language reasoning. We find that long chain-of-thought (CoT) reasoning exposes diverse failure modes, including perception, reasoning, knowledge, and hallucination errors, which can compound across intermediate steps. However, most existing vision-language data used for reinforcement learning with verifiable rewards (RLVR) does not involve complex reasoning chains that rely on visual evidence throughout, leaving these weaknesses largely unexposed. We therefore propose HopChain, a scalable framework for synthesizing multi-hop vision-language reasoning data specifically for RLVR training of VLMs. Each synthesized multi-hop query forms a logically dependent chain of instance-grounded hops, where earlier hops establish the instances, sets, or conditions needed for later hops, while the final answer remains a specific, unambiguous number suitable for verifiable rewards. We add the multi-hop data synthesized by HopChain to the original RLVR data used to train Qwen3.5-35B-A3B and Qwen3.5-397B-A17B, and compare against RLVR on the original data alone across 24 benchmarks spanning STEM and Puzzle, General VQA, Text Recognition and Document Understanding, and Video Understanding. Although the multi-hop data is not synthesized to target any specific benchmark, adding it improves results on 20 of the 24 benchmarks for both models, indicating broad and generalizable gains. To show that fully chained queries matter, we replace them with half-multi-hop or single-hop variants, which reduces the 24-benchmark average accuracy by 5.3 and 7.0 points, respectively. Multi-hop training also strengthens long-CoT vision-language reasoning, with gains peaking at more than 50 accuracy points in the ultra-long-CoT regime. These experiments establish HopChain as an effective, scalable framework for synthesizing multi-hop data that improves generalizable vision-language reasoning.
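To make the abstract's description of a chained query concrete, below is a minimal sketch in Python. The schema and names (`Hop`, `MultiHopQuery`, `verifiable_reward`, the example sub-questions) are illustrative assumptions, not the paper's actual data format; it only illustrates the stated properties that earlier hops feed later ones and that the final answer is a single number a rule-based verifier can score for RLVR.

```python
from dataclasses import dataclass, field

@dataclass
class Hop:
    """One hop of a chained query; earlier hops establish the
    instances, sets, or conditions that later hops refer to."""
    question: str                                          # sub-question grounded in visual evidence
    depends_on: list[int] = field(default_factory=list)    # indices of earlier hops this hop builds on

@dataclass
class MultiHopQuery:
    """A logically dependent chain of hops whose final answer is a
    single unambiguous number, so it can be scored by a verifier."""
    image_id: str
    hops: list[Hop]
    final_answer: float                                    # numeric gold answer for the verifiable reward

def verifiable_reward(prediction: str, query: MultiHopQuery, tol: float = 1e-6) -> float:
    """Binary reward: 1.0 if the predicted final number matches the gold answer."""
    try:
        return float(abs(float(prediction) - query.final_answer) <= tol)
    except ValueError:
        return 0.0

# Hypothetical three-hop example over one image.
query = MultiHopQuery(
    image_id="img_0001",
    hops=[
        Hop("Which shelf holds the red boxes?"),
        Hop("How many red boxes are on that shelf?", depends_on=[0]),
        Hop("Multiply that count by the price printed on the sign.", depends_on=[1]),
    ],
    final_answer=12.0,
)
print(verifiable_reward("12", query))  # 1.0
```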