HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning

Abstract

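Vision-language models (VLMs) show strong multimodal capabilities, but they still struggle with fine-grained vision-language reasoning. We find that long chain-of-thought (CoT) reasoning exposes diverse failure modes, including perception, reasoning, knowledge, and hallucination errors, which can compound across intermediate steps. However, most existing vision-language data used for reinforcement learning with verifiable rewards (RLVR) does not involve complex reasoning chains that rely on visual evidence throughout, so these weaknesses remain largely unexposed. We therefore propose HopChain, a scalable framework for synthesizing multi-hop vision-language reasoning data specifically for RLVR training of VLMs. Each synthesized multi-hop query forms a logically dependent chain of instance-grounded hops: earlier hops establish the instances, sets, or conditions needed for later hops, while the final answer remains a specific, unambiguous number suitable for verifiable rewards. We add the multi-hop data synthesized by HopChain to the original RLVR data used to train Qwen3.5-35B-A3B and Qwen3.5-397B-A17B, and compare against RLVR on the original data alone across 24 benchmarks spanning STEM and Puzzle, General VQA, Text Recognition and Document Understanding, and Video Understanding. Although this multi-hop data is not synthesized to target any specific benchmark, adding it improves 20 of the 24 benchmarks on both models, indicating broad and generalizable gains. To show that full chained queries matter, we replace them with half-multi-hop or single-hop variants, which reduces the 24-benchmark average accuracy by 5.3 and 7.0 points, respectively. Multi-hop training also strengthens long-CoT vision-language reasoning, with gains peaking at more than 50 accuracy points in the ultra-long-CoT regime. These experiments establish HopChain as an effective, scalable framework for synthesizing multi-hop data that improves generalizable vision-language reasoning.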
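As a rough illustration of the query structure described in the abstract, the sketch below models a multi-hop query as a chain of hops in which each later hop depends on an earlier one, and scores a response by exact match on the final number. It is not HopChain's implementation: the names Hop, MultiHopQuery, is_chained, and numeric_reward, the example questions, and the "last integer in the response" parsing rule are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Hop:
    """One step of a query (hypothetical structure, not from the paper)."""
    question: str
    depends_on: list = field(default_factory=list)  # indices of earlier hops


@dataclass
class MultiHopQuery:
    hops: list
    answer: int  # the final answer is a single unambiguous number


def is_chained(query):
    """True if every hop after the first depends only on earlier hops,
    and on at least one of them."""
    return all(
        bool(hop.depends_on) and all(0 <= d < i for d in hop.depends_on)
        for i, hop in enumerate(query.hops)
        if i > 0
    )


def numeric_reward(response, answer):
    """Assumed reward rule: 1.0 if the last integer in the response
    equals the reference answer, otherwise 0.0."""
    numbers = re.findall(r"-?\d+", response)
    return 1.0 if numbers and int(numbers[-1]) == answer else 0.0


# Illustrative query: each hop narrows the set established by the previous one.
query = MultiHopQuery(
    hops=[
        Hop("Which mugs are on the top shelf?"),
        Hop("Of those mugs, which are red?", depends_on=[0]),
        Hop("How many handles do the red mugs have in total?", depends_on=[1]),
    ],
    answer=3,
)

chained = is_chained(query)
reward_correct = numeric_reward("Two mugs are red, so the total is 3", query.answer)
reward_wrong = numeric_reward("I count 4 handles", query.answer)
```

Because the reference answer is a single integer, the reward can be checked mechanically without a learned judge, which is the property the abstract points to when it calls the final answer suitable for verifiable rewards.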
