

Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning

July 22, 2025
作者: Ang Li, Charles Wang, Kaiyu Yue, Zikui Cai, Ollie Liu, Deqing Fu, Peng Guo, Wang Bill Zhu, Vatsal Sharan, Robin Jia, Willie Neiswanger, Furong Huang, Tom Goldstein, Micah Goldblum
cs.AI

Abstract

Humans often use visual aids, for example diagrams or sketches, when solving complex problems. Training multimodal models to do the same, known as Visual Chain of Thought (Visual CoT), is challenging due to: (1) poor off-the-shelf visual CoT performance, which hinders reinforcement learning, and (2) the lack of high-quality visual CoT training data. We introduce Zebra-CoT, a diverse large-scale dataset with 182,384 samples, containing logically coherent interleaved text-image reasoning traces. We focus on four categories of tasks where sketching or visual reasoning is especially natural, spanning scientific questions such as geometry, physics, and algorithms; 2D visual reasoning tasks like visual search and jigsaw puzzles; 3D reasoning tasks including 3D multi-hop inference, embodied and robot planning; visual logic problems and strategic games like chess. Fine-tuning the Anole-7B model on the Zebra-CoT training corpus results in an improvement of +12% in our test-set accuracy and yields up to +13% performance gain on standard VLM benchmark evaluations. Fine-tuning Bagel-7B yields a model that generates high-quality interleaved visual reasoning chains, underscoring Zebra-CoT's effectiveness for developing multimodal reasoning abilities. We open-source our dataset and models to support development and evaluation of visual CoT.
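Since the dataset is open-sourced, the sketch below illustrates how one might load it and inspect the interleaved text-image structure of a single reasoning trace. The Hugging Face repo id and field layout are assumptions for illustration, not details confirmed by the abstract.

```python
# Minimal sketch: loading Zebra-CoT and previewing one interleaved trace.
# Assumes a Hugging Face Hub release; the repo id and field names are
# hypothetical placeholders, not confirmed by the paper abstract.
from datasets import load_dataset

ds = load_dataset("multimodal-reasoning-lab/Zebra-CoT", split="train")  # hypothetical repo id
example = ds[0]

# An interleaved trace alternates text reasoning steps with images
# (diagrams, sketches, board states), so each field is previewed briefly
# to show where text ends and visual steps begin.
for key, value in example.items():
    preview = str(value)[:80]
    print(f"{key}: {preview}")
```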