Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning
July 22, 2025
Authors: Ang Li, Charles Wang, Kaiyu Yue, Zikui Cai, Ollie Liu, Deqing Fu, Peng Guo, Wang Bill Zhu, Vatsal Sharan, Robin Jia, Willie Neiswanger, Furong Huang, Tom Goldstein, Micah Goldblum
cs.AI
Abstract
Humans often use visual aids, for example, diagrams or sketches, when solving
complex problems. Training multimodal models to do the same, known as Visual
Chain of Thought (Visual CoT), is challenging due to: (1) poor off-the-shelf
visual CoT performance, which hinders reinforcement learning, and (2) the lack
of high-quality visual CoT training data. We introduce Zebra-CoT, a
diverse large-scale dataset with 182,384 samples, containing logically coherent
interleaved text-image reasoning traces. We focus on four categories of tasks
where sketching or visual reasoning is especially natural, spanning scientific
questions such as geometry, physics, and algorithms; 2D visual reasoning tasks
like visual search and jigsaw puzzles; 3D reasoning tasks including 3D
multi-hop inference, embodied and robot planning; visual logic problems and
strategic games like chess. Fine-tuning the Anole-7B model on the Zebra-CoT
training corpus improves our test-set accuracy by +12% and yields up to a +13%
performance gain on standard VLM benchmark evaluations.
Fine-tuning Bagel-7B yields a model that generates high-quality interleaved
visual reasoning chains, underscoring Zebra-CoT's effectiveness for developing
multimodal reasoning abilities. We open-source our dataset and models to
support development and evaluation of visual CoT.
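Since the dataset and models are released openly, a minimal sketch of loading the data with the Hugging Face `datasets` library is shown below. The repository id and column names are illustrative assumptions, not confirmed by the abstract; consult the released dataset card for the actual schema.

```python
# A minimal sketch, assuming Zebra-CoT is published on the Hugging Face Hub.
# The repo id and field names below are hypothetical placeholders.
from datasets import load_dataset

# Hypothetical repo id for the open-sourced dataset.
ds = load_dataset("multimodal-reasoning-lab/Zebra-CoT", split="train")

# Each sample is expected to hold an interleaved trace: text reasoning steps
# paired with intermediate images (sketches, diagrams, board states, etc.).
sample = ds[0]
print(sample.keys())
```

Such interleaved traces could then be used directly as supervised fine-tuning targets for models like Anole-7B or Bagel-7B that emit both text and image tokens.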