

What Matters in Data Curation for Multimodal Reasoning? Insights from the DCVLR Challenge

January 16, 2026
作者: Yosub Shin, Michael Buriek, Boris Sobolev, Pavel Bushuyeu, Vikas Kumar, Haoyang Xu, Samuel Watson, Igor Molybog
cs.AI

Abstract

We study data curation for multimodal reasoning through the NeurIPS 2025 Data Curation for Vision-Language Reasoning (DCVLR) challenge, which isolates dataset selection by fixing the model and training protocol. Using a compact curated dataset derived primarily from Walton Multimodal Cold Start, our submission placed first in the challenge. Through post-competition ablations, we show that difficulty-based example selection on an aligned base dataset is the dominant driver of performance gains. Increasing dataset size does not reliably improve mean accuracy under the fixed training recipe, but mainly reduces run-to-run variance, while commonly used diversity and synthetic augmentation heuristics provide no additional benefit and often degrade performance. These results characterize DCVLR as a saturation-regime evaluation and highlight the central role of alignment and difficulty in data-efficient multimodal reasoning.
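The abstract's central finding is that difficulty-based example selection on an aligned base dataset drives the performance gains. A minimal sketch of that idea, under the assumption that each example carries a per-example difficulty score (e.g. the fraction of sampled model responses that were wrong); the function and field names here are illustrative, not the authors' actual pipeline:

```python
import random

def select_by_difficulty(examples, n_select):
    """Keep the n_select hardest examples.

    Assumes each example is a dict with a "difficulty" score in [0, 1],
    e.g. the failure rate of the base model on that example. This is a
    hypothetical interface, not the DCVLR submission's code.
    """
    ranked = sorted(examples, key=lambda ex: ex["difficulty"], reverse=True)
    return ranked[:n_select]

# Toy corpus with synthetic difficulty scores for illustration.
random.seed(0)
corpus = [{"id": i, "difficulty": random.random()} for i in range(100)]
subset = select_by_difficulty(corpus, n_select=10)
```

The key design point the ablations suggest is that this ranking only helps when the base dataset is already aligned with the target task; difficulty filtering on a misaligned corpus would just surface noise.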