What Matters in Data Curation for Multimodal Reasoning? Insights from the DCVLR Challenge
January 16, 2026
Authors: Yosub Shin, Michael Buriek, Boris Sobolev, Pavel Bushuyeu, Vikas Kumar, Haoyang Xu, Samuel Watson, Igor Molybog
cs.AI
Abstract
We study data curation for multimodal reasoning through the NeurIPS 2025 Data Curation for Vision-Language Reasoning (DCVLR) challenge, which isolates dataset selection by fixing the model and training protocol. Using a compact curated dataset derived primarily from Walton Multimodal Cold Start, our submission placed first in the challenge. Through post-competition ablations, we show that difficulty-based example selection on an aligned base dataset is the dominant driver of performance gains. Increasing dataset size does not reliably improve mean accuracy under the fixed training recipe, but mainly reduces run-to-run variance, while commonly used diversity and synthetic augmentation heuristics provide no additional benefit and often degrade performance. These results characterize DCVLR as a saturation-regime evaluation and highlight the central role of alignment and difficulty in data-efficient multimodal reasoning.
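The abstract identifies difficulty-based example selection on an aligned base dataset as the dominant driver of gains. For concreteness, below is a minimal sketch of one common way such a filter is implemented, assuming difficulty is estimated as the base model's failure rate over a few sampled generations; the function names, sampling count, and failure-rate criterion are illustrative assumptions, not the authors' exact procedure.

```python
import random

# Hedged sketch of difficulty-based example selection.
# Assumption: "difficulty" is approximated by how often a scorer
# (e.g., the base model checked against the reference answer) fails.

def estimate_difficulty(example, score_fn, num_samples=4):
    """Return the fraction of sampled attempts that fail on this example."""
    failures = sum(1 for _ in range(num_samples) if not score_fn(example))
    return failures / num_samples

def select_by_difficulty(dataset, score_fn, keep_k):
    """Rank an aligned base dataset by estimated difficulty and keep the hardest keep_k examples."""
    scored = [(estimate_difficulty(ex, score_fn), ex) for ex in dataset]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ex for _, ex in scored[:keep_k]]

if __name__ == "__main__":
    # Toy stand-in: integer "examples" and a stochastic scorer that
    # fails more often as the example index grows.
    toy_dataset = list(range(100))
    toy_scorer = lambda ex: random.random() > (ex / 100)
    curated = select_by_difficulty(toy_dataset, toy_scorer, keep_k=10)
    print(curated)
```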