

Truth in the Few: High-Value Data Selection for Efficient Multi-Modal Reasoning

June 5, 2025
作者: Shenshen Li, Kaiyuan Deng, Lei Wang, Hao Yang, Chong Peng, Peng Yan, Fumin Shen, Heng Tao Shen, Xing Xu
cs.AI

Abstract

While multi-modal large language models (MLLMs) have made significant progress in complex reasoning tasks via reinforcement learning, it is commonly believed that extensive training data is necessary for improving multi-modal reasoning ability, which inevitably leads to data redundancy and substantial computational costs. However, can smaller, high-value datasets match or outperform full corpora for multi-modal reasoning in MLLMs? In this work, we challenge this assumption through a key observation: meaningful multi-modal reasoning is triggered by only a sparse subset of training samples, termed cognitive samples, whereas the majority contribute only marginally. Building on this insight, we propose a novel data selection paradigm, Reasoning Activation Potential (RAP), which identifies cognitive samples by estimating each sample's potential to stimulate genuine multi-modal reasoning via two complementary estimators: 1) a Causal Discrepancy Estimator (CDE), based on the potential-outcome-model principle, which eliminates samples that over-rely on language priors by comparing outputs under multi-modal versus text-only inputs; and 2) an Attention Confidence Estimator (ACE), which exploits token-level self-attention to discard samples dominated by irrelevant but over-emphasized tokens at intermediate reasoning stages. Moreover, we introduce a Difficulty-aware Replacement Module (DRM) that substitutes trivial instances with cognitively challenging ones, thereby preserving the complexity needed for robust multi-modal reasoning. Experiments on six datasets show that our RAP method consistently achieves superior performance while using only 9.3% of the training data and reducing computational costs by over 43%. Our code is available at https://github.com/Leo-ssl/RAP.
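The selection scheme described in the abstract can be sketched in a few lines: score each sample by a causal discrepancy (how much the answer likelihood drops when the image is withheld) plus an attention-confidence term, then keep only the top-ranked fraction. This is a minimal illustrative sketch, not the authors' implementation; the scoring callables (`score_with_image`, `score_text_only`, `attn_confidence`) and the unweighted sum of the two scores are assumptions for illustration.

```python
# Hedged sketch of RAP-style cognitive-sample selection (illustrative only).
# Assumed interfaces: score_with_image / score_text_only return a per-sample
# answer log-likelihood; attn_confidence returns a score that is low when
# irrelevant tokens dominate intermediate self-attention.

def causal_discrepancy(sample, score_with_image, score_text_only):
    """CDE-style score: if the answer is nearly as likely without the image,
    the sample leans on language priors and should rank low."""
    return score_with_image(sample) - score_text_only(sample)

def rap_select(samples, score_with_image, score_text_only, attn_confidence,
               keep_ratio=0.093):
    """Rank samples by combined CDE and ACE scores and keep the top fraction
    (the paper reports training on only ~9.3% of the data)."""
    scored = []
    for s in samples:
        cde = causal_discrepancy(s, score_with_image, score_text_only)
        ace = attn_confidence(s)
        scored.append((cde + ace, s))
    scored.sort(key=lambda t: t[0], reverse=True)
    k = max(1, int(len(scored) * keep_ratio))
    return [s for _, s in scored[:k]]
```

In the paper the two estimators also feed a difficulty-aware replacement step (DRM) that swaps trivial instances for harder ones; that stage is omitted here for brevity.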

