

Truth in the Few: High-Value Data Selection for Efficient Multi-Modal Reasoning

June 5, 2025
作者: Shenshen Li, Kaiyuan Deng, Lei Wang, Hao Yang, Chong Peng, Peng Yan, Fumin Shen, Heng Tao Shen, Xing Xu
cs.AI

Abstract

While multi-modal large language models (MLLMs) have made significant progress in complex reasoning tasks via reinforcement learning, it is commonly believed that extensive training data is necessary for improving multi-modal reasoning ability, which inevitably leads to data redundancy and substantial computational costs. However, can smaller, high-value datasets match or outperform full corpora for multi-modal reasoning in MLLMs? In this work, we challenge this assumption through a key observation: meaningful multi-modal reasoning is triggered by only a sparse subset of training samples, termed cognitive samples, whereas the majority contribute only marginally. Building on this insight, we propose a novel data selection paradigm, Reasoning Activation Potential (RAP), which identifies cognitive samples by estimating each sample's potential to stimulate genuine multi-modal reasoning via two complementary estimators: 1) a Causal Discrepancy Estimator (CDE), based on the potential-outcome-model principle, which eliminates samples that over-rely on language priors by comparing outputs under multi-modal versus text-only inputs; and 2) an Attention Confidence Estimator (ACE), which exploits token-level self-attention to discard samples dominated by irrelevant but over-emphasized tokens at intermediate reasoning stages. Moreover, we introduce a Difficulty-aware Replacement Module (DRM) that substitutes trivial instances with cognitively challenging ones, thereby preserving the complexity needed for robust multi-modal reasoning. Experiments on six datasets show that our RAP method consistently achieves superior performance while using only 9.3% of the training data and reducing computational costs by over 43%. Our code is available at https://github.com/Leo-ssl/RAP.
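The selection scheme described in the abstract can be sketched in a few lines: score each sample by a causal discrepancy (how much the answer likelihood drops when the image is withheld) plus an attention-confidence term, then keep only the top-ranked fraction. This is a minimal illustrative sketch, not the authors' implementation; the scoring callables (`score_with_image`, `score_text_only`, `attn_confidence`) and the unweighted sum of the two scores are assumptions for illustration.

```python
# Hedged sketch of RAP-style cognitive-sample selection (illustrative only).
# Assumed interfaces: score_with_image / score_text_only return a per-sample
# answer log-likelihood; attn_confidence returns a score that is low when
# irrelevant tokens dominate intermediate self-attention.

def causal_discrepancy(sample, score_with_image, score_text_only):
    """CDE-style score: if the answer is nearly as likely without the image,
    the sample leans on language priors and should rank low."""
    return score_with_image(sample) - score_text_only(sample)

def rap_select(samples, score_with_image, score_text_only, attn_confidence,
               keep_ratio=0.093):
    """Rank samples by combined CDE and ACE scores and keep the top fraction
    (the paper reports training on only ~9.3% of the data)."""
    scored = []
    for s in samples:
        cde = causal_discrepancy(s, score_with_image, score_text_only)
        ace = attn_confidence(s)
        scored.append((cde + ace, s))
    scored.sort(key=lambda t: t[0], reverse=True)
    k = max(1, int(len(scored) * keep_ratio))
    return [s for _, s in scored[:k]]
```

In the paper the two estimators also feed a difficulty-aware replacement step (DRM) that swaps trivial instances for harder ones; that stage is omitted here for brevity.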

