Truth in the Few: High-Value Data Selection for Efficient Multi-Modal Reasoning

Abstract

Multi-modal large language models (MLLMs) have made significant progress on complex reasoning tasks via reinforcement learning, yet it is commonly believed that extensive training data is necessary to improve multi-modal reasoning ability, which inevitably leads to data redundancy and substantial computational cost. Can smaller, high-value datasets match or outperform full corpora for multi-modal reasoning in MLLMs? In this work, we challenge the large-data assumption through a key observation: meaningful multi-modal reasoning is triggered by only a sparse subset of training samples, termed cognitive samples, whereas the majority contribute marginally. Building on this insight, we propose a novel data selection paradigm termed Reasoning Activation Potential (RAP), which identifies cognitive samples by estimating each sample's potential to stimulate genuine multi-modal reasoning using two complementary estimators: 1) the Causal Discrepancy Estimator (CDE), based on the potential outcome model principle, eliminates samples that overly rely on language priors by comparing outputs between multi-modal and text-only inputs; 2) the Attention Confidence Estimator (ACE) exploits token-level self-attention to discard samples dominated by irrelevant but over-emphasized tokens in intermediate reasoning stages. Moreover, we introduce a Difficulty-aware Replacement Module (DRM) that substitutes trivial instances with cognitively challenging ones, thereby ensuring sufficient complexity for robust multi-modal reasoning. Experiments on six datasets show that RAP consistently achieves superior performance using only 9.3% of the training data, while reducing computational costs by over 43%. Our code is available at https://github.com/Leo-ssl/RAP.
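
The two-estimator filter described in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that): the abstract does not give the scoring formulas, so the score definitions, field names, and thresholds below are all hypothetical. The sketch assumes each sample already carries a probability of the correct answer with and without the image, plus a precomputed attention confidence score in [0, 1].

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    sample_id: str
    p_multimodal: float          # hypothetical: P(correct answer | image + text)
    p_text_only: float           # hypothetical: P(correct answer | text only)
    attention_confidence: float  # hypothetical: 1.0 = attention on relevant tokens


def causal_discrepancy(sample: Sample) -> float:
    """CDE-style score (assumed form): how much the image changes the output.

    A small gap suggests the sample is answerable from language priors alone.
    """
    return sample.p_multimodal - sample.p_text_only


def select_cognitive_samples(
    samples: List[Sample],
    min_discrepancy: float = 0.2,           # hypothetical threshold
    min_attention_confidence: float = 0.5,  # hypothetical threshold
) -> List[Sample]:
    """Keep samples that pass both the CDE-style and the ACE-style filter."""
    return [
        s
        for s in samples
        if causal_discrepancy(s) >= min_discrepancy
        and s.attention_confidence >= min_attention_confidence
    ]


samples = [
    Sample("a", 0.90, 0.85, 0.8),  # text alone nearly suffices: dropped
    Sample("b", 0.80, 0.30, 0.7),  # image matters, attention focused: kept
    Sample("c", 0.75, 0.20, 0.2),  # image matters, attention misplaced: dropped
]
selected = select_cognitive_samples(samples)
print([s.sample_id for s in selected])  # prints ['b']
```

The sketch applies the two filters as a plain conjunction and omits the Difficulty-aware Replacement Module, which would then swap trivial retained instances for harder ones.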
