Truth in the Few: High-Value Data Selection for Efficient Multi-Modal Reasoning

Abstract

Multi-modal large language models (MLLMs) have made significant progress on complex reasoning tasks through reinforcement learning. It is commonly believed, however, that improving multi-modal reasoning ability requires extensive training data, which inevitably leads to data redundancy and substantial computational cost. Can smaller, high-value datasets match or outperform full corpora for multi-modal reasoning in MLLMs? In this work, we challenge the prevailing assumption with a key observation: meaningful multi-modal reasoning is triggered by only a sparse subset of training samples, termed cognitive samples, whereas the majority contribute marginally. Building on this insight, we propose a novel data selection paradigm termed Reasoning Activation Potential (RAP), which identifies cognitive samples by estimating each sample's potential to stimulate genuine multi-modal reasoning using two complementary estimators: 1) the Causal Discrepancy Estimator (CDE), which follows the potential outcome model principle and eliminates samples that rely too heavily on language priors by comparing outputs between multi-modal and text-only inputs; and 2) the Attention Confidence Estimator (ACE), which exploits token-level self-attention to discard samples dominated by irrelevant but over-emphasized tokens in intermediate reasoning stages. Moreover, we introduce a Difficulty-aware Replacement Module (DRM) that substitutes trivial instances with cognitively challenging ones, thereby ensuring sufficient complexity for robust multi-modal reasoning. Experiments on six datasets show that RAP consistently achieves superior performance using only 9.3% of the training data, while reducing computational costs by over 43%. Our code is available at https://github.com/Leo-ssl/RAP.
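
As a rough illustration of the two-estimator filtering idea, the sketch below keeps a sample only if (a) its multi-modal and text-only answer distributions differ enough and (b) its attention is not dominated by irrelevant tokens. This is not the RAP implementation: the abstract does not specify the scoring formulas, so the total variation distance, the irrelevant-attention ratio, both thresholds, and every name here (`Sample`, `select_cognitive_samples`, and so on) are assumptions made purely for illustration. The DRM replacement step is omitted.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Sample:
    """Hypothetical per-sample statistics, assumed to be precomputed by a model."""
    sample_id: str
    p_multimodal: Sequence[float]  # answer distribution given image + text
    p_text_only: Sequence[float]   # answer distribution given text only
    attention: Sequence[float]     # attention weight per reasoning token
    relevant: Sequence[bool]       # assumed relevance mask per reasoning token


def total_variation(p: Sequence[float], q: Sequence[float]) -> float:
    """Total variation distance, used here as a stand-in discrepancy score."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def irrelevant_attention_mass(attention: Sequence[float],
                              relevant: Sequence[bool]) -> float:
    """Fraction of attention weight that falls on tokens marked irrelevant."""
    total = sum(attention)
    if total == 0:
        return 0.0
    off_target = sum(a for a, r in zip(attention, relevant) if not r)
    return off_target / total


def select_cognitive_samples(samples: List[Sample],
                             min_discrepancy: float = 0.3,
                             max_irrelevant: float = 0.5) -> List[Sample]:
    """Keep samples that pass both illustrative filters (thresholds are assumed)."""
    kept = []
    for s in samples:
        # CDE-style filter: a small gap suggests the answer comes from language priors.
        if total_variation(s.p_multimodal, s.p_text_only) < min_discrepancy:
            continue
        # ACE-style filter: drop samples whose attention sits mostly on irrelevant tokens.
        if irrelevant_attention_mass(s.attention, s.relevant) > max_irrelevant:
            continue
        kept.append(s)
    return kept
```

With these assumed thresholds, a sample whose text-only prediction nearly matches its multi-modal prediction is dropped by the first filter, and a sample that places most of its attention on masked-out tokens is dropped by the second.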
