X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains

Abstract

Recent proprietary models (e.g., o3) have begun to demonstrate strong multimodal reasoning capabilities. Yet most existing open-source research concentrates on training text-only reasoning models, with evaluations limited mainly to mathematical and general-domain tasks. It therefore remains unclear how to effectively extend reasoning capabilities beyond text input and general domains. This paper explores a fundamental research question: Is reasoning generalizable across modalities and domains? Our findings support an affirmative answer: general-domain text-based post-training can enable such strong, generalizable reasoning. Leveraging this finding, we introduce X-Reasoner, a vision-language model post-trained solely on general-domain text for generalizable reasoning, using a two-stage approach: an initial supervised fine-tuning phase with distilled long chains of thought, followed by reinforcement learning with verifiable rewards. Experiments show that X-Reasoner successfully transfers reasoning capabilities to both multimodal and out-of-domain settings, outperforming existing state-of-the-art models trained with in-domain and multimodal data across various general and medical benchmarks (Figure 1). Additionally, we find that X-Reasoner's performance in specialized domains can be further enhanced through continued training on domain-specific text-only data. Building upon this, we introduce X-Reasoner-Med, a medical-specialized variant that achieves a new state of the art on numerous text-only and multimodal medical benchmarks.
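
To make the phrase "verifiable rewards" concrete, the following is a minimal sketch of what such a reward might look like for text-only questions with a single checkable answer. It is an illustration under stated assumptions, not the reward used to train X-Reasoner: the "Answer: X" response format, the function names `extract_answer` and `verifiable_reward`, and the binary 0/1 scoring are all hypothetical choices that the abstract does not specify.

```python
import re
from typing import Optional

# Assumption: the model ends its reasoning with a line such as "Answer: B".
# This format is hypothetical and chosen only for illustration.
ANSWER_PATTERN = re.compile(r"answer\s*:\s*\(?([A-Za-z0-9]+)\)?", re.IGNORECASE)


def extract_answer(response: str) -> Optional[str]:
    """Return the last 'Answer: X' value in the response, uppercased, or None."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1].upper() if matches else None


def verifiable_reward(response: str, gold: str) -> float:
    """Return 1.0 if the extracted answer matches the gold label, else 0.0.

    The reward is 'verifiable' in the sense that it is computed by a
    deterministic check against a known answer, not by a learned judge.
    """
    predicted = extract_answer(response)
    if predicted is None:
        return 0.0
    return 1.0 if predicted == gold.strip().upper() else 0.0


if __name__ == "__main__":
    sample = "The rash and fever suggest option B.\nAnswer: B"
    print(verifiable_reward(sample, "B"))
```

In a reinforcement learning loop, a score like this would be computed for each sampled response and passed to the policy update. Taking the last match means a response that revises its answer partway through is scored on its final commitment.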