X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains

Abstract

Recent proprietary models (e.g., o3) have begun to demonstrate strong multimodal reasoning capabilities. Yet most existing open-source research concentrates on training text-only reasoning models, with evaluations limited mainly to mathematical and general-domain tasks. It therefore remains unclear how to effectively extend reasoning capabilities beyond text input and general domains. This paper explores a fundamental research question: Is reasoning generalizable across modalities and domains? Our findings support an affirmative answer: general-domain text-based post-training can enable such strong generalizable reasoning. Leveraging this finding, we introduce X-Reasoner, a vision-language model post-trained solely on general-domain text for generalizable reasoning, using a two-stage approach: an initial supervised fine-tuning phase with distilled long chain-of-thoughts, followed by reinforcement learning with verifiable rewards. Experiments show that X-Reasoner successfully transfers reasoning capabilities to both multimodal and out-of-domain settings, outperforming existing state-of-the-art models trained with in-domain and multimodal data across various general and medical benchmarks (Figure 1). Additionally, we find that X-Reasoner's performance in specialized domains can be further enhanced through continued training on domain-specific text-only data. Building upon this, we introduce X-Reasoner-Med, a medical-specialized variant that achieves a new state of the art on numerous text-only and multimodal medical benchmarks.
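As a rough illustration of what a "verifiable reward" in the second stage can look like, the sketch below scores a model response by exact match against a reference answer. The `<answer>` tag format, the function names, and the 0/1 reward values are assumptions made for illustration only; the abstract does not specify how X-Reasoner computes its rewards.

```python
import re
from typing import Optional

# Hypothetical output format: the final answer is wrapped in <answer>...</answer>.
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL | re.IGNORECASE)


def extract_answer(response: str) -> Optional[str]:
    """Return the last tagged answer in the response, or None if no tag is present."""
    matches = ANSWER_PATTERN.findall(response)
    if not matches:
        return None
    return matches[-1].strip()


def verifiable_reward(response: str, reference: str) -> float:
    """Return 1.0 if the extracted answer matches the reference (case-insensitive), else 0.0."""
    answer = extract_answer(response)
    if answer is None:
        # Responses without a parseable answer earn no reward.
        return 0.0
    return 1.0 if answer.casefold() == reference.strip().casefold() else 0.0
```

A reward of this kind can be checked automatically, without a learned reward model, which is what makes it "verifiable"; a real setup would likely need more tolerant answer normalization than this exact-match comparison.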
