Enhancing Step-by-Step and Verifiable Medical Reasoning in MLLMs

Abstract

Multimodal large language models (MLLMs) have begun to demonstrate robust reasoning capabilities on general tasks, yet their application in the medical domain remains in its early stages. Constructing chain-of-thought (CoT) training data is essential for strengthening the reasoning abilities of medical MLLMs. However, existing approaches lack a comprehensive framework for searching and evaluating effective reasoning paths toward a critical diagnosis. To address this challenge, we propose Mentor-Intern Collaborative Search (MICS), a novel reasoning-path search scheme for generating rigorous and effective medical CoT data. MICS first leverages mentor models to initialize the reasoning one step at a time, then prompts each intern model to continue the thinking along those initiated paths, and finally selects the optimal reasoning path according to the overall reasoning performance of multiple intern models. The reasoning performance is determined by an MICS-Score, which assesses the quality of generated reasoning paths. Ultimately, we construct MMRP, a multi-task medical reasoning dataset with ranked difficulty, and Chiron-o1, a new medical MLLM developed via a curriculum learning strategy, with robust visual question-answering and generalizable reasoning capabilities. Extensive experiments demonstrate that Chiron-o1, trained on our CoT dataset constructed using MICS, achieves state-of-the-art performance across a range of medical visual question answering and reasoning benchmarks. Code is available in the GitHub repository manglu097/Chiron-o1.
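The three stages of MICS described above (mentor initialization, intern continuation, score-based selection) can be illustrated with a minimal sketch. This is not the released implementation. The abstract does not define the MICS-Score, so the sketch assumes it is the fraction of intern models that reach the reference answer when continuing from a candidate path. The callable signatures, the `max_steps` limit, the early-stop rule, and the toy mentors and interns are all hypothetical stand-ins for real models.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
# a mentor proposes the next reasoning step given the question and the path so far;
# an intern continues from the path and returns a final answer.
Mentor = Callable[[str, List[str]], str]
Intern = Callable[[str, List[str]], str]


def mics_score(question: str, path: List[str], interns: Sequence[Intern],
               reference: str) -> float:
    """Assumed score: share of interns that reach the reference answer from `path`."""
    hits = sum(intern(question, path) == reference for intern in interns)
    return hits / len(interns)


def mics_search(question: str, reference: str, mentors: Sequence[Mentor],
                interns: Sequence[Intern], max_steps: int = 4) -> List[str]:
    """Grow a reasoning path one step at a time, keeping the best-scored candidate."""
    path: List[str] = []
    for _ in range(max_steps):
        # Each mentor proposes one next step, giving one candidate path per mentor.
        candidates = [path + [mentor(question, path)] for mentor in mentors]
        # Interns continue each candidate; the score summarizes how well they do.
        scored: List[Tuple[float, List[str]]] = [
            (mics_score(question, cand, interns, reference), cand)
            for cand in candidates
        ]
        best_score, path = max(scored, key=lambda item: item[0])
        if best_score == 1.0:  # assumed stop rule: every intern succeeds
            break
    return path


# Toy stand-ins for real models, used only to make the sketch runnable.
def good_mentor(question: str, path: List[str]) -> str:
    return f"fact-{len(path) + 1}"


def bad_mentor(question: str, path: List[str]) -> str:
    return "noise"


def make_intern(needed_facts: int) -> Intern:
    """Toy intern that answers correctly once the path holds enough useful steps."""
    def intern(question: str, path: List[str]) -> str:
        facts = sum(step.startswith("fact-") for step in path)
        return "pneumonia" if facts >= needed_facts else "unknown"
    return intern


if __name__ == "__main__":
    interns = [make_intern(1), make_intern(2)]
    print(mics_search("What does the X-ray show?", "pneumonia",
                      [good_mentor, bad_mentor], interns))
```

In the toy run, the path proposed by `good_mentor` is kept at each step because more interns succeed when continuing from it, and the search stops once both interns reach the reference answer.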