m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning with Large Language Models

Abstract

Test-time scaling has emerged as a powerful technique for enhancing the reasoning capabilities of large language models. However, its effectiveness in medical reasoning remains uncertain, as the medical domain fundamentally differs from mathematical tasks in terms of knowledge representation and decision-making processes. In this paper, we provide the first comprehensive investigation of test-time scaling for medical reasoning and present m1, a simple yet effective approach that increases a model's medical reasoning capability at inference. Our evaluation across diverse medical tasks demonstrates that test-time scaling consistently enhances medical reasoning, enabling lightweight fine-tuned models under 10B parameters to establish new state-of-the-art performance, while our 32B model rivals previous 70B-scale medical LLMs. However, we identify an optimal reasoning token budget of approximately 4K, beyond which performance may degrade due to overthinking. Budget forcing, which extends test-time computation through iterative prompts, helps models double-check answers but does not necessarily improve overall medical QA performance and, in some cases, even introduces errors into previously correct responses. Our case-by-case analysis identifies insufficient medical knowledge as a key bottleneck that prevents further performance gains through test-time scaling. We find that increasing data scale, improving data quality, and expanding model capacity consistently enhance medical knowledge grounding, enabling continued performance improvements, particularly on challenging medical benchmarks where smaller models reach saturation. These findings underscore fundamental differences between medical and mathematical reasoning in LLMs, highlighting that enriched medical knowledge, rather than increased reasoning depth alone, is essential for realizing the benefits of test-time scaling.
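
To make the two test-time controls mentioned above concrete, the following is a minimal sketch of a capped reasoning budget combined with budget forcing. It is an illustration under stated assumptions, not the m1 implementation: the `step` callable, the `"Wait"` continuation cue, the word-level "tokens", and the `toy_step` stand-in model are all hypothetical, and the default of 4096 only mirrors the approximate 4K budget reported in the abstract.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: given the prompt, the reasoning trace so far, and a
# token limit, return (new_tokens, stopped_naturally).
StepFn = Callable[[str, List[str], int], Tuple[List[str], bool]]


def budget_forced_reasoning(
    prompt: str,
    step: StepFn,
    max_thinking_tokens: int = 4096,    # assumed cap, echoing the ~4K budget
    num_forced_continuations: int = 0,  # how many times to push past a natural stop
    continue_cue: str = "Wait",         # assumed cue; any re-check prompt would do
) -> List[str]:
    """Collect a reasoning trace under a hard token cap, optionally forcing
    the model to keep thinking after it tries to stop."""
    trace: List[str] = []
    forced = 0
    while len(trace) < max_thinking_tokens:
        remaining = max_thinking_tokens - len(trace)
        new_tokens, stopped = step(prompt, trace, remaining)
        trace.extend(new_tokens[:remaining])
        if not stopped:
            break  # the budget ran out before the model finished thinking
        if forced >= num_forced_continuations:
            break  # the model stopped on its own and no more forcing is requested
        trace.append(continue_cue)  # iterative prompt: ask the model to re-check
        forced += 1
    return trace[:max_thinking_tokens]


def toy_step(prompt: str, trace: List[str], limit: int) -> Tuple[List[str], bool]:
    """Stand-in for a real model: always wants to emit three reasoning tokens."""
    tokens = ["step"] * 3
    return tokens[:limit], len(tokens) <= limit


if __name__ == "__main__":
    print(budget_forced_reasoning("Which drug is first-line?", toy_step, 10, 1))
```

In a real setup, `step` would wrap a call to the language model's decoding loop. The sketch only shows where the cap and the forced continuation enter, which is also where the abstract locates the risk of overthinking and of overturning a previously correct answer.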
