

m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning with Large Language Models

April 1, 2025
Authors: Xiaoke Huang, Juncheng Wu, Hui Liu, Xianfeng Tang, Yuyin Zhou
cs.AI

Abstract

Test-time scaling has emerged as a powerful technique for enhancing the reasoning capabilities of large language models. However, its effectiveness in medical reasoning remains uncertain, as the medical domain fundamentally differs from mathematical tasks in terms of knowledge representation and decision-making processes. In this paper, we provide the first comprehensive investigation of test-time scaling for medical reasoning and present m1, a simple yet effective approach that increases a model's medical reasoning capability at inference. Our evaluation across diverse medical tasks demonstrates that test-time scaling consistently enhances medical reasoning, enabling lightweight fine-tuned models under 10B parameters to establish new state-of-the-art performance, while our 32B model rivals previous 70B-scale medical LLMs. However, we identify an optimal reasoning token budget of approximately 4K, beyond which performance may degrade due to overthinking. Budget forcing, which extends test-time computation through iterative prompts, helps models double-check answers but does not necessarily improve overall medical QA performance and, in some cases, even introduces errors into previously correct responses. Our case-by-case analysis identifies insufficient medical knowledge as a key bottleneck that prevents further performance gains through test-time scaling. We find that increasing data scale, improving data quality, and expanding model capacity consistently enhance medical knowledge grounding, enabling continued performance improvements, particularly on challenging medical benchmarks where smaller models reach saturation. These findings underscore fundamental differences between medical and mathematical reasoning in LLMs, highlighting that enriched medical knowledge, rather than increased reasoning depth alone, is essential for realizing the benefits of test-time scaling.
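
The two inference-time knobs the abstract describes, a cap on reasoning tokens (around 4K) and budget forcing through iterative prompts, can be summarized in a short sketch. The snippet below is illustrative only: `llm_generate`, the prompt wording, and the "Wait, let me double-check" continuation cue are assumptions standing in for whatever completion API and prompts the authors actually use, not the m1 implementation itself.

```python
from typing import Callable, Tuple

REASONING_BUDGET = 4096   # ~4K tokens, the budget the abstract reports as near-optimal
MAX_FORCED_ROUNDS = 2     # how many "keep thinking" continuations to allow

def answer_with_budget(
    question: str,
    llm_generate: Callable[[str, int], Tuple[str, int]],
) -> str:
    """Answer a medical question under a total reasoning-token budget,
    optionally extending test-time computation with budget forcing.

    `llm_generate(prompt, max_new_tokens)` is a hypothetical completion
    function assumed to return the generated text and the number of tokens
    it actually produced.
    """
    prompt = (
        "Think step by step about the following medical question, "
        "then state a final answer.\n\n" + question + "\n"
    )
    remaining = REASONING_BUDGET
    reasoning, used = llm_generate(prompt, remaining)
    remaining -= used

    # Budget forcing: while budget remains, append a continuation cue so the
    # model re-examines its own answer with extra compute. Per the abstract,
    # this double-checking does not always improve accuracy and can even
    # overturn a previously correct answer.
    full_prompt = prompt + reasoning
    for _ in range(MAX_FORCED_ROUNDS):
        if remaining <= 0:
            break
        full_prompt += "\nWait, let me double-check that reasoning."
        continuation, used = llm_generate(full_prompt, remaining)
        full_prompt += continuation
        remaining -= used

    # Everything after the original question is the reasoning trace,
    # including any forced continuations.
    return full_prompt[len(prompt):]
```

The design point of the sketch is that extra test-time compute is bounded: once the roughly 4K-token budget is spent, further forced continuations stop, reflecting the paper's observation that longer reasoning beyond that point tends to hurt rather than help.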
