From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond

Abstract

Run-time steering strategies like Medprompt are valuable for guiding large language models (LLMs) to top performance on challenging tasks. Medprompt demonstrates that a general LLM can be focused to deliver state-of-the-art performance on specialized domains like medicine by using a prompt to elicit a run-time strategy involving chain-of-thought reasoning and ensembling. OpenAI's o1-preview model represents a new paradigm, where a model is designed to do run-time reasoning before generating final responses. We seek to understand the behavior of o1-preview on a diverse set of medical challenge problem benchmarks. Following on the Medprompt study with GPT-4, we systematically evaluate the o1-preview model across various medical benchmarks. Notably, even without prompting techniques, o1-preview largely outperforms the GPT-4 series with Medprompt. We further systematically study the efficacy of classic prompt engineering strategies, as represented by Medprompt, within the new paradigm of reasoning models. We find that few-shot prompting hinders o1's performance, suggesting that in-context learning may no longer be an effective steering approach for reasoning-native models. While ensembling remains viable, it is resource-intensive and requires careful cost-performance optimization. Our cost and accuracy analysis across run-time strategies reveals a Pareto frontier, with GPT-4o representing a more affordable option and o1-preview achieving state-of-the-art performance at higher cost. Although o1-preview offers top performance, GPT-4o with steering strategies like Medprompt retains value in specific contexts. Moreover, we note that the o1-preview model has reached near-saturation on many existing medical benchmarks, underscoring the need for new, challenging benchmarks. We close with reflections on general directions for inference-time computation with LLMs.
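To make the cost and accuracy analysis concrete, the following is a minimal sketch of how a Pareto frontier over run-time strategies could be computed. It is an illustration only, not the authors' analysis code: the `Run` type, the `pareto_frontier` helper, the configuration names, and all numbers are hypothetical placeholders rather than results from the paper.

```python
from typing import NamedTuple


class Run(NamedTuple):
    """One (model, run-time strategy) configuration. Hypothetical type for illustration."""
    name: str
    cost: float      # total inference cost, in arbitrary placeholder units
    accuracy: float  # fraction of benchmark questions answered correctly


def pareto_frontier(runs: list[Run]) -> list[Run]:
    """Return the runs that no cheaper-or-equal-cost run matches or beats on accuracy."""
    frontier: list[Run] = []
    best_accuracy = float("-inf")
    # Visit runs from cheapest to most expensive; among equal costs,
    # visit the most accurate first so the others are correctly skipped.
    for run in sorted(runs, key=lambda r: (r.cost, -r.accuracy)):
        if run.accuracy > best_accuracy:
            frontier.append(run)
            best_accuracy = run.accuracy
    return frontier


# Placeholder values only; these are NOT measurements from the paper.
example_runs = [
    Run("config_a", cost=1.0, accuracy=0.80),
    Run("config_b", cost=2.0, accuracy=0.78),  # dominated by config_a
    Run("config_c", cost=3.0, accuracy=0.90),
    Run("config_d", cost=3.0, accuracy=0.85),  # dominated by config_c
]

frontier_names = [run.name for run in pareto_frontier(example_runs)]
print(frontier_names)
```

A configuration stays on the frontier only if reaching its accuracy requires paying at least its cost, which is the sense in which a cheaper option and a more accurate, more expensive option can both be reasonable choices.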
