TAGS: A Test-Time Generalist-Specialist Framework with Retrieval-Augmented Reasoning and Verification

Abstract

Recent advances such as Chain-of-Thought prompting have significantly improved the zero-shot medical reasoning of large language models (LLMs). However, prompting-based methods often remain shallow and unstable, while fine-tuned medical LLMs generalize poorly under distribution shift and adapt poorly to unseen clinical scenarios. To address these limitations, we present TAGS, a test-time framework that combines a broadly capable generalist with a domain-specific specialist to offer complementary perspectives, without any model fine-tuning or parameter updates. To support this generalist-specialist reasoning process, we introduce two auxiliary modules: a hierarchical retrieval mechanism that provides multi-scale exemplars by selecting examples based on both semantic and rationale-level similarity, and a reliability scorer that evaluates reasoning consistency to guide final answer aggregation. TAGS achieves strong performance across nine MedQA benchmarks, boosting GPT-4o accuracy by 13.8% and DeepSeek-R1 accuracy by 16.8%, and improving a vanilla 7B model from 14.1% to 23.9%. These results surpass several fine-tuned medical LLMs without any parameter updates. The code will be available at https://github.com/JianghaoWu/TAGS.
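The two auxiliary modules can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the embedding model, the similarity measure, or the aggregation rule, so cosine similarity over precomputed vectors, the two-stage top-k filter, and score-weighted voting are all assumptions made here for illustration. Every function name, dictionary key, and parameter below is hypothetical.

```python
from collections import defaultdict
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hierarchical_retrieve(query_vec, rationale_vec, pool, k_coarse=4, k_fine=2):
    """Assumed two-stage retrieval.

    Stage 1 keeps the k_coarse exemplars whose question embedding ("q_vec")
    is closest to the query. Stage 2 reranks those survivors by how close
    their rationale embedding ("r_vec") is to a draft rationale, and keeps
    the top k_fine.
    """
    coarse = sorted(
        pool, key=lambda ex: cosine(query_vec, ex["q_vec"]), reverse=True
    )[:k_coarse]
    return sorted(
        coarse, key=lambda ex: cosine(rationale_vec, ex["r_vec"]), reverse=True
    )[:k_fine]


def aggregate(candidates):
    """Assumed reliability-weighted vote.

    candidates is a list of (answer, reliability_score) pairs, e.g. one per
    generalist or specialist reasoning path. The answer with the highest
    summed reliability wins.
    """
    totals = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score
    return max(totals, key=totals.get)


# Hypothetical exemplar pool with 2-dimensional toy embeddings.
pool = [
    {"id": "a", "q_vec": [1.0, 0.0], "r_vec": [0.0, 1.0]},
    {"id": "b", "q_vec": [0.9, 0.1], "r_vec": [1.0, 0.0]},
    {"id": "c", "q_vec": [0.0, 1.0], "r_vec": [1.0, 0.0]},
]

selected = hierarchical_retrieve([1.0, 0.0], [1.0, 0.0], pool, k_coarse=2, k_fine=1)
final_answer = aggregate([("A", 0.9), ("B", 0.6), ("B", 0.5)])
```

In this toy run, exemplar "c" is dropped at the semantic stage even though its rationale matches, and "b" beats "a" at the rationale stage. The vote picks "B" because two moderately reliable paths outweigh one highly reliable path under the assumed summation rule; a different aggregation rule would behave differently.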
