Language Models And A Second Opinion Use Case: The Pocket Professional

Abstract

This research tests the role of Large Language Models (LLMs) as formal second-opinion tools in professional decision-making, focusing on complex medical cases in which even experienced physicians seek peer consultation. The work analyzed 183 challenging medical cases from Medscape over a 20-month period, comparing the performance of multiple LLMs against crowd-sourced physician responses. A key finding was the high overall score achievable by the latest foundation models (>80% accuracy relative to consensus opinion), which exceeds most human metrics reported on the same clinical cases (450 pages of patient profiles and test results). The study also quantifies the disparity in LLM performance between straightforward cases (>81% accuracy) and complex scenarios (43% accuracy), the latter being the cases that generated substantial debate among human physicians. The research suggests that LLMs may be more valuable as generators of comprehensive differential diagnoses than as primary diagnostic tools, potentially helping to counter cognitive biases in clinical decision-making, reduce cognitive load, and thereby remove some sources of medical error. A second, comparative legal dataset (Supreme Court cases, N=21) provides additional empirical context for using AI to foster second opinions, although these legal challenges proved considerably easier for LLMs to analyze. Beyond its empirical evidence on LLM accuracy, the research assembled a novel benchmark that others can use to score the reliability of answers to highly contested questions, for both LLMs and disagreeing human practitioners. These results suggest that the optimal deployment of LLMs in professional settings may differ substantially from current approaches that emphasize automation of routine tasks.
