Exploring the Boundaries of GPT-4 in Radiology

Abstract

The recent success of general-domain large language models (LLMs) has significantly shifted the natural language processing paradigm towards a unified foundation model across domains and applications. In this paper, we assess the performance of GPT-4, the most capable LLM so far, on text-based applications for radiology reports, comparing it against state-of-the-art (SOTA) radiology-specific models. Exploring various prompting strategies, we evaluated GPT-4 on a diverse range of common radiology tasks and found that it either outperforms or is on par with current SOTA radiology models. With zero-shot prompting, GPT-4 already obtains substantial gains (approximately 10% absolute improvement) over radiology models in temporal sentence similarity classification (accuracy) and natural language inference (F_1). For tasks that require learning a dataset-specific style or schema (e.g. findings summarisation), GPT-4 improves with example-based prompting and matches supervised SOTA. Our extensive error analysis with a board-certified radiologist shows that GPT-4 has a sufficient level of radiology knowledge, with only occasional errors in complex contexts that require nuanced domain knowledge. For findings summarisation, GPT-4 outputs are found to be overall comparable with existing manually written impressions.
