MedFuzz: Exploring the Robustness of Large Language Models in Medical Question Answering

Abstract

Large language models (LLMs) have achieved impressive performance on medical question-answering benchmarks. However, high benchmark accuracy does not imply that this performance generalizes to real-world clinical settings. Medical question-answering benchmarks rely on assumptions that are consistent with quantifying LLM performance but that may not hold in the open world of the clinic. Yet LLMs learn broad knowledge that may help them generalize to practical conditions regardless of unrealistic assumptions in celebrated benchmarks. We seek to quantify how well LLM performance on medical question-answering benchmarks generalizes when benchmark assumptions are violated. Specifically, we present an adversarial method that we call MedFuzz (for medical fuzzing). MedFuzz attempts to modify benchmark questions in ways aimed at confounding the LLM. We demonstrate the approach by targeting strong assumptions about patient characteristics presented in the MedQA benchmark. A successful "attack" modifies a benchmark item in a way that would be unlikely to fool a medical expert but nonetheless "tricks" the LLM into changing from a correct to an incorrect answer. Further, we present a permutation test technique that can ensure a successful attack is statistically significant. We show how to use performance on a "MedFuzzed" benchmark, as well as individual successful attacks. These methods show promise at providing insights into the ability of an LLM to operate robustly in more realistic settings.
