MedFuzz: Exploring the Robustness of Large Language Models in Medical Question Answering

Abstract

Large language models (LLMs) have achieved impressive performance on medical question-answering benchmarks. However, high benchmark accuracy does not imply that this performance generalizes to real-world clinical settings. Medical question-answering benchmarks rely on assumptions that are convenient for quantifying LLM performance but that may not hold in the open world of the clinic. Yet LLMs learn broad knowledge that can help them generalize to practical conditions regardless of unrealistic assumptions in celebrated benchmarks. We seek to quantify how well LLM medical question-answering benchmark performance generalizes when benchmark assumptions are violated. Specifically, we present an adversarial method that we call MedFuzz (for medical fuzzing). MedFuzz attempts to modify benchmark questions in ways aimed at confounding the LLM. We demonstrate the approach by targeting strong assumptions about patient characteristics presented in the MedQA benchmark. Successful "attacks" modify a benchmark item in ways that would be unlikely to fool a medical expert but nonetheless "trick" the LLM into changing from a correct to an incorrect answer. Further, we present a permutation test technique that can check whether a successful attack is statistically significant. We show how to use both performance on a "MedFuzzed" benchmark and individual successful attacks. The methods show promise at providing insights into the ability of an LLM to operate robustly in more realistic settings.

