Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs

Abstract

Prompt sensitivity, the phenomenon in which paraphrasing a prompt (i.e., expressing the same content in different words) leads to significant changes in large language model (LLM) performance, has been widely accepted as a core limitation of LLMs. In this work, we revisit this issue and ask: is the widely reported high prompt sensitivity truly an inherent weakness of LLMs, or is it largely an artifact of evaluation processes? To answer this question, we systematically evaluate 7 LLMs (including the GPT and Gemini families) on 6 benchmarks spanning both multiple-choice and open-ended tasks, using 12 diverse prompt templates. We find that much of the prompt sensitivity stems from heuristic evaluation methods, including log-likelihood scoring and rigid answer matching, which often overlook semantically correct responses expressed through alternative phrasings such as synonyms or paraphrases. When we adopt LLM-as-a-Judge evaluation, we observe a substantial reduction in performance variance and a consistently higher correlation in model rankings across prompts. Our findings suggest that modern LLMs are more robust to prompt templates than previously believed, and that prompt sensitivity may be more an artifact of evaluation than a flaw in the models.
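To illustrate how rigid answer matching can inflate apparent prompt sensitivity, the following is a minimal sketch and not the paper's evaluation code. The questions, responses, alias sets, and function names are hypothetical toy values made up for illustration, and the alias-based `lenient_match` is only a cheap stand-in for an LLM judge, which would call a model instead.

```python
import re
from statistics import pstdev

# Hypothetical toy data: one model answering the same 4 questions
# under 3 prompt templates. None of this comes from the paper.
GOLD = ["Paris", "4", "H2O", "Jupiter"]
ALIASES = [{"paris"}, {"4", "four"}, {"h2o", "water"}, {"jupiter"}]
RESPONSES = {
    "template_a": ["Paris", "4", "H2O", "Jupiter"],
    "template_b": ["The answer is Paris.", "four", "Water", "Jupiter"],
    "template_c": ["paris", "4", "H2O", "It is Jupiter."],
}


def rigid_match(response: str, gold: str) -> bool:
    """Strict string equality: any rephrasing counts as wrong."""
    return response.strip() == gold


def lenient_match(response: str, accepted: set) -> bool:
    """Toy stand-in for a semantic judge: accept any listed surface form."""
    tokens = set(re.findall(r"[a-z0-9]+", response.lower()))
    return bool(tokens & accepted)


def accuracy_by_template(scorer) -> dict:
    """Accuracy per template, where scorer(response, question_index) -> bool."""
    result = {}
    for name, responses in RESPONSES.items():
        flags = [scorer(r, i) for i, r in enumerate(responses)]
        result[name] = sum(flags) / len(flags)
    return result


rigid = accuracy_by_template(lambda r, i: rigid_match(r, GOLD[i]))
lenient = accuracy_by_template(lambda r, i: lenient_match(r, ALIASES[i]))

# Spread of accuracy across templates under each scoring rule.
rigid_spread = pstdev(list(rigid.values()))
lenient_spread = pstdev(list(lenient.values()))

print("rigid:", rigid, "spread:", round(rigid_spread, 3))
print("lenient:", lenient, "spread:", round(lenient_spread, 3))
```

In this toy setup the model's answers are semantically correct under every template, so all of the spread in the rigid scores comes from the scoring rule rather than from the model. The same comparison could be extended to several models with a rank correlation across templates.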
