A Single Character Can Make or Break Your LLM Evals

Abstract

Common large language model (LLM) evaluations rely on demonstration examples to steer models' responses toward the desired style. While the number of examples used has been studied and standardized, the choice of how to format those examples is less investigated. In evaluation protocols and real-world usage, users must choose how to separate in-context examples: with a comma, a new line, a semicolon, a hashtag, or something else. Surprisingly, we find that this seemingly minor choice can dramatically alter model response quality. Across leading model families (Llama, Qwen, Gemma), performance on MMLU, for example, can vary by ±23% depending on the choice of delimiter. In fact, one can manipulate model rankings to put any model in the lead by modifying only the single character that separates examples. We find that this brittleness pervades topics and model families and does not improve with scale. By probing attention head scores, we find that well-performing delimiters steer attention toward key tokens in the input. Finally, we explore methods to improve LLMs' robustness to the choice of delimiter. We find that specifying the selected delimiter in the prompt boosts robustness, and we offer practical recommendations on which delimiters perform best.
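To make the setup concrete, the following is a minimal sketch of how the same few-shot prompt can be built with different example delimiters, and how the delimiter can be named in the prompt as the abstract suggests. The `Q:`/`A:` template, the helper name `build_prompt`, the `DELIMITERS` table, and the wording of the instruction sentence are all illustrative assumptions, not the authors' actual evaluation harness.

```python
# Illustrative sketch only: the template, names, and instruction wording
# below are assumptions, not taken from the paper.
DELIMITERS = {"newline": "\n", "comma": ",", "semicolon": ";", "hashtag": "#"}


def build_prompt(examples, query, delimiter, delimiter_name=None):
    """Join few-shot examples with `delimiter` and append the open query.

    If `delimiter_name` is given, prepend a sentence that states which
    delimiter separates the examples (the mitigation the abstract mentions).
    """
    shots = [f"Q: {q} A: {a}" for q, a in examples]
    body = delimiter.join(shots + [f"Q: {query} A:"])
    if delimiter_name is not None:
        return f"Examples are separated by a {delimiter_name}.\n{body}"
    return body


examples = [("2+2?", "4"), ("Capital of France?", "Paris")]

# One prompt per delimiter; everything except the separator is identical.
prompts = {
    name: build_prompt(examples, "3+5?", delim)
    for name, delim in DELIMITERS.items()
}

# Variant that names the delimiter explicitly in the prompt.
explicit = build_prompt(examples, "3+5?", ";", delimiter_name="semicolon")
```

Scoring each entry of `prompts` on the same benchmark questions would expose the delimiter sensitivity described above, since the prompts differ only in the separating character.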
