Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates

Abstract

Automatic LLM benchmarks, such as AlpacaEval 2.0, Arena-Hard-Auto, and MT-Bench, have become popular for evaluating language models due to their cost-effectiveness and scalability compared to human evaluation. Achieving high win rates on these benchmarks can significantly boost the promotional impact of newly released language models. This promotional benefit may motivate tricks, such as manipulating model output length or style to game win rates, even though several mechanisms have been developed to control length and disentangle style to reduce gameability. Nonetheless, we show that even a "null model" that always outputs a constant response (irrelevant to input instructions) can cheat automatic benchmarks and achieve top-ranked win rates: an 86.5% length-controlled (LC) win rate on AlpacaEval 2.0, an 83.0 score on Arena-Hard-Auto, and a 9.55 score on MT-Bench. Moreover, the crafted cheating outputs are transferable, because we assume that the instructions of these benchmarks (e.g., the 805 samples of AlpacaEval 2.0) are private and cannot be accessed. While our experiments are primarily proof-of-concept, an adversary could use LLMs to generate more imperceptible cheating responses, unethically benefiting from high win rates and promotional impact. Our findings call for the development of anti-cheating mechanisms for reliable automatic benchmarks. The code is available at https://github.com/sail-sg/Cheating-LLM-Benchmarks.
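To make the notion of a "null model" concrete, the following is a minimal sketch of a model that ignores its input and always returns the same string. The class name `NullModel`, its `generate` method, the sample instructions, and the placeholder response are all assumptions made for illustration; they are not taken from the authors' repository, and the crafted cheating response itself is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NullModel:
    """Hypothetical null model: one fixed response for every instruction."""

    constant_response: str

    def generate(self, instruction: str) -> str:
        # The instruction is deliberately ignored; the output never depends on it.
        return self.constant_response


# Placeholder text only (assumption); the real crafted response is not shown.
model = NullModel(constant_response="<constant response placeholder>")

# Illustrative instructions (assumption); benchmark instructions are treated as private.
instructions = [
    "Summarize this article.",
    "Write a haiku about autumn.",
    "What is 2 + 2?",
]

outputs = [model.generate(instruction) for instruction in instructions]

# Every instruction yields the identical response.
print(len(set(outputs)) == 1)
```

Because the response is independent of the instruction, the same output can be submitted unchanged to any benchmark, which is what makes the transferability claim in the abstract meaningful.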
