作弊自动LLM基准测试：空模型取得高胜率

摘要

自动LLM基准测试，如AlpacaEval 2.0、Arena-Hard-Auto和MT-Bench，因其与人工评估相比具有成本效益和可扩展性而变得流行。在这些基准测试中取得高胜率可以显著提升新发布语言模型的推广影响。这种推广效益可能会激励一些技巧，例如操纵模型输出长度或风格以提高胜率，尽管已经开发了几种机制来控制长度和解开风格以减少可玩性。然而，我们展示即使是一个始终输出恒定响应（与输入指令无关）的“空模型”也可以欺骗自动基准测试并取得排名靠前的胜率：在AlpacaEval 2.0上取得86.5%的LC胜率；在Arena-Hard-Auto上得分83.0；在MT-Bench上得分9.55。此外，精心制作的作弊输出是可转移的，因为我们假设这些基准测试的指令（例如AlpacaEval 2.0的805个样本）是私有的且无法访问。虽然我们的实验主要是概念验证，但对手可以利用LLM生成更不易察觉的作弊响应，不道德地从高胜率和推广影响中获益。我们的发现呼吁开发可靠自动基准测试的防作弊机制。代码可在https://github.com/sail-sg/Cheating-LLM-Benchmarks找到。

English

Automatic LLM benchmarks, such as AlpacaEval 2.0, Arena-Hard-Auto, and MT-Bench, have become popular for evaluating language models due to their cost-effectiveness and scalability compared to human evaluation. Achieving high win rates on these benchmarks can significantly boost the promotional impact of newly released language models. This promotional benefit may motivate tricks, such as manipulating model output length or style to game win rates, even though several mechanisms have been developed to control length and disentangle style to reduce gameability. Nonetheless, we show that even a "null model" that always outputs a constant response (irrelevant to input instructions) can cheat automatic benchmarks and achieve top-ranked win rates: an 86.5% LC win rate on AlpacaEval 2.0; an 83.0 score on Arena-Hard-Auto; and a 9.55 score on MT-Bench. Moreover, the crafted cheating outputs are transferable because we assume that the instructions of these benchmarks (e.g., 805 samples of AlpacaEval 2.0) are private and cannot be accessed. While our experiments are primarily proof-of-concept, an adversary could use LLMs to generate more imperceptible cheating responses, unethically benefiting from high win rates and promotional impact. Our findings call for the development of anti-cheating mechanisms for reliable automatic benchmarks. The code is available at https://github.com/sail-sg/Cheating-LLM-Benchmarks.

作弊自动LLM基准测试：空模型取得高胜率

Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates

摘要

Support