Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates

Abstract

Automatic LLM benchmarks, such as AlpacaEval 2.0, Arena-Hard-Auto, and MT-Bench, have become popular for evaluating language models because they are cheaper and more scalable than human evaluation. Achieving high win rates on these benchmarks can significantly boost the promotional impact of newly released language models. This promotional benefit may motivate tricks, such as manipulating model output length or style to game win rates, even though several mechanisms have been developed to control length and disentangle style to reduce gameability. Nonetheless, we show that even a "null model" that always outputs a constant response (irrelevant to input instructions) can cheat automatic benchmarks and achieve top-ranked win rates: an 86.5% LC win rate on AlpacaEval 2.0; an 83.0 score on Arena-Hard-Auto; and a 9.55 score on MT-Bench. Moreover, the crafted cheating outputs are transferable: we craft them under the assumption that the instructions of these benchmarks (e.g., the 805 samples of AlpacaEval 2.0) are private and cannot be accessed. While our experiments are primarily proof-of-concept, an adversary could use LLMs to generate more imperceptible cheating responses, unethically benefiting from high win rates and promotional impact. Our findings call for the development of anti-cheating mechanisms for reliable automatic benchmarks. The code is available at https://github.com/sail-sg/Cheating-LLM-Benchmarks.
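
To make the notion of a "null model" concrete, the following is a minimal Python sketch of a model whose output never depends on the instruction. The class name `NullModel`, the `generate` method, and the placeholder response string are illustrative assumptions and not the authors' implementation; the crafted cheating response itself is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NullModel:
    """Hypothetical stand-in for a "null model": it ignores its input."""

    constant_response: str

    def generate(self, instruction: str) -> str:
        # The instruction is deliberately unused, so every query
        # receives exactly the same response.
        return self.constant_response


# Placeholder text only (assumption); a real attack would use a crafted response.
model = NullModel(constant_response="<constant response goes here>")

instructions = [
    "Summarize this article.",
    "Write a haiku about rain.",
    "What is 2 + 2?",
]
outputs = [model.generate(instruction) for instruction in instructions]
```

Because `generate` never reads the instruction, such a model can be scored on a benchmark without access to the benchmark's instructions, which is the setting the abstract assumes.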
