Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates
October 9, 2024
Authors: Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, Min Lin
cs.AI
Abstract
Automatic LLM benchmarks, such as AlpacaEval 2.0, Arena-Hard-Auto, and
MT-Bench, have become popular for evaluating language models due to their
cost-effectiveness and scalability compared to human evaluation. Achieving high
win rates on these benchmarks can significantly boost the promotional impact of
newly released language models. This promotional benefit may motivate tricks,
such as manipulating model output length or style to game win rates, even
though several mechanisms have been developed to control length and disentangle
style to reduce gameability. Nonetheless, we show that even a "null model" that
always outputs a constant response (irrelevant to input instructions) can cheat
automatic benchmarks and achieve top-ranked win rates: an 86.5% LC win rate on
AlpacaEval 2.0; an 83.0 score on Arena-Hard-Auto; and a 9.55 score on MT-Bench.
Moreover, the crafted cheating outputs are transferable: they are constructed under the assumption that
the instructions of these benchmarks (e.g., the 805 samples of AlpacaEval 2.0) are
private and cannot be accessed. While our experiments are primarily
proof-of-concept, an adversary could use LLMs to generate more imperceptible
cheating responses, unethically benefiting from high win rates and promotional
impact. Our findings call for the development of anti-cheating mechanisms for
reliable automatic benchmarks. The code is available at
https://github.com/sail-sg/Cheating-LLM-Benchmarks.
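
To make the "null model" idea concrete, the following is a minimal sketch and not the authors' implementation. It only illustrates that such a model ignores its input and returns one fixed string for every instruction. The class name `NullModel`, the helper `build_submission`, the record keys, the sample instructions, and the placeholder text are all hypothetical; the actual crafted cheating response is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NullModel:
    """Hypothetical null model: one fixed response for every instruction."""

    constant_response: str

    def generate(self, instruction: str) -> str:
        # The instruction is deliberately ignored.
        return self.constant_response


def build_submission(model: NullModel, instructions: list[str]) -> list[dict]:
    # Record layout is an assumption made for illustration only.
    return [
        {"instruction": text, "output": model.generate(text)}
        for text in instructions
    ]


# Placeholder only; this is NOT the crafted response from the paper.
PLACEHOLDER = "<constant response goes here>"

null_model = NullModel(PLACEHOLDER)
instructions = [
    "Explain recursion.",
    "Translate 'hello' into French.",
    "Write a haiku about rain.",
]
submission = build_submission(null_model, instructions)

# Every output is identical, whatever the instruction was.
distinct_outputs = {row["output"] for row in submission}
print(len(submission), len(distinct_outputs))
```

Because the output never depends on the instruction, the same constant string can be prepared without seeing the benchmark's test instructions, which matches the private-instruction assumption stated in the abstract.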