Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates

Abstract

Automatic LLM benchmarks, such as AlpacaEval 2.0, Arena-Hard-Auto, and MT-Bench, have become popular for evaluating language models because they are more cost-effective and scalable than human evaluation. Achieving high win rates on these benchmarks can significantly boost the promotional impact of newly released language models. This promotional benefit may motivate tricks, such as manipulating model output length or style to game win rates, even though several mechanisms have been developed to control length and disentangle style in order to reduce gameability. Nonetheless, we show that even a "null model" that always outputs a constant response (irrelevant to the input instructions) can cheat automatic benchmarks and achieve top-ranked win rates: an 86.5% LC win rate on AlpacaEval 2.0, a score of 83.0 on Arena-Hard-Auto, and a score of 9.55 on MT-Bench. Moreover, the crafted cheating outputs are transferable: we craft them under the assumption that the instructions of these benchmarks (e.g., the 805 samples of AlpacaEval 2.0) are private and cannot be accessed. While our experiments are primarily proof-of-concept, an adversary could use LLMs to generate more imperceptible cheating responses, unethically benefiting from high win rates and promotional impact. Our findings call for the development of anti-cheating mechanisms for reliable automatic benchmarks. The code is available at https://github.com/sail-sg/Cheating-LLM-Benchmarks.
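
To make the "null model" idea concrete, the following is a minimal sketch in Python of a model that ignores its input and always returns the same text. The class name `NullModel`, the `generate` method, and the placeholder string are assumptions made for illustration; they are not taken from the authors' repository, and the placeholder is not the crafted cheating output described in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NullModel:
    """Hypothetical null model: one fixed response for every instruction."""

    constant_response: str

    def generate(self, instruction: str) -> str:
        # The instruction is deliberately ignored; the output never changes.
        return self.constant_response


# Placeholder text only (an assumption), not the paper's crafted response.
model = NullModel(constant_response="<constant response>")

# Two unrelated example instructions, invented for this sketch.
instructions = ["Summarize this article.", "Write a haiku about rain."]
outputs = [model.generate(instruction) for instruction in instructions]
```

Because the output does not depend on the instruction, a single response can be submitted for every benchmark prompt without ever reading the prompts, which is consistent with the abstract's assumption that the instructions are private.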
