The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations

Abstract

Evaluating large language models (LLMs) is a complex task for which several approaches have been proposed. The most common is the use of automated benchmarks in which LLMs answer multiple-choice questions on a variety of topics. However, this method has limitations, the most concerning being its poor correlation with human judgment. An alternative is to have humans evaluate the LLMs. This poses scalability issues: the number of models to evaluate is large and growing, which makes it impractical (and costly) to run traditional studies that recruit a group of evaluators and have them rank the models' responses. A further alternative is the use of public arenas, such as the popular LM Arena, in which any user can freely evaluate models on any question and rank the responses of two models. The results are then aggregated into a model ranking. An increasingly important aspect of LLMs is their energy consumption, so it is of interest to evaluate how energy awareness influences the decisions of humans when selecting a model. In this paper, we present GEA, the Generative Energy Arena, an arena that incorporates information on the energy consumption of the models into the evaluation process. We also present preliminary results obtained with GEA, which show that for most questions, when users are aware of the energy consumption, they favor smaller and more energy-efficient models. This suggests that for most user interactions, the extra cost and energy incurred by the more complex and top-performing models do not yield an increase in the perceived quality of the responses that justifies their use.
