
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations

July 17, 2025
作者: Carlos Arriaga, Gonzalo Martínez, Eneko Sendin, Javier Conde, Pedro Reviriego
cs.AI

Abstract

The evaluation of large language models is a complex task for which several approaches have been proposed. The most common is the use of automated benchmarks in which LLMs answer multiple-choice questions on different topics. However, this method has limitations, the most concerning being its poor correlation with human judgment. An alternative is to have humans evaluate the LLMs directly, but this poses scalability issues: the large and growing number of models to evaluate makes it impractical (and costly) to run traditional studies that recruit evaluators and have them rank model responses. Another alternative is the use of public arenas, such as the popular LM Arena, in which any user can freely evaluate models on any question and rank the responses of two models; the results are then aggregated into a model ranking. The energy consumption of LLMs is an increasingly important aspect, so it is of interest to evaluate how energy awareness influences humans' decisions when selecting a model. In this paper, we present GEA, the Generative Energy Arena, an arena that incorporates information on the models' energy consumption into the evaluation process. We also present preliminary results obtained with GEA, showing that for most questions, when users are aware of the energy consumption, they favor smaller and more energy-efficient models. This suggests that for most user interactions, the extra cost and energy incurred by the more complex, top-performing models do not yield an increase in perceived response quality that justifies their use.
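The abstract mentions that pairwise arena votes are aggregated into a model ranking. A common way public arenas do this is with Elo-style rating updates; the exact aggregation used by GEA is not specified here, so the following is an illustrative sketch only (the model names and vote log are hypothetical).

```python
# Illustrative sketch: turning pairwise arena votes into a model ranking
# via Elo-style updates. This is a common choice for public arenas, not
# necessarily the method used by GEA.

def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Update two ratings after one head-to-head comparison.

    score_a is 1.0 if model A's response was preferred,
    0.0 if model B's was preferred, and 0.5 for a tie.
    """
    # Expected score of A under the logistic Elo model.
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Hypothetical vote log: users mostly prefer the smaller model's answers.
ratings = {"small-efficient": 1000.0, "large-flagship": 1000.0}
votes = ["small", "small", "large", "small"]
for v in votes:
    score = 1.0 if v == "small" else 0.0
    ratings["small-efficient"], ratings["large-flagship"] = elo_update(
        ratings["small-efficient"], ratings["large-flagship"], score
    )
```

After processing the votes, sorting the `ratings` dictionary by value gives the arena's leaderboard; with mostly wins for the smaller model, it ends up ranked above the larger one.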