The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations
July 17, 2025
Authors: Carlos Arriaga, Gonzalo Martínez, Eneko Sendin, Javier Conde, Pedro Reviriego
cs.AI
Abstract
The evaluation of large language models is a complex task for which several approaches have been proposed. The most common is the use of automated benchmarks in which LLMs have to answer multiple-choice questions on different topics. However, this method has certain limitations, the most concerning being its poor correlation with human judgment. An alternative approach is to have humans evaluate the LLMs. This poses scalability issues, as the large and growing number of models to evaluate makes it impractical (and costly) to run traditional studies based on recruiting evaluators and having them rank the responses of the models. Another alternative is the use of public arenas, such as the popular LM Arena, in which any user can freely evaluate models on any question and rank the responses of two models; the results are then aggregated into a model ranking. An increasingly important aspect of LLMs is their energy consumption, and therefore it is of interest to evaluate how energy awareness influences the decisions of humans when selecting a model. In this paper, we present GEA, the Generative Energy Arena, an arena that incorporates information on the energy consumption of the models into the evaluation process. Preliminary results obtained with GEA are also presented, showing that, for most questions, when users are aware of the energy consumption, they favor smaller and more energy-efficient models. This suggests that, for most user interactions, the extra cost and energy incurred by the more complex and top-performing models do not provide an increase in the perceived quality of the responses that justifies their use.
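
The abstract describes the arena mechanism only at a high level. As a minimal, hypothetical sketch (not the authors' implementation), the following Python illustrates how an energy-aware arena might work: a per-response energy estimate is computed and shown alongside two model answers, the user's preference is recorded, and pairwise votes are aggregated into a leaderboard via Elo-style updates, the scheme commonly used in LM Arena-style rankings. The model names, the energy-per-token constants, and the rating scheme are all illustrative assumptions; the GEA paper may estimate energy and aggregate votes differently.

```python
from dataclasses import dataclass

# Hypothetical per-output-token energy estimates in joules.
# These are illustrative placeholders, not measured values from the paper.
ENERGY_PER_TOKEN_J = {"small-model": 0.3, "large-model": 3.0}

@dataclass
class ArenaModel:
    name: str
    rating: float = 1000.0  # Elo-style rating, arbitrary starting point

def estimate_energy_joules(model: str, output_tokens: int) -> float:
    """Rough energy estimate: tokens generated x assumed energy per token."""
    return ENERGY_PER_TOKEN_J[model] * output_tokens

def elo_update(winner: ArenaModel, loser: ArenaModel, k: float = 32.0) -> None:
    """Standard Elo update from a single pairwise preference."""
    expected_win = 1.0 / (1.0 + 10 ** ((loser.rating - winner.rating) / 400))
    winner.rating += k * (1.0 - expected_win)
    loser.rating -= k * (1.0 - expected_win)

def record_vote(models: dict[str, ArenaModel], preferred: str, other: str,
                tokens: dict[str, int]) -> None:
    # In an energy-aware arena, these estimates would be displayed to the
    # user alongside (or after) the two responses, before the vote is cast.
    for name in (preferred, other):
        print(f"{name}: ~{estimate_energy_joules(name, tokens[name]):.1f} J")
    elo_update(models[preferred], models[other])

# Usage example: one vote in favor of the smaller, more efficient model.
models = {name: ArenaModel(name) for name in ENERGY_PER_TOKEN_J}
record_vote(models, preferred="small-model", other="large-model",
            tokens={"small-model": 180, "large-model": 220})
print({m.name: round(m.rating, 1) for m in models.values()})
```

Elo is chosen here only because it is the conventional aggregation method for public arenas; any pairwise-preference model (e.g., Bradley-Terry) would fit the same vote-recording interface.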