The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations

Abstract

Evaluating large language models (LLMs) is a complex task for which several approaches have been proposed. The most common is the use of automated benchmarks in which LLMs answer multiple-choice questions on different topics. However, this method has limitations, the most concerning being its poor correlation with human judgments. An alternative is to have humans evaluate the LLMs. This poses scalability issues: the large and growing number of models makes it impractical (and costly) to run traditional studies that recruit evaluators and have them rank the models' responses. Another alternative is the use of public arenas, such as the popular LM arena, in which any user can freely evaluate models on any question and rank the responses of two models. The results are then compiled into a model ranking. An increasingly important aspect of LLMs is their energy consumption; it is therefore of interest to evaluate how energy awareness influences the decisions humans make when selecting a model. In this paper, we present GEA, the Generative Energy Arena, an arena that incorporates information on the energy consumption of the models into the evaluation process. We also present preliminary results obtained with GEA, which show that for most questions, users who are aware of the energy consumption favor smaller and more energy-efficient models. This suggests that for most user interactions, the extra cost and energy incurred by the more complex, top-performing models do not yield an increase in perceived response quality that justifies their use.


