
HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models

September 24, 2024
作者: Haoran Que, Feiyu Duan, Liqun He, Yutao Mou, Wangchunshu Zhou, Jiaheng Liu, Wenge Rong, Zekun Moore Wang, Jian Yang, Ge Zhang, Junran Peng, Zhaoxiang Zhang, Songyang Zhang, Kai Chen
cs.AI

Abstract

In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks (e.g., long-context understanding), and many benchmarks have been proposed. However, we observe that long text generation capabilities are not well investigated. Therefore, we introduce the Hierarchical Long Text Generation Benchmark (HelloBench), a comprehensive, in-the-wild, and open-ended benchmark for evaluating LLMs' performance in generating long text. Based on Bloom's Taxonomy, HelloBench categorizes long text generation tasks into five subtasks: open-ended QA, summarization, chat, text completion, and heuristic text generation. In addition, we propose Hierarchical Long Text Evaluation (HelloEval), a human-aligned evaluation method that significantly reduces the time and effort required for human evaluation while maintaining a high correlation with it. We conducted extensive experiments on around 30 mainstream LLMs and observed that current LLMs lack long text generation capabilities. Specifically, first, regardless of whether the instructions include explicit or implicit length constraints, most LLMs cannot generate text longer than 4,000 words. Second, while some LLMs can generate longer text, many issues arise (e.g., severe repetition and quality degradation). Third, to demonstrate the effectiveness of HelloEval, we compare it with traditional metrics (e.g., ROUGE, BLEU) and LLM-as-a-Judge methods; the results show that HelloEval has the highest correlation with human evaluation. We release our code at https://github.com/Quehry/HelloBench.
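To make the abstract's two headline analyses concrete, here is a minimal, hypothetical Python sketch (not the authors' HelloEval implementation): it measures what fraction of generations satisfy an explicit word-count constraint, and computes the Spearman correlation between an automatic metric's scores and human judgments, the kind of comparison used above to rank HelloEval against ROUGE/BLEU-style metrics. The function names and the `scipy` dependency are illustrative assumptions.

```python
# Hypothetical sketch of two analyses described in the abstract:
# (1) length-constraint compliance, (2) metric-vs-human correlation.
# Not the paper's actual HelloEval code; names are illustrative.

from scipy.stats import spearmanr


def word_count(text: str) -> int:
    """Approximate word count via whitespace splitting."""
    return len(text.split())


def length_compliance(outputs: list[str], min_words: int = 4000) -> float:
    """Fraction of generations meeting an explicit length constraint."""
    return sum(word_count(t) >= min_words for t in outputs) / len(outputs)


def metric_human_correlation(metric_scores: list[float],
                             human_scores: list[float]) -> float:
    """Spearman correlation between an automatic metric and human judgments."""
    rho, _ = spearmanr(metric_scores, human_scores)
    return rho


# Toy usage: one output clears the 4,000-word bar, one does not.
outputs = ["word " * 4200, "word " * 900]
print(length_compliance(outputs))                             # 0.5
print(metric_human_correlation([0.2, 0.8, 0.5], [1, 5, 3]))   # 1.0
```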
