HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models

Abstract

In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks (e.g., long-context understanding), and many benchmarks have been proposed. However, we observe that long text generation capabilities have not been well investigated. Therefore, we introduce the Hierarchical Long Text Generation Benchmark (HelloBench), a comprehensive, in-the-wild, and open-ended benchmark for evaluating LLMs' performance in generating long text. Based on Bloom's Taxonomy, HelloBench categorizes long text generation tasks into five subtasks: open-ended QA, summarization, chat, text completion, and heuristic text generation. In addition, we propose Hierarchical Long Text Evaluation (HelloEval), a human-aligned evaluation method that significantly reduces the time and effort required for human evaluation while maintaining a high correlation with it. We have conducted extensive experiments across around 30 mainstream LLMs and observe that current LLMs lack long text generation capabilities. First, regardless of whether the instructions include explicit or implicit length constraints, most LLMs cannot generate text longer than 4000 words. Second, while some LLMs can generate longer text, many issues remain (e.g., severe repetition and quality degradation). Finally, to demonstrate the effectiveness of HelloEval, we compare it with traditional metrics (e.g., ROUGE and BLEU) and LLM-as-a-Judge methods; the comparison shows that HelloEval has the highest correlation with human evaluation. We release our code at https://github.com/Quehry/HelloBench.
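As a rough illustration of what "correlation with human evaluation" can mean in practice, the sketch below ranks two automatic metrics by their Spearman rank correlation with human scores. The abstract does not say which correlation coefficient is used, so Spearman is an assumption here; the function names and the toy scores are hypothetical and are not taken from the HelloBench code or results.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions in the tie group
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    # Made-up scores for five responses, for illustration only.
    human = [1, 2, 3, 4, 5]
    metric_a = [0.1, 0.2, 0.3, 0.4, 0.5]  # orders responses exactly as humans do
    metric_b = [0.5, 0.4, 0.3, 0.2, 0.1]  # orders responses in reverse
    for name, scores in [("metric_a", metric_a), ("metric_b", metric_b)]:
        print(name, round(spearman(human, scores), 3))
```

Under this reading, an evaluation method is better aligned with humans when its coefficient is closer to 1; a real comparison would use many more samples and report significance.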
