How to Get Your LLM to Generate Challenging Problems for Evaluation

Abstract

The pace of evolution of Large Language Models (LLMs) necessitates new approaches for rigorous and comprehensive evaluation. Traditional human annotation is increasingly impracticable due to the complexities and costs involved in generating high-quality, challenging problems. In this work, we introduce CHASE, a unified framework to synthetically generate challenging problems using LLMs without human involvement. For a given task, our approach builds a hard problem in a bottom-up manner from simpler components. Moreover, our framework decomposes the generation process into independently verifiable sub-tasks, thereby ensuring a high level of quality and correctness. We implement CHASE to create evaluation benchmarks across three diverse domains: (1) document-based question answering, (2) repository-level code completion, and (3) math reasoning. The performance of state-of-the-art LLMs on these synthetic benchmarks lies in the range of 40-60% accuracy, thereby demonstrating the effectiveness of our framework at generating challenging problems. We publicly release our benchmarks and code.
