How to Get Your LLM to Generate Challenging Problems for Evaluation

February 20, 2025
Authors: Arkil Patel, Siva Reddy, Dzmitry Bahdanau
cs.AI

Abstract

The pace of evolution of Large Language Models (LLMs) necessitates new approaches for rigorous and comprehensive evaluation. Traditional human annotation is increasingly impracticable due to the complexities and costs involved in generating high-quality, challenging problems. In this work, we introduce CHASE, a unified framework to synthetically generate challenging problems using LLMs without human involvement. For a given task, our approach builds a hard problem in a bottom-up manner from simpler components. Moreover, our framework decomposes the generation process into independently verifiable sub-tasks, thereby ensuring a high level of quality and correctness. We implement CHASE to create evaluation benchmarks across three diverse domains: (1) document-based question answering, (2) repository-level code completion, and (3) math reasoning. The performance of state-of-the-art LLMs on these synthetic benchmarks lies in the range of 40-60% accuracy, thereby demonstrating the effectiveness of our framework at generating challenging problems. We publicly release our benchmarks and code.
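The abstract describes CHASE only at a high level: simple components are generated first, each generation step is verified independently, and the verified pieces are composed bottom-up into a harder problem. The sketch below illustrates that control flow under stated assumptions; every function name here (`call_llm`, `generate_component`, `compose_hard_problem`, and so on) is a hypothetical illustration and is not taken from the authors' released CHASE code.

```python
# Minimal sketch of the bottom-up, generate-then-verify idea described in the
# abstract. All names, prompts, and the call_llm stub are hypothetical.
from dataclasses import dataclass

@dataclass
class Component:
    text: str    # a simple fact, snippet, or sub-problem
    answer: str  # its independently checkable solution

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion API; swap in a real client."""
    raise NotImplementedError

def generate_component(seed: str) -> Component:
    # Sub-task 1: create one simple, self-contained component.
    text = call_llm(f"Write a simple sub-problem about: {seed}")
    answer = call_llm(f"Solve this sub-problem and give only the answer:\n{text}")
    return Component(text, answer)

def verify_component(c: Component) -> bool:
    # Sub-task 2: verification is kept separate so each piece can be checked
    # on its own (e.g., by another model or an executable test).
    verdict = call_llm(
        f"Problem:\n{c.text}\nProposed answer: {c.answer}\nIs it correct? yes/no"
    )
    return verdict.strip().lower().startswith("yes")

def compose_hard_problem(components: list[Component]) -> str:
    # Sub-task 3: stitch verified simple components into one harder problem
    # whose solution depends on all of them (bottom-up construction).
    body = "\n".join(c.text for c in components)
    return call_llm(f"Combine these sub-problems into one multi-step problem:\n{body}")

def chase_like_pipeline(seeds: list[str]) -> str:
    verified = [c for c in map(generate_component, seeds) if verify_component(c)]
    return compose_hard_problem(verified)
```

The point of the decomposition is that each sub-task produces an artifact small enough to check in isolation, so correctness of the final hard problem does not rest on a single unverifiable generation step.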
