InfoSynth: Information-Guided Benchmark Synthesis for LLMs

Abstract

Large language models (LLMs) have demonstrated significant advances in reasoning and code generation. However, efficiently creating new benchmarks to evaluate these capabilities remains a challenge. Traditional benchmark creation relies on manual human effort, which is both expensive and time-consuming. Furthermore, existing benchmarks often leak into LLM training data, so novel and diverse benchmarks are needed to assess the models' genuine capabilities accurately. This work introduces InfoSynth, a framework for automatically generating and evaluating reasoning benchmarks guided by information-theoretic principles. We propose metrics based on KL-divergence and entropy that quantify benchmark novelty and diversity without relying on costly model evaluations. Building on this framework, we develop an end-to-end pipeline that synthesizes robust Python coding problems from seed datasets using genetic algorithms and iterative code feedback. Our method generates accurate test cases and solutions to new problems 97% of the time, and the synthesized benchmarks consistently exhibit higher novelty and diversity than their seed datasets. Moreover, our algorithm provides a way to control the novelty, diversity, and difficulty of the generated problems. InfoSynth offers a scalable, self-verifying pipeline for constructing high-quality, novel, and diverse benchmarks for LLMs. Project Page: https://ishirgarg.github.io/infosynth_web/
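
The abstract does not say how the KL-divergence and entropy metrics are estimated, so the following is only a rough illustration of the underlying idea and not InfoSynth's actual method. It assumes each problem carries a hypothetical topic label and uses smoothed histograms as a stand-in: novelty is the KL-divergence of the new benchmark's topic distribution from the seed's, and diversity is the Shannon entropy of a benchmark's topic distribution. The function names, labels, and smoothing constant are all assumptions made for this sketch.

```python
import math
from collections import Counter


def distribution(labels, support, alpha=1.0):
    """Laplace-smoothed categorical distribution over `support` (illustrative)."""
    counts = Counter(labels)
    total = len(labels) + alpha * len(support)
    return {s: (counts[s] + alpha) / total for s in support}


def kl_divergence(p, q):
    """KL(p || q) in nats; q must be positive wherever p is positive."""
    return sum(p[s] * math.log(p[s] / q[s]) for s in p if p[s] > 0)


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(v * math.log(v) for v in p.values() if v > 0)


# Hypothetical topic labels for a seed dataset and a synthesized benchmark.
seed_topics = ["strings", "strings", "lists", "lists", "math"]
new_topics = ["graphs", "dp", "strings", "math", "graphs", "lists"]

support = sorted(set(seed_topics) | set(new_topics))
p_seed = distribution(seed_topics, support)
p_new = distribution(new_topics, support)

novelty = kl_divergence(p_new, p_seed)  # how far the new set departs from the seed
diversity_seed = entropy(p_seed)        # spread of the seed set over topics
diversity_new = entropy(p_new)          # spread of the new set over topics

print(f"novelty (KL new || seed): {novelty:.4f}")
print(f"diversity seed: {diversity_seed:.4f}, new: {diversity_new:.4f}")
```

In this toy setting, a larger KL value means the synthesized problems cover topics that the seed set rarely touches, and a larger entropy means the problems are spread more evenly across topics. Neither quantity requires running a model on the problems.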
