InfoSynth: Information-Guided Benchmark Synthesis for LLMs

Abstract

Large language models (LLMs) have demonstrated significant advancements in reasoning and code generation. However, efficiently creating new benchmarks to evaluate these capabilities remains a challenge. Traditional benchmark creation relies on manual human effort, a process that is both expensive and time-consuming. Furthermore, existing benchmarks often contaminate LLM training data, necessitating novel and diverse benchmarks to accurately assess their genuine capabilities. This work introduces InfoSynth, a novel framework for automatically generating and evaluating reasoning benchmarks guided by information-theoretic principles. We propose metrics based on KL-divergence and entropy to quantify benchmark novelty and diversity without relying on costly model evaluations. Building on this framework, we develop an end-to-end pipeline that synthesizes robust Python coding problems from seed datasets using genetic algorithms and iterative code feedback. Our method generates accurate test cases and solutions to new problems 97% of the time, and the synthesized benchmarks consistently exhibit higher novelty and diversity compared to their seed datasets. Moreover, our algorithm provides a method for controlling the novelty/diversity and difficulty of generated problems. InfoSynth offers a scalable, self-verifying pipeline for constructing high-quality, novel and diverse benchmarks for LLMs. Project Page: https://ishirgarg.github.io/infosynth_web/
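
To make the novelty and diversity metrics concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not describe the estimators InfoSynth uses, so this example assumes each problem has already been assigned a discrete topic label (the labels, the `distribution` helper, and the add-alpha smoothing are all hypothetical choices for illustration). Under that assumption, diversity can be read as the Shannon entropy of a benchmark's topic distribution, and novelty as the KL-divergence of the synthesized distribution from the seed distribution.

```python
import math
from collections import Counter


def distribution(labels, support, alpha=1.0):
    """Add-alpha smoothed categorical distribution over `support`.

    Smoothing (an assumption of this sketch) keeps every probability
    positive so that the KL-divergence below stays finite.
    """
    counts = Counter(labels)
    total = len(labels) + alpha * len(support)
    return {k: (counts[k] + alpha) / total for k in support}


def entropy(p):
    """Shannon entropy in nats; higher means more evenly spread topics."""
    return -sum(v * math.log(v) for v in p.values() if v > 0)


def kl_divergence(p, q):
    """KL(p || q) in nats; requires q[k] > 0 wherever p[k] > 0."""
    return sum(v * math.log(v / q[k]) for k, v in p.items() if v > 0)


# Hypothetical topic labels for a seed set and a synthesized set.
seed = ["strings"] * 6 + ["math"] * 4
synth = ["strings"] * 3 + ["math"] * 3 + ["graphs"] * 2 + ["dp"] * 2
support = sorted(set(seed) | set(synth))

p_seed = distribution(seed, support)
p_synth = distribution(synth, support)

novelty = kl_divergence(p_synth, p_seed)  # how far synth departs from seed
diversity_seed = entropy(p_seed)
diversity_synth = entropy(p_synth)

print(f"novelty (KL synth || seed): {novelty:.3f} nats")
print(f"diversity seed:  {diversity_seed:.3f} nats")
print(f"diversity synth: {diversity_synth:.3f} nats")
```

In this toy data the synthesized set covers two topics that are absent from the seed set, so its entropy is higher and its KL-divergence from the seed is positive. Neither number requires running a model on the problems, which is the property the abstract highlights.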
