InfoSynth: Information-Guided Benchmark Synthesis for LLMs
January 2, 2026
Authors: Ishir Garg, Neel Kolhe, Xuandong Zhao, Dawn Song
cs.AI
Abstract
Large language models (LLMs) have demonstrated significant advances in reasoning and code generation. However, efficiently creating new benchmarks to evaluate these capabilities remains a challenge. Traditional benchmark creation relies on manual human effort, a process that is both expensive and time-consuming. Furthermore, existing benchmarks often leak into LLM training data, so novel and diverse benchmarks are needed to assess models' genuine capabilities. This work introduces InfoSynth, a framework for automatically generating and evaluating reasoning benchmarks guided by information-theoretic principles. We propose metrics based on KL-divergence and entropy that quantify benchmark novelty and diversity without relying on costly model evaluations. Building on this framework, we develop an end-to-end pipeline that synthesizes robust Python coding problems from seed datasets using genetic algorithms and iterative code feedback. Our method generates accurate test cases and solutions for new problems 97% of the time, and the synthesized benchmarks consistently exhibit higher novelty and diversity than their seed datasets. Moreover, our algorithm provides a way to control the novelty, diversity, and difficulty of the generated problems. InfoSynth offers a scalable, self-verifying pipeline for constructing high-quality, novel, and diverse benchmarks for LLMs. Project Page: https://ishirgarg.github.io/infosynth_web/
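To make the idea of KL-divergence and entropy as benchmark metrics more concrete, the following is a minimal sketch and not the authors' implementation. It assumes each problem has already been mapped to a fixed-length embedding vector, and it fits a diagonal Gaussian to each set of embeddings so that both quantities have closed forms. The function names (`fit_diag_gaussian`, `gaussian_entropy`, `gaussian_kl`) and the Gaussian assumption are illustrative choices only; the abstract does not specify which estimators the paper uses.

```python
import numpy as np


def fit_diag_gaussian(x, eps=1e-6):
    """Return per-dimension mean and variance of an (n, d) embedding array.

    Hypothetical helper: eps keeps variances strictly positive.
    """
    x = np.asarray(x, dtype=float)
    return x.mean(axis=0), x.var(axis=0) + eps


def gaussian_entropy(x):
    """Differential entropy (in nats) of a diagonal Gaussian fitted to x.

    Used here as a stand-in for a diversity score: more spread-out
    embeddings give a larger value.
    """
    _, var = fit_diag_gaussian(x)
    return 0.5 * np.sum(np.log(2.0 * np.pi * np.e * var))


def gaussian_kl(new, seed):
    """KL(N_new || N_seed) between diagonal Gaussians fitted to each set.

    Used here as a stand-in for a novelty score: it is zero when both
    sets have identical fitted distributions and grows as they differ.
    """
    mu_p, var_p = fit_diag_gaussian(new)
    mu_q, var_q = fit_diag_gaussian(seed)
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    )


if __name__ == "__main__":
    # Synthetic stand-ins for problem embeddings (assumption: 8 dimensions).
    rng = np.random.default_rng(0)
    seed_emb = rng.normal(0.0, 1.0, size=(500, 8))
    new_emb = rng.normal(2.0, 1.5, size=(500, 8))

    print("novelty  (KL new || seed):", gaussian_kl(new_emb, seed_emb))
    print("diversity (entropy, seed):", gaussian_entropy(seed_emb))
    print("diversity (entropy, new): ", gaussian_entropy(new_emb))
```

Both scores depend only on the embeddings, which is consistent with the abstract's point that novelty and diversity can be measured without running model evaluations. A Gaussian fit is a coarse approximation, and a nonparametric estimator may be preferable for real embedding distributions.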