InfoSynth: Information-Guided Benchmark Synthesis for LLMs

Abstract

Large language models (LLMs) have demonstrated significant advancements in reasoning and code generation. However, efficiently creating new benchmarks to evaluate these capabilities remains a challenge. Traditional benchmark creation relies on manual human effort, a process that is both expensive and time-consuming. Furthermore, existing benchmarks often contaminate LLM training data, necessitating novel and diverse benchmarks to accurately assess their genuine capabilities. This work introduces InfoSynth, a novel framework for automatically generating and evaluating reasoning benchmarks guided by information-theoretic principles. We propose metrics based on KL-divergence and entropy to quantify benchmark novelty and diversity without relying on costly model evaluations. Building on this framework, we develop an end-to-end pipeline that synthesizes robust Python coding problems from seed datasets using genetic algorithms and iterative code feedback. Our method generates accurate test cases and solutions to new problems 97% of the time, and the synthesized benchmarks consistently exhibit higher novelty and diversity compared to their seed datasets. Moreover, our algorithm provides a method for controlling the novelty/diversity and difficulty of generated problems. InfoSynth offers a scalable, self-verifying pipeline for constructing high-quality, novel and diverse benchmarks for LLMs. Project Page: https://ishirgarg.github.io/infosynth_web/
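The abstract does not say how the KL-divergence and entropy metrics are estimated, so the following is only a minimal sketch of the underlying idea, not the authors' method. It assumes each problem has been reduced to a hypothetical discrete topic label; the function names, the smoothing constant `alpha`, and the toy data are all illustrative assumptions. Entropy of the label distribution stands in for diversity, and the KL-divergence of the new distribution from the seed distribution stands in for novelty.

```python
import math
from collections import Counter


def distribution(labels, support, alpha=1.0):
    """Empirical distribution over a fixed support with additive smoothing.

    Smoothing (alpha > 0) keeps every probability positive, so the
    KL-divergence below is always finite.
    """
    counts = Counter(labels)
    total = len(labels) + alpha * len(support)
    return {k: (counts[k] + alpha) / total for k in support}


def entropy(p):
    """Shannon entropy in nats: H(p) = -sum_x p(x) log p(x)."""
    return -sum(v * math.log(v) for v in p.values() if v > 0)


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(v * math.log(v / q[k]) for k, v in p.items() if v > 0)


# Hypothetical topic labels for a seed dataset and a synthesized one.
seed_topics = ["strings", "strings", "arrays", "arrays"]
new_topics = ["strings", "arrays", "graphs", "dp"]

support = sorted(set(seed_topics) | set(new_topics))
p_seed = distribution(seed_topics, support)
p_new = distribution(new_topics, support)

diversity_seed = entropy(p_seed)        # lower: mass concentrated on two topics
diversity_new = entropy(p_new)          # higher: mass spread over four topics
novelty = kl_divergence(p_new, p_seed)  # how far the new set departs from the seed

print(f"diversity (seed): {diversity_seed:.3f}")
print(f"diversity (new):  {diversity_new:.3f}")
print(f"novelty KL(new || seed): {novelty:.3f}")
```

In this toy setup the synthesized set scores higher on entropy than the seed set and has a positive KL-divergence from it, which is the qualitative behavior the abstract reports for the synthesized benchmarks. Metrics of this kind need only the problem representations, not model evaluations.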
