CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive Program Synthesis

March 29, 2025
Authors: Anjiang Wei, Tarun Suresh, Jiannan Cao, Naveen Kannan, Yuheng Wu, Kai Yan, Thiago S. F. X. Teixeira, Ke Wang, Alex Aiken
cs.AI

Abstract

Inductive program synthesis, or programming by example, requires synthesizing functions from input-output examples that generalize to unseen inputs. While large language model agents have shown promise in programming tasks guided by natural language, their ability to perform inductive program synthesis is underexplored. Existing evaluation protocols rely on static sets of examples and held-out tests, offering no feedback when synthesized functions are incorrect and failing to reflect real-world scenarios such as reverse engineering. We propose CodeARC, the Code Abstraction and Reasoning Challenge, a new evaluation framework where agents interact with a hidden target function by querying it with new inputs, synthesizing candidate functions, and iteratively refining their solutions using a differential testing oracle. This interactive setting encourages agents to perform function calls and self-correction based on feedback. We construct the first large-scale benchmark for general-purpose inductive program synthesis, featuring 1114 functions. Among 18 models evaluated, o3-mini performs best with a success rate of 52.7%, highlighting the difficulty of this task. Fine-tuning LLaMA-3.1-8B-Instruct on curated synthesis traces yields up to a 31% relative performance gain. CodeARC provides a more realistic and challenging testbed for evaluating LLM-based program synthesis and inductive reasoning.
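The interactive protocol described in the abstract can be pictured as a simple loop: the agent queries the hidden target function on inputs of its choosing, synthesizes a candidate, and a differential testing oracle either accepts the candidate or returns a counterexample that feeds the next refinement round. The sketch below is a minimal, hypothetical illustration of that loop in Python; the names (`synthesis_loop`, `find_counterexample`, the toy agent) are illustrative assumptions, not CodeARC's actual API, and a real agent would be an LLM generating code rather than the enumeration stub used here.

```python
# Hypothetical sketch of the interactive synthesis protocol described in the
# abstract. All names and signatures are illustrative, not CodeARC's real API.

from typing import Callable, List, Optional, Tuple

Observation = Tuple[int, int]  # (input, output) pair observed from the hidden function


def find_counterexample(candidate: Callable[[int], int],
                        hidden_fn: Callable[[int], int],
                        probe_inputs: List[int]) -> Optional[Observation]:
    """Differential testing oracle: return an input where the two functions disagree."""
    for x in probe_inputs:
        if candidate(x) != hidden_fn(x):
            return (x, hidden_fn(x))
    return None


def synthesis_loop(hidden_fn: Callable[[int], int],
                   choose_queries: Callable[[List[Observation]], List[int]],
                   synthesize: Callable[[List[Observation]], Callable[[int], int]],
                   probe_inputs: List[int],
                   max_rounds: int = 5) -> Optional[Callable[[int], int]]:
    """Iteratively query the hidden target, synthesize candidates, and refine on feedback."""
    observations: List[Observation] = []
    for _ in range(max_rounds):
        # 1. The agent actively queries the hidden function on inputs of its choosing.
        for x in choose_queries(observations):
            observations.append((x, hidden_fn(x)))
        # 2. The agent synthesizes a candidate consistent with what it has observed.
        candidate = synthesize(observations)
        # 3. The oracle searches for a disagreement; finding none counts as success.
        cex = find_counterexample(candidate, hidden_fn, probe_inputs)
        if cex is None:
            return candidate
        observations.append(cex)  # the counterexample drives the next refinement round
    return None


if __name__ == "__main__":
    # Toy demonstration: the "agent" picks the first consistent guess from a fixed
    # pool; a real agent would be an LLM generating code from the observations.
    hidden = lambda x: 2 * x + 1
    pool = [lambda x: x + 1, lambda x: 2 * x, lambda x: 2 * x + 1]

    def synthesize(obs):
        return next((f for f in pool if all(f(i) == o for i, o in obs)), pool[0])

    result = synthesis_loop(hidden,
                            choose_queries=lambda obs: [len(obs)],  # probe 0, 1, 2, ...
                            synthesize=synthesize,
                            probe_inputs=list(range(-10, 10)))
    print("synthesized:", result is not None)
```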
