A2RBench: An Automatic Paradigm for Formally Verifiable Abstract Reasoning Benchmark Generation

Abstract

Abstract reasoning, the ability to extract and apply abstract rules, reflects the intelligence and generalization capacity of large language models (LLMs). Measuring this ability accurately remains challenging: existing benchmarks either rely on expensive manual annotation, which limits their scale, or risk measuring memorization rather than genuine reasoning. To address this, we introduce A2RBench, an automated pipeline with four stages: generation, expansion, evaluation, and analysis. In the generation stage, LLMs create diverse tasks that demand genuine reasoning. In the expansion stage, LLMs reuse validated rules and extend them to new input spaces to generate task variants, which allows the benchmark to scale. This process can, however, introduce hallucinations. To eliminate them, we establish a theoretical framework and prove that programmatic verification, which tests whether the inverse operation perfectly reverses the forward operation (cycle consistency), guarantees a unique solution. Through extensive evaluation of mainstream LLMs, we find: (1) Current LLMs exhibit fundamental deficiencies in abstract reasoning, with top models significantly underperforming humans on a representative subset (39.8% vs. 68.5%). (2) The 3D tasks that current LLMs generate are far less complex than the 2D and 1D tasks they generate, revealing a limited understanding of higher-dimensional tasks. (3) Counterintuitively, inputs with higher information complexity can simplify the reasoning process.
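
As an illustration of the cycle-consistency check described above, the following is a minimal sketch, not the pipeline's actual verifier. The grid rules (`rotate_cw`, `rotate_ccw`, `sort_rows`), the `SAMPLES` inputs, and the helper `is_cycle_consistent` are all hypothetical names chosen for this example. The sketch also tests only a finite set of sample inputs, so it can reject a bad rule pair but does not by itself establish the uniqueness guarantee that the abstract attributes to the theoretical framework.

```python
from typing import Callable, Iterable, List

Grid = List[List[int]]


def rotate_cw(grid: Grid) -> Grid:
    """Hypothetical forward rule: rotate a 2D grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def rotate_ccw(grid: Grid) -> Grid:
    """Hypothetical inverse rule: rotate a 2D grid 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*grid)][::-1]


def sort_rows(grid: Grid) -> Grid:
    """Hypothetical lossy rule: sorting discards the original order of each row."""
    return [sorted(row) for row in grid]


def is_cycle_consistent(
    forward: Callable[[Grid], Grid],
    inverse: Callable[[Grid], Grid],
    inputs: Iterable[Grid],
) -> bool:
    """Return True if inverse(forward(x)) == x for every sampled input x."""
    return all(inverse(forward(x)) == x for x in inputs)


# Assumed sample inputs; a real verifier would draw these from the task's input space.
SAMPLES: List[Grid] = [
    [[1, 2], [3, 4]],
    [[2, 1], [0, 3]],
    [[5, 0, 7]],
]

# A rule with a true inverse passes the check on all samples.
rotation_ok = is_cycle_consistent(rotate_cw, rotate_ccw, SAMPLES)

# A lossy rule paired with a claimed "inverse" (here, the identity) fails it.
sorting_ok = is_cycle_consistent(sort_rows, lambda g: g, SAMPLES)
```

In this sketch, a generated rule pair would be kept only if the check passes; the rotation pair round-trips every sample, while the sorting rule is rejected because the second sample cannot be recovered from its sorted form.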