A2RBench: An Automatic Paradigm for Formally Verifiable Abstract Reasoning Benchmark Generation

Abstract

Abstract reasoning ability reflects the intelligence and generalization capacity of large language models (LLMs): their ability to extract abstract rules and apply them. However, measuring this ability accurately remains challenging. Existing benchmarks either rely on expensive manual annotation, which limits their scale, or risk measuring memorization rather than genuine reasoning. To address this, we introduce A2RBench, an automated pipeline encompassing generation, expansion, evaluation, and analysis. In the generation stage, LLMs create diverse tasks that demand genuine reasoning; in the expansion stage, LLMs reuse validated rules and extend them to new input spaces to generate task variations, which allows the benchmark to scale. Such a process, however, may introduce hallucinations. To eliminate them, we further establish a theoretical framework and prove that programmatic verification, which tests whether the inverse operation perfectly reverses the forward operation (cycle consistency), guarantees a unique solution. Through extensive evaluation of mainstream LLMs, we find: (1) Current LLMs exhibit fundamental deficiencies in abstract reasoning, with even the top models significantly underperforming humans on a representative subset (39.8% vs. 68.5%). (2) The 3D tasks that current LLMs generate are far less complex than the 2D and 1D tasks they generate, revealing a limited understanding of higher-dimensional tasks. (3) Counterintuitively, inputs with higher information complexity can simplify the reasoning process.
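The cycle-consistency check mentioned in the abstract can be illustrated with a minimal sketch. Everything named here (`rotate_cw`, `rotate_ccw`, `sort_rows`, `identity`, `is_cycle_consistent`, and the sample grids) is a hypothetical stand-in chosen for illustration; none of it is taken from the A2RBench pipeline itself. The sketch only shows the general idea: a forward rule passes when its claimed inverse recovers every tested input exactly, and a lossy rule fails.

```python
from typing import Callable, Iterable, List

Grid = List[List[int]]


def rotate_cw(grid: Grid) -> Grid:
    """Hypothetical forward rule: rotate a 2D grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def rotate_ccw(grid: Grid) -> Grid:
    """Claimed inverse of rotate_cw: rotate 90 degrees counterclockwise."""
    return [list(row) for row in zip(*grid)][::-1]


def sort_rows(grid: Grid) -> Grid:
    """Hypothetical lossy rule: sorting discards the original order of each row."""
    return [sorted(row) for row in grid]


def identity(grid: Grid) -> Grid:
    """A (wrong) claimed inverse for sort_rows, used to show a failing check."""
    return [list(row) for row in grid]


def is_cycle_consistent(
    forward: Callable[[Grid], Grid],
    inverse: Callable[[Grid], Grid],
    inputs: Iterable[Grid],
) -> bool:
    """Return True if inverse(forward(x)) == x for every sampled input x."""
    return all(inverse(forward(x)) == x for x in inputs)


# Assumed sample inputs; a real pipeline would draw these from its input space.
SAMPLES: List[Grid] = [
    [[1, 2], [3, 4]],
    [[1, 2, 3]],
    [[5, 0], [0, 5], [7, 7]],
]

rotation_ok = is_cycle_consistent(rotate_cw, rotate_ccw, SAMPLES)
sorting_ok = is_cycle_consistent(sort_rows, identity, SAMPLES)
print(rotation_ok, sorting_ok)
```

One elementary reason such a check is informative: if the inverse recovers every input in a set, the forward rule cannot map two different inputs in that set to the same output, so each output produced there corresponds to exactly one input. Whether this matches the formal statement proved in the paper is an assumption; the abstract states only that the verification guarantees a unique solution.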