ARB: Advanced Reasoning Benchmark for Large Language Models

Abstract

Large Language Models (LLMs) have demonstrated remarkable performance on various quantitative reasoning and knowledge benchmarks. However, many of these benchmarks are losing utility as LLMs achieve increasingly high scores, even though the models have not yet reached expert-level performance in these domains. We introduce ARB, a novel benchmark composed of advanced reasoning problems in multiple fields. ARB presents a more challenging test than prior benchmarks, featuring problems in mathematics, physics, biology, chemistry, and law. As a subset of ARB, we introduce a challenging set of math and physics problems that require advanced symbolic reasoning and domain knowledge. We evaluate recent models such as GPT-4 and Claude on ARB and demonstrate that current models score well below 50% on the more demanding tasks. To improve both automatic and assisted evaluation capabilities, we introduce a rubric-based evaluation approach that allows GPT-4 to score its own intermediate reasoning steps. Further, we conduct a human evaluation of the symbolic subset of ARB, finding promising agreement between annotators and GPT-4 rubric evaluation scores.
