Interactive Benchmarks

Abstract

Standard benchmarks have become increasingly unreliable due to saturation, subjectivity, and poor generalization. We argue that evaluating a model's ability to actively acquire information is important for assessing its intelligence. We propose Interactive Benchmarks, a unified evaluation paradigm that assesses a model's reasoning ability through an interactive process under budget constraints. We instantiate this framework in two settings: Interactive Proofs, where models interact with a judge to deduce objective truths or answers in logic and mathematics; and Interactive Games, where models reason strategically to maximize long-horizon utility. Our results show that interactive benchmarks provide a robust and faithful assessment of model intelligence and reveal that there is still substantial room for improvement in interactive scenarios. Project page: https://github.com/interactivebench/interactivebench
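
To make the idea of budget-constrained interaction concrete, the following is a minimal toy sketch, not the benchmark's actual protocol or code. It assumes a hypothetical `Judge` that holds a hidden integer and answers yes/no threshold questions, and it uses a simple binary-search policy as a stand-in for a model. The names `Judge`, `binary_search_policy`, and `evaluate` are illustrative assumptions and do not come from the project repository.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Judge:
    """Hypothetical judge: holds a hidden integer and answers yes/no queries."""
    secret: int
    queries_used: int = 0

    def ask(self, threshold: int) -> bool:
        """Answer 'is the secret <= threshold?' and count the query."""
        self.queries_used += 1
        return self.secret <= threshold


def binary_search_policy(judge: Judge, low: int, high: int, budget: int) -> Optional[int]:
    """Toy stand-in for a model: halve [low, high] until one candidate
    remains or the query budget is exhausted."""
    while low < high and judge.queries_used < budget:
        mid = (low + high) // 2
        if judge.ask(mid):
            high = mid
        else:
            low = mid + 1
    # Commit to an answer only when the interval has collapsed to one value.
    return low if low == high else None


def evaluate(secrets: Sequence[int], low: int, high: int, budget: int) -> float:
    """Fraction of hidden values identified correctly within the budget."""
    solved = 0
    for secret in secrets:
        judge = Judge(secret=secret)
        answer = binary_search_policy(judge, low, high, budget)
        solved += int(answer == secret)
    return solved / len(secrets)
```

In this toy setting, `evaluate(range(16), 0, 15, budget=4)` returns 1.0, while `budget=3` returns 0.0, because three halvings of sixteen candidates always leave two. The score therefore depends on how efficiently information is acquired within the budget, which is the property the paradigm is meant to measure.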