
Interactive Benchmarks

March 5, 2026
Authors: Baoqing Yue, Zihan Zhu, Yifan Zhang, Jichen Feng, Hufei Yang, Mengdi Wang
cs.AI

Abstract

Standard benchmarks have become increasingly unreliable due to saturation, subjectivity, and poor generalization. We argue that evaluating a model's ability to actively acquire information is essential to assessing its intelligence. We propose Interactive Benchmarks, a unified evaluation paradigm that assesses a model's reasoning ability through an interactive process under budget constraints. We instantiate this framework in two settings: Interactive Proofs, where models interact with a judge to deduce objective truths or answers in logic and mathematics; and Interactive Games, where models reason strategically to maximize long-horizon utilities. Our results show that interactive benchmarks provide a robust and faithful assessment of model intelligence, revealing that there is still substantial room for improvement in interactive scenarios. Project page: https://github.com/interactivebench/interactivebench
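As a rough illustration of the paradigm the abstract describes, the sketch below shows one interactive-proof episode in Python: the model queries a judge until it commits to an answer or exhausts a fixed query budget. All names here (`run_interactive_proof`, `model_step`, `judge_reply`, `check_answer`, `budget`) are hypothetical assumptions for illustration, not the paper's actual API.

```python
# Minimal sketch of an interactive-proof episode under a query budget.
# All identifiers are illustrative assumptions, not the paper's API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Episode:
    solved: bool        # did the model's final answer check out?
    queries_used: int   # how much of the budget was spent

def run_interactive_proof(
    model_step: Callable[[str], str],     # model: transcript -> next query or "ANSWER: ..."
    judge_reply: Callable[[str], str],    # judge: query -> verifiable feedback
    check_answer: Callable[[str], bool],  # oracle: is the final answer correct?
    budget: int,                          # maximum number of model turns
) -> Episode:
    """Let the model query the judge until it answers or exhausts its budget."""
    transcript = ""
    for used in range(1, budget + 1):
        move = model_step(transcript)
        if move.startswith("ANSWER:"):
            answer = move[len("ANSWER:"):].strip()
            return Episode(solved=check_answer(answer), queries_used=used)
        # Otherwise treat the move as a query and append the judge's reply.
        transcript += f"\nQ: {move}\nA: {judge_reply(move)}"
    # Budget exhausted without a committed answer counts as a failure.
    return Episode(solved=False, queries_used=budget)
```

Under this framing, a score rewards models that reach a verified answer with fewer judge interactions, which is one way to operationalize "reasoning under budget constraints."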