

Interactive Benchmarks

March 5, 2026
Authors: Baoqing Yue, Zihan Zhu, Yifan Zhang, Jichen Feng, Hufei Yang, Mengdi Wang
cs.AI

Abstract

Standard benchmarks have become increasingly unreliable due to saturation, subjectivity, and poor generalization. We argue that evaluating a model's ability to acquire information actively is essential to assessing its intelligence. We propose Interactive Benchmarks, a unified evaluation paradigm that assesses a model's reasoning ability through an interactive process under budget constraints. We instantiate this framework in two settings: Interactive Proofs, where models interact with a judge to deduce objective truths or answers in logic and mathematics; and Interactive Games, where models reason strategically to maximize long-horizon utilities. Our results show that interactive benchmarks provide a robust and faithful assessment of model intelligence, revealing that there is still substantial room for improvement in interactive scenarios. Project page: https://github.com/interactivebench/interactivebench
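
To make the budget-constrained interaction concrete, here is a minimal sketch of an interactive-proof-style evaluation loop: a model queries a judge that holds a hidden ground truth, spends at most a fixed query budget, and is then graded on its final answer. All names here (Judge, run_interactive_proof, BinarySearchModel) are hypothetical illustrations, not the paper's actual API; the real benchmark may use richer queries and scoring.

```python
# Hypothetical sketch of a query-budgeted interactive evaluation loop.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Judge:
    """Holds a hidden ground truth and answers yes/no queries about it."""
    secret: int

    def answer(self, query: Callable[[int], bool]) -> bool:
        return query(self.secret)

def run_interactive_proof(model, judge: Judge, budget: int) -> dict:
    """Let the model query the judge at most `budget` times, then grade its answer."""
    queries_used = 0
    while queries_used < budget:
        query = model.next_query()           # model proposes the next yes/no question
        if query is None:                     # model decides it has enough information
            break
        model.observe(judge.answer(query))    # judge's reply is fed back to the model
        queries_used += 1
    return {
        "correct": model.final_answer() == judge.secret,
        "queries_used": queries_used,
    }

class BinarySearchModel:
    """Toy 'model' that locates a hidden integer in [lo, hi) by binary search."""
    def __init__(self, lo: int = 0, hi: int = 100):
        self.lo, self.hi = lo, hi
        self._last_mid: Optional[int] = None

    def next_query(self):
        if self.hi - self.lo <= 1:
            return None
        self._last_mid = (self.lo + self.hi) // 2
        return lambda secret, m=self._last_mid: secret < m

    def observe(self, reply: bool):
        if reply:
            self.hi = self._last_mid
        else:
            self.lo = self._last_mid

    def final_answer(self) -> int:
        return self.lo

print(run_interactive_proof(BinarySearchModel(), Judge(secret=42), budget=10))
# -> {'correct': True, 'queries_used': 7}
```

In this toy setup the score depends both on whether the final answer is correct and on how much of the interaction budget was consumed, mirroring the paper's premise that active information acquisition under a budget is what is being measured.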