TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles

Abstract

As the application of Large Language Models (LLMs) expands, the demand for reliable evaluation grows. Existing LLM benchmarks rely primarily on static datasets, which makes it hard to assess model performance in dynamic interactions with users. Moreover, these benchmarks often depend on specific background knowledge, which complicates the measurement of a model's logical reasoning capabilities. Other dynamic evaluation methods, based on strong models or manual effort, may introduce biases and incur high costs in money and time, hindering large-scale application. To address these issues, we propose TurtleBench. TurtleBench collects real user guesses from an online Turtle Soup Puzzle platform that we developed. This approach allows evaluation datasets to be generated in a relatively dynamic way, mitigating the risk of model cheating while aligning assessments more closely with genuine user needs for reasoning capabilities, and thus making the evaluations more reliable. TurtleBench includes 1,532 user guesses, each annotated for correctness. Using this dataset, we thoroughly evaluated nine of the most advanced LLMs available today. Notably, the OpenAI o1 series models did not achieve leading results in these evaluations. We propose several hypotheses for further research, such as "the latent reasoning of o1 utilizes trivial Chain-of-Thought (CoT) techniques" and "increasing CoT length not only provides reasoning benefits but also incurs noise costs."
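The evaluation described above, in which a model judges each user guess and its verdict is compared with the annotated correctness, might be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's actual code: the `AnnotatedGuess` schema, the `accuracy` helper, the `always_incorrect` stand-in judge, and the three toy records are all hypothetical and do not come from the TurtleBench dataset.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class AnnotatedGuess:
    """One user guess with its annotated correctness (hypothetical schema)."""
    puzzle: str       # puzzle text shown to the player
    guess: str        # the player's guess about the hidden story
    is_correct: bool  # annotated label: True if the guess is correct


def accuracy(judge: Callable[[str, str], bool],
             data: Iterable[AnnotatedGuess]) -> float:
    """Fraction of guesses where the judge's verdict matches the annotation."""
    items = list(data)
    if not items:
        raise ValueError("no annotated guesses to evaluate")
    hits = sum(judge(g.puzzle, g.guess) == g.is_correct for g in items)
    return hits / len(items)


def always_incorrect(puzzle: str, guess: str) -> bool:
    """Trivial stand-in for a model call: rejects every guess."""
    return False


# Invented toy records, for illustration only (not TurtleBench data).
PUZZLE = "A man tastes turtle soup in a restaurant and bursts into tears."
SAMPLE = [
    AnnotatedGuess(PUZZLE, "The soup reminded him of something in his past.", True),
    AnnotatedGuess(PUZZLE, "The soup was poisoned.", False),
    AnnotatedGuess(PUZZLE, "He could not pay the bill.", False),
]

if __name__ == "__main__":
    print(f"baseline accuracy: {accuracy(always_incorrect, SAMPLE):.3f}")
```

In practice the `judge` argument would wrap a call to the LLM under evaluation. The constant judge here only serves as a baseline that makes the metric easy to check by hand.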
