Atom of Thoughts for Markov LLM Test-Time Scaling

Abstract

Large Language Models (LLMs) achieve superior performance through training-time scaling, and test-time scaling further enhances their capabilities by conducting effective reasoning during inference. However, as the scale of reasoning increases, existing test-time scaling methods suffer from accumulated historical information, which not only wastes computational resources but also interferes with effective reasoning. To address this issue, we observe that progress in complex reasoning is often achieved by solving a sequence of independent subquestions, each self-contained and verifiable. These subquestions are essentially atomic questions that rely primarily on their current state rather than on accumulated history, similar to the memoryless transitions of a Markov process. Based on this observation, we propose Atom of Thoughts (AoT), in which each state transition in the reasoning process consists of decomposing the current question into a dependency-based directed acyclic graph and contracting its subquestions to form a new atomic question state. This iterative decomposition-contraction process continues until it reaches directly solvable atomic questions, naturally realizing Markov transitions between question states. Furthermore, these atomic questions can be seamlessly integrated into existing test-time scaling methods, enabling AoT to serve as a plug-in enhancement for improving reasoning capabilities. Experiments across six benchmarks demonstrate the effectiveness of AoT both as a standalone framework and as a plug-in enhancement. Notably, on HotpotQA, when applied to gpt-4o-mini, AoT achieves an 80.6% F1 score, surpassing o3-mini by 3.4% and DeepSeek-R1 by 10.6%. The code will be available at https://github.com/qixucen/atom.
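The decomposition-contraction loop described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: it assumes the dependency graph is already given (in AoT an LLM would produce it), it replaces every LLM call with a plain Python function, and the names `SubQuestion`, `bind`, `transition`, and `solve_markov` are hypothetical. The point it tries to show is only the memoryless property: each new state carries the answers it needs folded into its remaining subquestions, and nothing else from the history.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

Answers = Dict[str, float]


@dataclass(frozen=True)
class SubQuestion:
    """Hypothetical node of the dependency DAG (an assumption, not the paper's API)."""
    deps: Tuple[str, ...]               # names of subquestions this one depends on
    solve: Callable[[Answers], float]   # stand-in for an LLM call; receives answers to deps


State = Dict[str, SubQuestion]          # a question state: name -> subquestion


def bind(q: SubQuestion, answers: Answers) -> SubQuestion:
    """Fold known answers into q so it no longer refers to the solved subquestions."""
    fixed = {d: answers[d] for d in q.deps if d in answers}
    remaining = tuple(d for d in q.deps if d not in answers)
    return SubQuestion(remaining, lambda env, f=q.solve, fixed=fixed: f({**fixed, **env}))


def transition(state: State) -> State:
    """One step: solve the independent subquestions, contract the dependent ones."""
    answers = {name: q.solve({}) for name, q in state.items() if not q.deps}
    # The returned state depends only on the current state, not on earlier ones.
    return {name: bind(q, answers) for name, q in state.items() if q.deps}


def solve_markov(state: State, target: str) -> Tuple[float, int]:
    """Iterate transitions until the target is atomic (assumes an acyclic graph)."""
    steps = 0
    while state[target].deps:
        state = transition(state)
        steps += 1
    return state[target].solve({}), steps


# Toy question: what is (3 + 4) * (10 - 2) - 6?
state: State = {
    "a": SubQuestion((), lambda env: 3 + 4),
    "b": SubQuestion((), lambda env: 10 - 2),
    "c": SubQuestion(("a", "b"), lambda env: env["a"] * env["b"]),
    "d": SubQuestion(("c",), lambda env: env["c"] - 6),
}
answer, steps = solve_markov(state, "d")
```

In this toy run the first transition solves `a` and `b` and leaves a two-node state (`c`, now independent, and `d`); the second leaves only `d`, which is directly solvable. Folding answers into the remaining subquestions via closures is one design choice among several; an LLM-based version would instead rewrite the question text so that it is self-contained.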
