Atom of Thoughts for Markov LLM Test-Time Scaling
February 17, 2025
Authors: Fengwei Teng, Zhaoyang Yu, Quan Shi, Jiayi Zhang, Chenglin Wu, Yuyu Luo
cs.AI
Abstract
Large Language Models (LLMs) achieve superior performance through
training-time scaling, and test-time scaling further enhances their
capabilities by conducting effective reasoning during inference. However, as
the scale of reasoning increases, existing test-time scaling methods suffer
from accumulated historical information, which not only wastes computational
resources but also interferes with effective reasoning. To address this issue,
we observe that complex reasoning progress is often achieved by solving a
sequence of independent subquestions, each being self-contained and verifiable.
These subquestions are essentially atomic questions, relying primarily on their
current state rather than accumulated history, similar to the memoryless
transitions in a Markov process. Based on this observation, we propose Atom of
Thoughts (AoT), where each state transition in the reasoning process consists
of decomposing the current question into a dependency-based directed acyclic
graph and contracting its subquestions, forming a new atomic question state.
This iterative decomposition-contraction process continues until reaching
directly solvable atomic questions, naturally realizing Markov transitions
between question states. Furthermore, these atomic questions can be seamlessly
integrated into existing test-time scaling methods, enabling AoT to serve as a
plug-in enhancement for improving reasoning capabilities. Experiments across
six benchmarks demonstrate the effectiveness of AoT both as a standalone
framework and a plug-in enhancement. Notably, on HotpotQA, when applied to
gpt-4o-mini, AoT achieves an 80.6% F1 score, surpassing o3-mini by 3.4% and
DeepSeek-R1 by 10.6%. The code will be available at
https://github.com/qixucen/atom.
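The iterative decomposition-contraction process the abstract describes can be sketched as a simple loop: each state is a question, the question is decomposed into a dependency DAG of subquestions, and the subquestions are contracted into a new atomic question, with no history carried between states (the Markov property). This is a minimal illustrative sketch, not the authors' implementation: the helper names (`decompose`, `contract`, `is_atomic`, `aot_loop`) are hypothetical, and the string-splitting stubs stand in for the LLM calls AoT would actually use.

```python
# Hypothetical sketch of AoT's decomposition-contraction loop.
# In the real method, decompose/contract would be LLM calls; here they
# are toy stubs so the control flow is runnable end to end.

def decompose(question):
    """Split a question into subquestions plus dependency edges (a DAG).
    Stub: treat ' and ' as a separator; subquestions are independent,
    so the edge list is empty."""
    parts = [p.strip() for p in question.split(" and ")]
    edges = []  # (i, j) would mean subquestion j depends on subquestion i
    return parts, edges

def contract(parts, edges):
    """Fold solved subquestions back into a simpler atomic question state.
    Stub: drop the first subquestion, as if it were already answered."""
    return " and ".join(parts[1:])

def is_atomic(question):
    """A question is atomic when it no longer decomposes further."""
    parts, _ = decompose(question)
    return len(parts) == 1

def aot_loop(question, max_steps=10):
    """Markov-style transitions between question states: each new state
    depends only on the current question, not on accumulated history."""
    states = [question]
    for _ in range(max_steps):
        if is_atomic(question):
            break
        parts, edges = decompose(question)
        question = contract(parts, edges)
        states.append(question)
    return states

states = aot_loop("Who directed film X and when was that director born")
print(states[-1])  # the final, directly solvable atomic question
```

Because each contraction produces a self-contained question, the final state can be handed to any existing test-time scaling method, which is what lets AoT act as a plug-in enhancement.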