DOTS: Learning to Reason Dynamically in LLMs via Optimal Reasoning Trajectories Search
October 4, 2024
作者: Murong Yue, Wenlin Yao, Haitao Mi, Dian Yu, Ziyu Yao, Dong Yu
cs.AI
Abstract
Enhancing the capability of large language models (LLMs) in reasoning has
gained significant attention in recent years. Previous studies have
demonstrated the effectiveness of various prompting strategies in aiding LLMs
in reasoning (called "reasoning actions"), such as step-by-step thinking,
reflecting before answering, solving with programs, and their combinations.
However, these approaches often apply static, predefined reasoning actions
uniformly to all questions, without considering the specific characteristics of
each question or the capability of the task-solving LLM. In this paper, we
propose DOTS, an approach enabling LLMs to reason dynamically via optimal
reasoning trajectory search, tailored to the specific characteristics of each
question and the inherent capability of the task-solving LLM. Our approach
involves three key steps: i) defining atomic reasoning action modules that can
be composed into various reasoning action trajectories; ii) searching for the
optimal action trajectory for each training question through iterative
exploration and evaluation for the specific task-solving LLM; and iii) using
the collected optimal trajectories to train an LLM to plan for the reasoning
trajectories of unseen questions. In particular, we propose two learning
paradigms, i.e., fine-tuning an external LLM as a planner to guide the
task-solving LLM, or directly fine-tuning the task-solving LLM with an
internalized capability for reasoning action planning. Our experiments across
eight reasoning tasks show that our method consistently outperforms static
reasoning techniques and the vanilla instruction tuning approach. Further
analysis reveals that our method enables LLMs to adjust their computation based
on problem complexity, allocating deeper thinking and reasoning to harder
problems.
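To make the three-step pipeline concrete, below is a minimal Python sketch of the search-and-collect stage. It is illustrative only: the action module names, the layered grouping, and the brute-force enumeration with repeated sampling are assumptions standing in for the paper's atomic action modules and its iterative exploration-and-evaluation search, and `toy_llm` is a stub for an actual task-solving LLM.

```python
import itertools
import random

# Hypothetical atomic reasoning action modules grouped into layers.
# The concrete names are illustrative, not taken from the paper.
ACTION_LAYERS = {
    "analysis": ["none", "rewrite_query", "decompose"],
    "solution": ["chain_of_thought", "program_of_thought"],
    "verification": ["none", "self_verify"],
}

def solve(question, trajectory, llm):
    """Ask the task-solving LLM to follow a given action trajectory."""
    prompt = (
        f"Question: {question}\n"
        f"Follow these reasoning actions in order: {', '.join(trajectory)}\n"
        "Answer:"
    )
    return llm(prompt)

def search_optimal_trajectory(question, answer, llm, n_samples=3):
    """Evaluate candidate trajectories and keep the one with the highest
    success rate for this question and this task-solving LLM."""
    best, best_score = None, -1.0
    for combo in itertools.product(*ACTION_LAYERS.values()):
        trajectory = [a for a in combo if a != "none"]
        hits = sum(solve(question, trajectory, llm) == answer
                   for _ in range(n_samples))
        score = hits / n_samples
        if score > best_score:
            best, best_score = trajectory, score
    return best, best_score

def collect_planner_data(train_set, llm):
    """Build (question -> optimal trajectory) pairs, later used to fine-tune
    either an external planner LLM or the task-solving LLM itself."""
    data = []
    for question, answer in train_set:
        trajectory, score = search_optimal_trajectory(question, answer, llm)
        if score > 0:  # keep only questions solvable under some trajectory
            data.append({"question": question, "trajectory": trajectory})
    return data

if __name__ == "__main__":
    # Toy stand-in for a task-solving LLM, just to make the sketch runnable.
    def toy_llm(prompt):
        return random.choice(["42", "7"])

    train = [("What is 6 * 7?", "42")]
    print(collect_planner_data(train, toy_llm))
```

In the external-planner paradigm, the collected pairs would be used to fine-tune a separate LLM that maps a question to a trajectory before the task-solving LLM executes it; in the internalized paradigm, the same pairs would be folded into the task-solving LLM's own fine-tuning data so it plans and executes in one pass.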