DOTS: Learning to Reason Dynamically in LLMs via Optimal Reasoning Trajectories Search

October 4, 2024
Authors: Murong Yue, Wenlin Yao, Haitao Mi, Dian Yu, Ziyu Yao, Dong Yu
cs.AI

Abstract

Enhancing the reasoning capability of large language models (LLMs) has gained significant attention in recent years. Previous studies have demonstrated the effectiveness of various prompting strategies (called "reasoning actions") in aiding LLMs in reasoning, such as step-by-step thinking, reflecting before answering, solving with programs, and their combinations. However, these approaches often apply static, predefined reasoning actions uniformly to all questions, without considering the specific characteristics of each question or the capability of the task-solving LLM. In this paper, we propose DOTS, an approach that enables LLMs to reason dynamically via optimal reasoning trajectory search, tailored to the specific characteristics of each question and the inherent capability of the task-solving LLM. Our approach involves three key steps: i) defining atomic reasoning action modules that can be composed into various reasoning action trajectories; ii) searching for the optimal action trajectory for each training question through iterative exploration and evaluation with the specific task-solving LLM; and iii) using the collected optimal trajectories to train an LLM to plan the reasoning trajectories of unseen questions. In particular, we propose two learning paradigms: fine-tuning an external LLM as a planner to guide the task-solving LLM, or directly fine-tuning the task-solving LLM with an internalized capability for reasoning action planning. Our experiments across eight reasoning tasks show that our method consistently outperforms static reasoning techniques and the vanilla instruction-tuning approach. Further analysis reveals that our method enables LLMs to adjust their computation based on problem complexity, allocating deeper thinking and reasoning to harder problems.
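To make step ii concrete, the sketch below illustrates one way such a trajectory search could look. It is a minimal sketch, not the paper's implementation: `solve(question, trajectory)` and `is_correct(prediction, answer)` are hypothetical wrappers around the task-solving LLM and an answer checker, the action names are illustrative stand-ins for the paper's atomic modules, and exhaustive enumeration replaces the paper's iterative exploration-and-evaluation procedure for brevity.

```python
# Minimal sketch of optimal-trajectory search (step ii), under the
# assumptions stated above. Action names are illustrative, not the
# paper's exact module set.
from itertools import product

# i) Atomic reasoning action modules, composable into trajectories.
ANALYSIS_ACTIONS = ["none", "query_rewriting", "decomposition"]
SOLUTION_ACTIONS = ["chain_of_thought", "program_of_thought"]
VERIFICATION_ACTIONS = ["none", "self_verification"]

def search_optimal_trajectory(question, answer, solve, is_correct, n_samples=3):
    """Return the action trajectory with the highest empirical accuracy
    for the given task-solving LLM on this training question."""
    best_traj, best_score = None, -1.0
    for traj in product(ANALYSIS_ACTIONS, SOLUTION_ACTIONS, VERIFICATION_ACTIONS):
        # Score each candidate trajectory by repeated sampling from the solver.
        score = sum(
            is_correct(solve(question, traj), answer) for _ in range(n_samples)
        ) / n_samples
        if score > best_score:
            best_traj, best_score = traj, score
    return best_traj, best_score
```

Pairs of (question, best trajectory) collected this way would then serve as the supervision for step iii, i.e., fine-tuning either an external planner LLM or the task-solving LLM itself to predict trajectories for unseen questions.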
