DOTS: Learning to Reason Dynamically in LLMs via Optimal Reasoning Trajectories Search

Abstract

Enhancing the reasoning capability of large language models (LLMs) has gained significant attention in recent years. Previous studies have demonstrated that various prompting strategies (called "reasoning actions") help LLMs reason, such as step-by-step thinking, reflecting before answering, solving with programs, and combinations of these. However, these approaches often apply static, predefined reasoning actions uniformly to all questions, without considering the specific characteristics of each question or the capability of the task-solving LLM. In this paper, we propose DOTS, an approach enabling LLMs to reason dynamically via optimal reasoning trajectory search, tailored to the specific characteristics of each question and the inherent capability of the task-solving LLM. Our approach involves three key steps: i) defining atomic reasoning action modules that can be composed into various reasoning action trajectories; ii) searching for the optimal action trajectory for each training question, for the specific task-solving LLM, through iterative exploration and evaluation; and iii) using the collected optimal trajectories to train an LLM to plan the reasoning trajectories of unseen questions. In particular, we propose two learning paradigms: fine-tuning an external LLM as a planner to guide the task-solving LLM, or directly fine-tuning the task-solving LLM so that it internalizes the ability to plan reasoning actions. Our experiments across eight reasoning tasks show that our method consistently outperforms static reasoning techniques and the vanilla instruction tuning approach. Further analysis reveals that our method enables LLMs to adjust their computation based on problem complexity, allocating deeper thinking and reasoning to harder problems.
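The three steps in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The action names, the `toy_solve` stand-in for the task-solving LLM, the `rounds`/`samples`/`keep` parameters, and the tie-break that prefers shorter trajectories are all assumptions made for illustration. The sketch only shows the general shape: compose atomic actions into trajectories, score each trajectory by repeated evaluation, and keep the best one as a training target for a planner.

```python
import itertools
from typing import Callable

# Hypothetical atomic actions, named after strategies listed in the abstract.
ACTIONS = ("reflect", "step_by_step", "program")

Trajectory = tuple[str, ...]
Solver = Callable[[str, Trajectory], bool]  # True if the answer was correct


def candidate_trajectories(actions=ACTIONS, max_len=2):
    """Step i: compose atomic actions into ordered, non-repeating trajectories."""
    found = []
    for length in range(1, max_len + 1):
        found.extend(itertools.permutations(actions, length))
    return found


def search_best_trajectory(question, solve, trajectories,
                           rounds=2, samples=4, keep=3):
    """Step ii: explore and evaluate iteratively; return the top trajectory."""
    pool = list(trajectories)
    tally = {}  # trajectory -> (successes, attempts)
    for _ in range(rounds):
        for traj in pool:
            wins = sum(solve(question, traj) for _ in range(samples))
            prev_wins, prev_tries = tally.get(traj, (0, 0))
            tally[traj] = (prev_wins + wins, prev_tries + samples)
        # Rank by success rate; on ties prefer shorter trajectories
        # (an assumption of this sketch, standing in for lower cost).
        pool.sort(key=lambda t: (-tally[t][0] / tally[t][1], len(t)))
        pool = pool[:keep]
    return pool[0]


def make_training_example(question, trajectory):
    """Step iii: pair a question with its best plan as planner training data."""
    return {"question": question, "plan": " -> ".join(trajectory)}


def toy_solve(question, trajectory):
    """Stand-in for the task-solving LLM: succeeds only when code is used."""
    return "program" in trajectory


best = search_best_trajectory("What is 17 * 23?", toy_solve,
                              candidate_trajectories())
example = make_training_example("What is 17 * 23?", best)
```

In a real setting `toy_solve` would be replaced by a call to the task-solving LLM followed by an answer check, and the collected examples would be used to fine-tune either an external planner or the task-solving LLM itself, matching the two learning paradigms the abstract describes.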
