ControlLLM: Augment Language Models with Tools by Searching on Graphs

Abstract

We present ControlLLM, a novel framework that enables large language models (LLMs) to use multi-modal tools to solve complex real-world tasks. Despite their remarkable performance, LLMs still struggle with tool invocation because of ambiguous user prompts, inaccurate tool selection and parameterization, and inefficient tool scheduling. To overcome these challenges, our framework comprises three key components: (1) a task decomposer that breaks a complex task into clear subtasks with well-defined inputs and outputs; (2) a Thoughts-on-Graph (ToG) paradigm that searches for the optimal solution path on a pre-built tool graph, which specifies the parameter and dependency relations among different tools; and (3) an execution engine with a rich toolbox that interprets the solution path and runs the tools efficiently on different computational devices. We evaluate our framework on diverse tasks involving image, audio, and video processing, demonstrating its superior accuracy, efficiency, and versatility compared to existing methods.
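The idea of searching a pre-built tool graph for a solution path can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the search algorithm, so the code below assumes a plain depth-first enumeration and treats the shortest path as "optimal". The tool names, the resource types, and the helpers `search_paths` and `best_path` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    """A node in a toy tool graph (hypothetical structure, not from the paper)."""
    name: str
    inputs: frozenset  # resource types the tool requires
    output: str        # resource type the tool produces


# Hypothetical toolbox; the dependency edges are implied by matching
# one tool's output type to another tool's input types.
TOOLS = [
    Tool("image_captioning", frozenset({"image"}), "text"),
    Tool("text_to_speech", frozenset({"text"}), "audio"),
    Tool("object_detection", frozenset({"image"}), "boxes"),
    Tool("draw_boxes", frozenset({"image", "boxes"}), "image_annotated"),
]


def search_paths(available, target, tools=TOOLS, max_depth=4):
    """Enumerate tool sequences that turn `available` resources into `target`.

    Assumption: exhaustive depth-first search with a depth limit.
    """
    solutions = []

    def dfs(have, path):
        if target in have:
            solutions.append([t.name for t in path])
            return
        if len(path) >= max_depth:
            return
        for tool in tools:
            # A tool is applicable when all its inputs are available
            # and it would produce a resource we do not have yet.
            if tool.inputs <= have and tool.output not in have:
                dfs(have | {tool.output}, path + [tool])

    dfs(frozenset(available), [])
    return solutions


def best_path(available, target):
    """Pick the shortest candidate path (an assumed notion of 'optimal')."""
    paths = search_paths(available, target)
    return min(paths, key=len) if paths else None
```

For example, `best_path({"image"}, "audio")` chains the hypothetical captioning and text-to-speech tools, while a request with no applicable tool returns `None`. A real system would also need to assign concrete parameters to each tool and rank candidate paths by more than length.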
