DiaTool-DPO: Multi-Turn Direct Preference Optimization for Tool-Augmented Large Language Models

Abstract

Tool-Augmented Large Language Models (TA-LLMs) have shown promise in real-world applications, but they struggle with incomplete queries and out-of-scope requests. While existing approaches rely mainly on Supervised Fine-Tuning with expert trajectories, we propose DiaTool-DPO, a novel method that enhances the dialogue capabilities of TA-LLMs through Direct Preference Optimization (DPO). We model TA-LLM interactions as a Markov Decision Process with 5 distinct dialogue states and categorize user queries into 3 types based on their state transition trajectories. We automatically construct paired trajectory datasets of correct and incorrect dialogue flows and introduce a specialized objective loss for dialogue control. Our comprehensive evaluation demonstrates that DiaTool-DPO approaches GPT-4o's performance (94.8% in information gathering, 91% in tool call rejection) with substantial improvements over baseline (44% and 9.6% respectively) while maintaining core functionality. Our approach opens new possibilities for developing TA-LLMs that can handle diverse real-world scenarios without requiring additional expert demonstrations or human labeling.
