DiaTool-DPO: Multi-Turn Direct Preference Optimization for Tool-Augmented Large Language Models

Abstract

Tool-Augmented Large Language Models (TA-LLMs) have shown promise in real-world applications, but they struggle with incomplete queries and out-of-scope requests. While existing approaches rely mainly on Supervised Fine-Tuning with expert trajectories, we propose DiaTool-DPO, a novel method that enhances the dialogue capabilities of TA-LLMs through Direct Preference Optimization. We model TA-LLM interactions as a Markov Decision Process with 5 distinct dialogue states and categorize user queries into 3 types based on their state transition trajectories. We automatically construct paired trajectory datasets of correct and incorrect dialogue flows and introduce a specialized objective loss for dialogue control. Our comprehensive evaluation demonstrates that DiaTool-DPO approaches GPT-4o's performance (94.8% in information gathering, 91% in tool call rejection), with substantial improvements over the baseline (44% and 9.6%, respectively), while maintaining core functionality. Our approach opens new possibilities for developing TA-LLMs that can handle diverse real-world scenarios without requiring additional expert demonstrations or human labeling.
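The abstract does not spell out the specialized objective, so the sketch below shows only the standard DPO loss applied to one pair of multi-turn trajectories (a correct dialogue flow as "chosen", an incorrect one as "rejected"). It assumes that a trajectory's log-probability is the sum of its per-turn log-probabilities; the function names, the argument layout, and the default `beta` are illustrative and are not taken from the paper.

```python
import math
from typing import Sequence


def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_trajectory_loss(
    policy_chosen: Sequence[float],
    policy_rejected: Sequence[float],
    ref_chosen: Sequence[float],
    ref_rejected: Sequence[float],
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) trajectory pair.

    Each argument holds per-turn log-probabilities of the assistant turns
    under the policy or the frozen reference model. Summing over turns is
    an assumption made for this sketch.
    """
    # Implicit reward margin of each trajectory: log pi_theta - log pi_ref.
    chosen_margin = sum(policy_chosen) - sum(ref_chosen)
    rejected_margin = sum(policy_rejected) - sum(ref_rejected)

    # -log(sigmoid(z)) equals softplus(-z).
    z = beta * (chosen_margin - rejected_margin)
    return _softplus(-z)
```

When the policy equals the reference model the loss is log 2; it falls below that value as the policy assigns relatively more probability to the correct dialogue flow than the reference does, and rises above it in the opposite case.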
