DeepTravel: An End-to-End Agentic Reinforcement Learning Framework for Autonomous Travel Planning Agents

Abstract

Travel planning (TP) agents have recently emerged as a building block that interacts with external tools and resources to generate travel itineraries and improve the user experience. However, existing studies rely on hand-crafted prompts and fixed agent workflows, which hinders the development of more flexible and autonomous TP agents. This paper proposes DeepTravel, an end-to-end agentic reinforcement learning framework for building autonomous travel planning agents that can plan, execute tools, and reflect on tool responses on their own in order to explore, verify, and refine intermediate actions during multi-step reasoning. To achieve this, we first construct a robust sandbox environment by caching transportation, accommodation, and point-of-interest (POI) data, which allows TP agents to be trained without being constrained by the limitations of real-world APIs (e.g., inconsistent outputs). Moreover, we develop a hierarchical reward modeling system in which a trajectory-level verifier first checks spatiotemporal feasibility and filters out unsatisfactory itineraries, and a turn-level verifier then validates that itinerary details are consistent with tool responses, providing an efficient and precise reward service. Finally, we propose a reply-augmented reinforcement learning method that lets the TP agent periodically replay from a failure experience buffer, from which notable agentic capability emerges. We deploy the trained TP agent in the DiDi Enterprise Solutions App and conduct comprehensive online and offline evaluations, demonstrating that DeepTravel enables small LLMs (e.g., Qwen3 32B) to significantly outperform existing frontier LLMs such as OpenAI o1, o3, and DeepSeek R1 on travel planning tasks.
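The two-stage reward check described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Turn` record, the verifier signatures, and the binary 0/1 reward are all assumptions made for illustration, since the abstract specifies none of them. The sketch only shows the gating order, in which the coarse trajectory-level check runs first and the finer turn-level check runs only for trajectories that pass it.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    """One agent turn (hypothetical structure, not from the paper)."""
    tool_response: str   # text returned by a cached tool call
    itinerary_text: str  # itinerary detail the agent wrote in this turn


# Hypothetical verifier signatures; the abstract does not define them.
TrajectoryVerifier = Callable[[List[Turn]], bool]
TurnVerifier = Callable[[Turn], bool]


def hierarchical_reward(
    turns: List[Turn],
    trajectory_ok: TrajectoryVerifier,
    turn_ok: TurnVerifier,
) -> float:
    """Return an assumed binary reward using two-stage gating."""
    # Stage 1: coarse spatiotemporal feasibility check on the whole trajectory.
    # Failing trajectories are filtered here, so stage 2 never runs for them.
    if not trajectory_ok(turns):
        return 0.0
    # Stage 2: per-turn consistency between itinerary details and tool output.
    return 1.0 if all(turn_ok(t) for t in turns) else 0.0
```

Running the inexpensive trajectory-level filter first means the per-turn verifier, which is presumably the costlier of the two, is invoked only on itineraries that are already feasible as a whole.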
