DeepTravel: An End-to-End Agentic Reinforcement Learning Framework for Autonomous Travel Planning Agents
September 26, 2025
Authors: Yansong Ning, Rui Liu, Jun Wang, Kai Chen, Wei Li, Jun Fang, Kan Zheng, Naiqiang Tan, Hao Liu
cs.AI
Abstract
Travel planning (TP) agents have recently emerged as a building block that
interacts with external tools and resources to generate travel itineraries,
ensuring an enjoyable user experience. Despite their benefits, existing studies
rely on hand-crafted prompts and fixed agent workflows, which hinders more
flexible and autonomous TP agents. This paper proposes DeepTravel, an
end-to-end agentic reinforcement learning framework for building autonomous
travel planning agents that can plan autonomously, execute tools, and reflect
on tool responses to explore, verify, and refine intermediate actions in
multi-step reasoning. To achieve this, we first construct a robust sandbox
environment by caching transportation, accommodation, and point-of-interest
(POI) data, so that TP agent training is not constrained by the limitations of
real-world APIs (e.g., inconsistent outputs). Moreover, we develop a
hierarchical reward modeling system, in which a trajectory-level verifier first
checks spatiotemporal feasibility and filters out unsatisfactory travel
itineraries, and a turn-level verifier then validates the consistency of
itinerary details with tool responses, enabling an efficient and precise reward
service. Finally, we propose a replay-augmented reinforcement learning method
that enables the TP agent to periodically replay from a failure experience
buffer, eliciting notable agentic capacity. We deploy the trained TP agent on
the DiDi Enterprise Solutions App and conduct comprehensive online and offline
evaluations, demonstrating that DeepTravel enables small LLMs (e.g.,
Qwen3-32B) to significantly outperform existing frontier LLMs such as OpenAI
o1, o3, and DeepSeek-R1 in travel planning tasks.
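
The abstract does not give implementation details, so the following Python sketch only illustrates two of the ideas it names: a coarse-to-fine reward in which the trajectory-level check gates the turn-level check, and a buffer of failed queries that is periodically mixed back into training batches. Every name here (hierarchical_reward, FailureBuffer, build_batch, replay_every, replay_k), the binary 0/1 reward, and the fixed replay schedule are assumptions made for illustration and are not taken from the paper.

```python
import random
from collections import deque
from typing import Callable, Deque, List

# Hypothetical verifier interface: takes an itinerary and returns True on pass.
Verifier = Callable[[dict], bool]


def hierarchical_reward(itinerary: dict,
                        trajectory_check: Verifier,
                        turn_check: Verifier) -> float:
    """Run the coarse check first; run the finer check only if it passes."""
    if not trajectory_check(itinerary):
        return 0.0  # filtered early, so the turn-level verifier is never called
    # Assumed binary reward; the paper may score itineraries differently.
    return 1.0 if turn_check(itinerary) else 0.0


class FailureBuffer:
    """Bounded store of queries whose rollouts failed (oldest entries dropped)."""

    def __init__(self, capacity: int = 1000) -> None:
        self._items: Deque[str] = deque(maxlen=capacity)

    def add(self, query: str) -> None:
        self._items.append(query)

    def sample(self, k: int, rng: random.Random) -> List[str]:
        # Never ask for more items than the buffer currently holds.
        k = min(k, len(self._items))
        return rng.sample(list(self._items), k)

    def __len__(self) -> int:
        return len(self._items)


def build_batch(fresh: List[str], buffer: FailureBuffer, step: int,
                replay_every: int, replay_k: int,
                rng: random.Random) -> List[str]:
    """Return the fresh queries, plus replayed failures on replay steps."""
    batch = list(fresh)
    if step % replay_every == 0:
        batch.extend(buffer.sample(replay_k, rng))
    return batch
```

Gating the expensive turn-level check behind the cheaper trajectory-level one is what would make such a reward service efficient: itineraries that are already infeasible in space or time never reach the detailed consistency check.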